Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
LangExtract is an open-source Python library from Google that uses large language models to extract structured information from unstructured text. Released in February 2026 and now at v1.2.1, it has accumulated 35,900 stars on GitHub, a sign of strong developer demand for reliable information-extraction tooling.

The library's standout feature is precise source grounding: every extraction is mapped back to its exact location in the source document, which enables interactive HTML visualizations that highlight where each piece of data came from. This makes LangExtract particularly valuable in high-stakes domains such as healthcare, legal work, and scientific research, where verifiability is critical.

To handle large documents, LangExtract chunks the text, processes the chunks in parallel, and can run multiple extraction passes for higher recall, which addresses the classic "needle in a haystack" problem of finding information in lengthy texts. Output structure is defined by a schema and few-shot examples, and in supported models such as Gemini, Controlled Generation is used to keep results consistent and schema-compliant.

LangExtract supports LLM backends beyond Gemini, including OpenAI models and local models via Ollama, which makes it provider-agnostic. Demonstrated real-world applications include clinical information extraction from medical notes, radiology report structuring, medication extraction with dosage and relationship mapping, and full-text literary analysis. It is installable from PyPI with `pip install langextract`.
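To make the ideas of source grounding and chunked parallel processing concrete, here is a minimal standard-library sketch. It does not use LangExtract's API and is not how the library is implemented: the names `GroundedSpan`, `chunk_text`, `find_terms`, and `extract_grounded` are hypothetical, and a simple vocabulary lookup stands in for the LLM call. It only illustrates how per-chunk results can be mapped back to exact character offsets in the full document.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedSpan:
    """Hypothetical record: an extraction plus its position in the source."""
    label: str   # extraction class, e.g. "medication"
    text: str    # exact text as it appears in the source
    start: int   # character offset into the full document (inclusive)
    end: int     # character offset into the full document (exclusive)


def chunk_text(text, size, overlap):
    """Split text into (offset, chunk) pairs; neighbours share `overlap` chars."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    stop = max(len(text) - overlap, 1)
    return [(i, text[i:i + size]) for i in range(0, stop, step)]


def find_terms(chunk, vocabulary):
    """Stand-in for a model call: return (label, start, end) per vocabulary hit.

    Offsets are relative to the chunk, not to the full document.
    """
    hits = []
    for term, label in vocabulary.items():
        for match in re.finditer(re.escape(term), chunk, re.IGNORECASE):
            hits.append((label, match.start(), match.end()))
    return hits


def extract_grounded(text, vocabulary, size=40, overlap=15, workers=4):
    """Process chunks in parallel, then ground each hit in the full text.

    `overlap` should be at least as long as the longest term, so a term that
    straddles a chunk boundary still appears whole in one chunk.
    """
    chunks = chunk_text(text, size, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda c: find_terms(c[1], vocabulary), chunks))

    spans = set()  # a set removes duplicates found in overlapping regions
    for (offset, _), hits in zip(chunks, per_chunk):
        for label, start, end in hits:
            doc_start, doc_end = offset + start, offset + end
            spans.add(GroundedSpan(label, text[doc_start:doc_end], doc_start, doc_end))
    return sorted(spans, key=lambda s: (s.start, s.end))


note = (
    "Patient was given 250 mg amoxicillin twice daily. "
    "Ibuprofen 400 mg was added for pain."
)
vocabulary = {"amoxicillin": "medication", "ibuprofen": "medication"}
spans = extract_grounded(note, vocabulary)

for span in spans:
    # Grounding check: the stored offsets reproduce the extracted text exactly.
    assert note[span.start:span.end] == span.text
    print(span)
```

Because each span carries document-level offsets, a viewer can highlight the exact source passage behind every extraction. The sketch does not model repeated extraction passes; with a real model, results from several passes would be merged in a similar deduplication step.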