Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
LangExtract is an open-source Python library from Google that uses large language models to extract structured information from unstructured text, with precise source grounding and interactive visualization. With 32.6k GitHub stars and an Apache 2.0 license, it has become a go-to framework for developers who need reliable, traceable data extraction from documents at scale.

## The Problem with Unstructured Data

Organizations sit on vast amounts of unstructured text: clinical notes, financial reports, legal contracts, research papers, customer feedback. Extracting actionable information from these documents has traditionally required either manual review or custom NLP pipelines that are expensive to build and fragile to maintain.

LLMs changed the equation by making it possible to extract information with natural-language instructions rather than hand-coded rules. But raw LLM extraction has a critical weakness: there is no guarantee that the extracted data actually appears in the source document. Hallucinated facts can slip into extraction results undetected.

LangExtract solves this by enforcing source grounding. Every extracted piece of information is mapped back to its exact location in the source text, making verification straightforward and hallucination detectable.

## Core Architecture

### Source Grounding

The defining feature of LangExtract is its source grounding capability. When the library extracts an entity, attribute, or relationship from a document, it records the precise text span that supports each extraction. This enables visual highlighting in the interactive viewer: users can click on any extracted item and immediately see the source passage that generated it.

This traceability turns LLM extraction from a black-box process into an auditable one. Compliance teams can verify that contract clauses were correctly identified. Clinicians can confirm that medication dosages match the original notes.
Researchers can validate that cited claims correspond to actual paper content.

### Controlled Generation

LangExtract enforces consistent output schemas using controlled generation on compatible models such as Google Gemini. Rather than hoping the LLM returns well-structured JSON, the library constrains the generation process to produce outputs that match user-defined schemas. This eliminates the formatting inconsistencies that plague raw LLM extraction.

### Long Document Optimization

Handling documents that exceed LLM context windows is a well-known challenge. LangExtract addresses it with a chunking strategy: documents are split into manageable segments, processed in parallel across configurable worker threads, and covered by multiple extraction passes over smaller, focused contexts. This approach tackles the needle-in-a-haystack problem, where critical information may be buried deep within a long document.

## Supported LLM Backends

LangExtract supports multiple LLM providers:

- **Google Gemini** models are the primary supported backend, with gemini-2.5-flash recommended for most use cases and gemini-2.5-pro available for tasks requiring higher accuracy.
- **OpenAI GPT-4o** is supported through an optional dependency.
- **Local models** can be run through Ollama, for example gemma2:2b for on-device processing.
- **Vertex AI** serves enterprise deployments, with service-account authentication and batch processing capabilities.

## Interactive Visualization

One of LangExtract's most practical features is its interactive HTML visualization. After extraction, the library generates self-contained HTML files that display the original text alongside the extracted entities. Users can browse thousands of extracted items within their original context, making quality review efficient even at scale. The visualization is entirely self-contained, requiring no server or additional software.
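The core idea behind grounding plus a self-contained review page can be illustrated with a few lines of dependency-free Python. This is a conceptual sketch only, not LangExtract's actual API or renderer (the function name and output layout here are hypothetical): each extracted string is located as a character span in the source, grounded spans are wrapped in `<mark>` tags, and anything that cannot be located is listed separately as a possible hallucination.

```python
import html

def ground_and_highlight(source: str, extractions: list[str]) -> str:
    """Locate each extracted string in the source text and render a
    self-contained HTML snippet with grounded spans wrapped in <mark>.
    Extractions that cannot be found are reported separately, since
    they may be hallucinations. Hypothetical helper, not LangExtract."""
    spans = []
    for item in extractions:
        start = source.find(item)
        # A span of None marks an ungrounded (possibly hallucinated) item.
        spans.append((item, None if start == -1 else (start, start + len(item))))

    # Build highlighted HTML by walking the source once, left to right.
    grounded = sorted((s for _, s in spans if s is not None), key=lambda p: p[0])
    out, cursor = [], 0
    for start, end in grounded:
        if start < cursor:  # skip spans overlapping an already-marked region
            continue
        out.append(html.escape(source[cursor:start]))
        out.append("<mark>" + html.escape(source[start:end]) + "</mark>")
        cursor = end
    out.append(html.escape(source[cursor:]))

    ungrounded = "".join(
        f"<li>{html.escape(item)}</li>" for item, s in spans if s is None
    )
    return ("<html><body><p>" + "".join(out) + "</p>"
            + ("<h4>Ungrounded</h4><ul>" + ungrounded + "</ul>" if ungrounded else "")
            + "</body></html>")

note = "Patient takes lisinopril 10 mg daily for hypertension."
page = ground_and_highlight(note, ["lisinopril", "10 mg", "aspirin"])
```

Writing `page` to disk yields a single file that opens in any browser, which is the same property that makes the library's real visualization easy to circulate.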
A single HTML file can be shared with reviewers, auditors, or collaborators who need to verify extraction quality.

## Use Cases in Practice

### Clinical and Medical

Extracting medications with dosages, routes, and clinical attributes from free-text clinical notes is one of LangExtract's showcase applications. Source grounding is particularly valuable in healthcare, where traceability is a regulatory requirement rather than an option.

### Radiology Reports

The RadExtract demo on HuggingFace demonstrates LangExtract's ability to structure unstructured imaging findings, converting narrative radiology reports into queryable structured data.

### Literature Analysis

Character and relationship extraction from full novels demonstrates the library's ability to handle very long documents. The Romeo and Juliet example processes the entire play to identify characters, their attributes, and their relationships.

### Financial and Legal

Risk extraction from financial audits and clause identification in legal contracts benefit from the combination of structured output and source grounding, enabling compliance verification at scale.

## Installation and Quick Start

Installation is straightforward via pip. A basic extraction can be set up in under 20 lines of Python, requiring only a prompt describing what to extract, a few examples showing the expected output format, and the input text. Docker deployment is also supported for containerized environments.

## Limitations

Extraction quality depends heavily on the quality of user-provided examples; poorly chosen or insufficient examples lead to inconsistent results. The library is optimized primarily for Gemini models, and performance with other backends may vary. Processing very large document collections can incur significant API costs, particularly with premium models. The chunking strategy may occasionally split relevant context across chunk boundaries, missing cross-paragraph relationships.
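The boundary problem can be reduced, though not eliminated, by overlapping consecutive chunks so that text near a split appears whole in at least one segment. The following is a minimal illustrative sketch of that strategy, not the library's actual chunker (function name and parameters are hypothetical):

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks whose tail is repeated at the head
    of the next chunk, so an entity straddling one boundary is still seen
    whole in at least one chunk. Hypothetical helper, not LangExtract."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by less than a full chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reaches the end of the text
    return chunks

# 2500 characters of varied text -> three chunks with 200-character overlaps.
doc = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(doc, chunk_size=1000, overlap=200)
```

Running extraction over each chunk in parallel, as the library does with its configurable worker threads, then only requires mapping the extraction call over this list and merging the results.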
## Community and Development

With 127 commits, 83 open issues, and 40 pull requests, the project shows active and sustained development. The Apache 2.0 license and Google's backing provide confidence in long-term maintenance. The strong HuggingFace presence with demo applications lowers the barrier for new users to evaluate the library.
