Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
RAGFlow is a leading open-source Retrieval-Augmented Generation engine that fuses cutting-edge RAG capabilities with agent features to create a superior context layer for large language models. With 73,400 GitHub stars and a focus on deep document understanding, it addresses the critical challenge of accurate knowledge retrieval that many RAG implementations struggle with.

## The Document Understanding Problem

Most RAG systems treat document parsing as a simple preprocessing step, applying basic text extraction and fixed-size chunking before feeding content into vector databases. This approach fails catastrophically with complex documents containing tables, images, mixed layouts, multi-column formats, and nested structures. RAGFlow was built specifically to solve this problem by making deep document understanding its core differentiator.

The engine processes documents through intelligent analysis that recognizes and preserves document structure, ensuring that tables remain as tables, lists maintain their hierarchy, and images are properly associated with their surrounding text. This structural awareness translates directly into more accurate retrieval results.

## Template-Based Chunking

RAGFlow offers a template-based chunking system that goes beyond simple text splitting. Users can choose from multiple chunking strategies optimized for different document types: academic papers, legal contracts, technical documentation, financial reports, and general-purpose content. Each template understands the typical structure of its document type and chunks accordingly, preserving semantic boundaries that naive splitting would break.

The chunking process is explainable, meaning users can inspect exactly how each document was parsed and segmented. This transparency is critical for enterprise deployments where teams need to understand and audit the retrieval pipeline.
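
To make the contrast between fixed-size splitting and structure-aware chunking concrete, here is a minimal Python sketch. It is not RAGFlow's implementation: the function names `fixed_size_chunks` and `heading_aware_chunks` are hypothetical, and the "template" is assumed to be nothing more than "start a new chunk at every Markdown-style heading".

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def heading_aware_chunks(text: str) -> list[dict]:
    """Hypothetical template: open a new chunk at each Markdown-style heading."""
    chunks = []
    current = {"heading": None, "lines": []}
    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous chunk before starting a new section.
            if current["heading"] is not None or current["lines"]:
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        elif line.strip():
            current["lines"].append(line)
    if current["heading"] is not None or current["lines"]:
        chunks.append(current)
    return [
        {"heading": c["heading"], "body": "\n".join(c["lines"])} for c in chunks
    ]


# Illustrative document (made up for this sketch).
doc = """# Terms
Payment is due in 30 days.
| Item | Price |
| A | 10 |
# Liability
Liability is capped at fees paid."""

sections = heading_aware_chunks(doc)
for section in sections:
    print(section["heading"], "->", repr(section["body"]))
```

In this sketch the table rows stay inside the "Terms" chunk, whereas `fixed_size_chunks(doc, 40)` would cut through them at an arbitrary character offset. Returning the heading alongside the body also makes the segmentation easy to inspect, which is the property the explainability paragraph describes.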
## Grounded Citations and Reduced Hallucination

One of RAGFlow's most valued features is its grounded citation system. Every generated answer includes traceable references back to the original source documents, with visual highlighting that shows exactly which passages informed the response. This citation mechanism significantly reduces hallucination by making it immediately apparent when a response lacks proper grounding in the source material.

For enterprise use cases where accuracy is non-negotiable, such as legal research, medical information retrieval, and financial analysis, this traceability provides the accountability layer that organizations require.

## Multi-Format Document Support

RAGFlow handles an extensive range of document formats including Word documents, PowerPoint slides, Excel spreadsheets, images with OCR, scanned PDFs, structured data files, and web content. Recent updates have added integration with MinerU and Docling document parsers, further expanding the range of documents that can be accurately processed.

The platform also supports data synchronization from external sources including Confluence, Amazon S3, Notion, Discord, and Google Drive, enabling teams to build knowledge bases that stay current with their existing document repositories.

## Agent Capabilities

Beyond pure retrieval, RAGFlow includes pre-built agent templates and an agentic workflow system that allows developers to create multi-step reasoning pipelines. Recent updates added memory support for agents, enabling persistent context across conversation sessions. The agent system can combine retrieval with external tool calls, calculations, and multi-hop reasoning.

## Streamlined Enterprise Deployment

RAGFlow is designed for enterprise scale from the ground up. The orchestration pipeline automatically handles document ingestion, parsing, chunking, embedding, indexing, and retrieval without requiring manual configuration for each step.
The system scales horizontally and supports both CPU and GPU inference for embedding models. The current stable release is v0.24.0, with active development continuing across features like improved Gemini 3 Pro support, enhanced data synchronization, and more sophisticated agentic workflows.

## Technical Foundation

Built primarily in Python with an Apache 2.0 license, RAGFlow leverages InfiniFlow's deep expertise in information retrieval and database systems. The engine uses a combination of sparse and dense retrieval methods, with support for hybrid search that combines keyword matching with semantic similarity for optimal recall and precision.
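
The idea of hybrid search can be illustrated with a small Python sketch. This is an assumption-laden toy, not RAGFlow's retrieval code: the sparse signal is plain term overlap (production systems typically use BM25-style scoring), the dense vectors are hand-written two-dimensional stand-ins for real embeddings, and the function names and the `alpha` weight are hypothetical.

```python
import math


def keyword_score(query: str, doc: str) -> float:
    """Sparse signal: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def cosine(a: list[float], b: list[float]) -> float:
    """Dense signal: cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank (text, vector) pairs by a weighted sum of both signals.

    `alpha` is a hypothetical weight: 1.0 means keyword only, 0.0 means
    vector similarity only.
    """
    scored = []
    for text, vec in docs:
        score = (alpha * keyword_score(query, text)
                 + (1 - alpha) * cosine(query_vec, vec))
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy corpus with made-up 2-D "embeddings".
docs = [
    ("invoice payment terms", [1.0, 0.0]),
    ("billing schedule and due dates", [0.9, 0.1]),
    ("office party photos", [0.0, 1.0]),
]
ranking = hybrid_rank("payment terms", [1.0, 0.0], docs)
for score, text in ranking:
    print(f"{score:.3f}  {text}")
```

The second document shares no words with the query, so keyword matching alone would miss it, yet its vector is close to the query vector and it still ranks above the unrelated document. That is the recall benefit the dense signal adds, while the exact-term match keeps the first document on top.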

Shubhamsaboo
Collection of 100+ production-ready LLM apps with AI agents, RAG, voice agents, and MCP using OpenAI, Anthropic, Gemini, and open-source models
Unsloth AI
Open-source LLM fine-tuning optimizer delivering 2x faster training with 70% less VRAM, supporting models from GPT-OSS to DeepSeek with zero accuracy loss.