Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
PaddleOCR is a leading open-source OCR toolkit and Document AI engine. It converts PDF documents and images into structured, LLM-ready data in JSON and Markdown, serving as a foundation for RAG and agentic applications.

## Why PaddleOCR Matters

Real-world documents are messy: scanned PDFs, tables, formulas, seals, charts, and dozens of languages mixed together. PaddleOCR turns that visual chaos into clean, structured data an LLM can actually use. With 80,000+ GitHub stars and adoption by projects such as Dify, RAGFlow, and Cherry Studio, it has become a default building block for document-centric AI pipelines.

## SOTA Document Vision-Language Model

At the core is PaddleOCR-VL-1.6, a lightweight 0.9B vision-language model purpose-built for document parsing. It reaches 96.3% accuracy on OmniDocBench v1.6 and leads in text, formula, and table recognition, with markedly stronger handling of ancient documents, rare characters, seals, and charts. All results are emitted as structured Markdown and JSON.

## Structure-Aware Document Conversion

PP-StructureV3 converts complex PDFs and images into Markdown or JSON while preserving layout. It offers finer-grained control over reading order, tables, and nested elements, so the output stays usable for retrieval-augmented generation rather than collapsing into a flat dump of text.

## Broad Language and Hardware Support

PaddleOCR supports 100+ languages and runs across CPU, GPU, XPU, and NPU hardware on Linux, Windows, and macOS. The toolkit spans the full pipeline from text detection and recognition to key information extraction and document translation, so teams can deploy on-premise without locking into a proprietary cloud API.

## A Mature, Trusted Ecosystem

Beyond raw models, PaddleOCR ships training tools, pretrained pipelines, and integrations used by thousands of downstream repositories.
Its Apache-2.0 license and active maintenance make it a dependable foundation for production document intelligence.
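
To make the retrieval point concrete, here is a minimal sketch of how structure-preserving Markdown, of the kind described above, might be split into heading-aware chunks before indexing. It uses only the Python standard library and does not call PaddleOCR. The function name `chunk_markdown_by_heading` and the sample input are hypothetical, and the sketch assumes the Markdown uses `#`-style headings.

```python
import re
from typing import Dict, List, Tuple

# Matches ATX-style Markdown headings such as "## Results".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per section, tagged with its heading path.

    Hypothetical helper for illustration only; not part of PaddleOCR.
    """
    chunks: List[Dict[str, str]] = []
    path: List[Tuple[int, str]] = []  # stack of (heading level, title)
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            section = " > ".join(title for _, title in path)
            chunks.append({"section": section, "text": text})
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        # Do not treat "#" lines inside fenced code blocks as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Report\nIntro text.\n## Results\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    for chunk in chunk_markdown_by_heading(sample):
        print(chunk["section"], "->", len(chunk["text"]), "chars")
```

Keeping the heading path with each chunk is one way to preserve the layout context that a flat text dump would lose; a table, for example, stays attached to the section it belongs to.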