Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
MinerU is an open-source document parsing engine from OpenDataLab that converts PDFs, Office files, images, and web pages into LLM-ready Markdown and JSON. Originally built during the pre-training of InternLM to wrangle scientific PDFs, it has matured into a production-grade extraction toolchain used to feed retrieval-augmented generation pipelines and agentic workflows. The project has crossed 63,000 GitHub stars and 5,300+ forks, and the recent license change from AGPLv3 to a custom Apache-2.0-derived MinerU Open Source License has removed the biggest barrier to enterprise adoption.

## What MinerU Solves

Feeding documents to LLMs sounds trivial until you try. Real-world PDFs contain multi-column layouts, rotated tables, vector math, scanned pages mixed with born-digital text, headers and footers that leak into context windows, and figures that lose all meaning without their captions. MinerU is built to handle exactly these cases. It runs layout analysis, OCR fallback, formula recognition, and table extraction in one pipeline, then emits a clean Markdown document with reading order preserved, tables in HTML, formulas in LaTeX, and figures linked with descriptive captions. The same source PDF that produces unusable noise from a naive text-extraction library becomes a structured document an LLM can actually reason over.

## Inputs, Outputs, and Coverage

MinerU accepts PDFs, DOCX, PPTX, XLSX, common image formats, and web pages. For Office formats it parses natively rather than round-tripping through a PDF conversion step, which preserves more structure. Outputs include a Markdown variant tuned for multimodal LLMs that retains image references, an NLP-oriented Markdown variant optimized for chunking and embedding, and a reading-order-sorted JSON representation with full bounding boxes and per-block metadata.
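That reading-order JSON lends itself to custom downstream processing. A minimal sketch of rendering blocks back into Markdown, assuming a simplified, hypothetical per-block schema (the real field names and block types may differ between MinerU releases):

```python
import json

# Hypothetical, simplified version of MinerU's reading-order JSON;
# real field names and block types may differ between releases.
raw = """
[
  {"type": "title",    "text": "A Survey of Document Parsing", "bbox": [72, 60, 540, 90]},
  {"type": "text",     "text": "Layout analysis precedes extraction.", "bbox": [72, 100, 540, 140]},
  {"type": "equation", "text": "E = mc^2", "bbox": [150, 150, 460, 180]},
  {"type": "table",    "text": "<table><tr><td>F1</td><td>0.93</td></tr></table>", "bbox": [72, 200, 540, 260]}
]
"""
blocks = json.loads(raw)

def block_to_markdown(block: dict) -> str:
    """Render one layout block as Markdown, keyed on its block type."""
    kind, text = block["type"], block["text"]
    if kind == "title":
        return f"# {text}"
    if kind == "equation":
        return f"$${text}$$"   # formulas arrive as LaTeX
    return text                # tables arrive as HTML, which Markdown passes through

# Blocks are already sorted in reading order, so a plain join reconstructs the page.
markdown = "\n\n".join(block_to_markdown(b) for b in blocks)
print(markdown)
```

The bounding boxes carried on each block also make it straightforward to filter headers and footers by position before the join.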
OCR is automatically triggered on scanned or garbled pages and supports 109 languages, with the project's roots in Chinese scientific literature giving it unusually strong handling of vertical text, seals, and mixed-script documents.

## Backends and Models

Three inference backends ship in the box. The Pipeline backend uses traditional layout models and runs on CPU, scoring 86.2 on OmniDocBench v1.5 with modest hardware requirements. The VLM-Engine backend serves the new MinerU2.5-Pro-2604-1.2B vision-language model through vLLM or LMDeploy, trading more GPU for higher accuracy on dense scientific layouts, image and chart recognition, and cross-page table merging. The Hybrid-Engine combines native text extraction with selective VLM passes to suppress hallucinations on text that can be read directly, which makes it the most production-friendly default for mixed corpora.

## Deployment Surface

MinerU is unusually thorough on the deployment side. A desktop client, a web UI, Python, Go, and TypeScript SDKs, a REST API, a Docker image, and an MCP server are all maintained by the core team. The MCP integration makes the parser callable from Claude, Cursor, and Windsurf without writing glue code. Native compatibility with LangChain, Dify, and FastGPT means the output drops directly into existing RAG stacks. There is also explicit support for more than ten domestic Chinese AI accelerators, including Ascend, Cambricon, and Moore Threads, alongside CUDA GPUs and Apple Silicon.

## Performance and Hardware

The Pipeline backend runs comfortably on a 16GB RAM laptop with no GPU. VLM backends need 2GB to 8GB of VRAM depending on backend choice, with Volta or newer NVIDIA GPUs or Apple Silicon recommended. Multi-threaded concurrent inference and thread-safe APIs are first-class concerns, which matters for high-throughput document ingestion pipelines.

## Limitations

A few sharp edges remain.
The custom MinerU Open Source License is friendlier than AGPLv3 but is not OSI-approved, so legal teams should still read it. The latest VLM checkpoints carry their own model-license terms separate from the code license. Very long documents with deeply nested layouts can still produce reading-order glitches that require human cleanup. Comparisons to Marker, Docling, and Unstructured.io vary by document type, so teams running MinerU at scale typically benchmark on a representative slice of their own corpus rather than trusting any single public score.
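That last point is easy to operationalize. A rough sketch of corpus-slice benchmarking, assuming you have hand-checked reference Markdown for a small sample of your own documents; the character-level similarity ratio here is a crude illustrative stand-in, not OmniDocBench's actual scoring method:

```python
from difflib import SequenceMatcher

def extraction_score(predicted: str, ground_truth: str) -> float:
    """Character-level similarity in [0, 1] between parser output and a
    hand-checked reference. A crude stand-in for real benchmark scoring."""
    return SequenceMatcher(None, predicted, ground_truth).ratio()

# Hypothetical parser outputs paired with hand-labeled references
# for a representative slice of a corpus.
samples = {
    "invoice_001": ("Total: $1,299.00", "Total: $1,299.00"),
    "paper_017":   ("## 3 Method\nWe train on...", "## 3 Methods\nWe train on..."),
    "scan_042":    ("garbled ~~ output", "Quarterly revenue rose 4%."),
}

scores = {doc: extraction_score(pred, truth) for doc, (pred, truth) in samples.items()}
mean_score = sum(scores.values()) / len(scores)

# Worst documents first: these are the ones worth inspecting by hand.
for doc, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{doc}: {s:.2f}")
print(f"mean: {mean_score:.2f}")
```

Running the same slice through each candidate parser (MinerU, Marker, Docling, Unstructured.io) and comparing the per-document score distributions, rather than a single mean, surfaces exactly which document types each tool struggles with.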