Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
MinerU is an open-source document parsing engine from OpenDataLab that converts PDFs, Office files, images, and web pages into LLM-ready Markdown and JSON. Originally built during the pre-training of InternLM to wrangle scientific PDFs, it has matured into a production-grade extraction toolchain used to feed retrieval-augmented generation pipelines and agentic workflows. The project has crossed 63,000 GitHub stars and 5,300+ forks, and the recent license change from AGPLv3 to a custom Apache-2.0-derived MinerU Open Source License has removed the biggest barrier to enterprise adoption.

## What MinerU Solves

Feeding documents to LLMs sounds trivial until you try. Real-world PDFs contain multi-column layouts, rotated tables, vector math, scanned pages mixed with born-digital text, headers and footers that leak into context windows, and figures that lose all meaning without their captions. MinerU is built to handle exactly these cases. It runs layout analysis, OCR fallback, formula recognition, and table extraction in one pipeline, then emits a clean Markdown document with reading order preserved, tables in HTML, formulas in LaTeX, and figures linked with descriptive captions. The same source PDF that produces unusable noise from a naive text-extraction library becomes a structured document an LLM can actually reason over.

## Inputs, Outputs, and Coverage

MinerU accepts PDFs, DOCX, PPTX, XLSX, common image formats, and web pages. For Office formats it parses natively rather than round-tripping through a PDF conversion step, which preserves more structure. Outputs include a Markdown variant tuned for multimodal LLMs that retains image references, an NLP-oriented Markdown variant optimized for chunking and embedding, and a reading-order-sorted JSON representation with full bounding boxes and per-block metadata.
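That reading-order JSON lends itself to custom downstream processing. A minimal sketch of rendering blocks back into Markdown, assuming a simplified, hypothetical per-block schema (the real field names and block types may differ between MinerU releases):

```python
import json

# Hypothetical, simplified version of MinerU's reading-order JSON;
# real field names and block types may differ between releases.
raw = """
[
  {"type": "title",    "text": "A Survey of Document Parsing", "bbox": [72, 60, 540, 90]},
  {"type": "text",     "text": "Layout analysis precedes extraction.", "bbox": [72, 100, 540, 140]},
  {"type": "equation", "text": "E = mc^2", "bbox": [150, 150, 460, 180]},
  {"type": "table",    "text": "<table><tr><td>F1</td><td>0.93</td></tr></table>", "bbox": [72, 200, 540, 260]}
]
"""
blocks = json.loads(raw)

def block_to_markdown(block: dict) -> str:
    """Render one layout block as Markdown, keyed on its block type."""
    kind, text = block["type"], block["text"]
    if kind == "title":
        return f"# {text}"
    if kind == "equation":
        return f"$${text}$$"   # formulas arrive as LaTeX
    return text                # tables arrive as HTML, which Markdown passes through

# Blocks are already sorted in reading order, so a plain join reconstructs the page.
markdown = "\n\n".join(block_to_markdown(b) for b in blocks)
print(markdown)
```

The bounding boxes carried on each block also make it straightforward to filter headers and footers by position before the join.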
OCR is automatically triggered on scanned or garbled pages and supports 109 languages, with the project's roots in Chinese scientific literature giving it unusually strong handling of vertical text, seals, and mixed-script documents.

## Backends and Models

Three inference backends ship in the box. The Pipeline backend uses traditional layout models and runs on CPU, scoring 86.2 on OmniDocBench v1.5 with modest hardware requirements. The VLM-Engine backend serves the new MinerU2.5-Pro-2604-1.2B vision-language model through vLLM or LMDeploy, trading more GPU for higher accuracy on dense scientific layouts, image and chart recognition, and cross-page table merging. The Hybrid-Engine combines native text extraction with selective VLM passes to suppress hallucinations on text that can be read directly, which makes it the most production-friendly default for mixed corpora.

## Deployment Surface

MinerU is unusually thorough on the deployment side. A desktop client, a web UI, Python, Go, and TypeScript SDKs, a REST API, a Docker image, and an MCP server are all maintained by the core team. The MCP integration makes the parser callable from Claude, Cursor, and Windsurf without writing glue code. Native compatibility with LangChain, Dify, and FastGPT means the output drops directly into existing RAG stacks. There is also explicit support for more than ten domestic Chinese AI accelerators, including Ascend, Cambricon, and Moore Threads, alongside CUDA GPUs and Apple Silicon.

## Performance and Hardware

The Pipeline backend runs comfortably on a 16GB RAM laptop with no GPU. VLM backends need 2GB to 8GB of VRAM depending on backend choice, with Volta or newer NVIDIA GPUs or Apple Silicon recommended. Multi-threaded concurrent inference and thread-safe APIs are first-class concerns, which matters for high-throughput document ingestion pipelines.

## Limitations

A few sharp edges remain.
The custom MinerU Open Source License is friendlier than AGPLv3 but is not OSI-approved, so legal teams should still read it. The latest VLM checkpoints carry their own model-license terms separate from the code license. Very long documents with deeply nested layouts can still produce reading-order glitches that require human cleanup. Comparisons to Marker, Docling, and Unstructured.io vary by document type, so teams running MinerU at scale typically benchmark on a representative slice of their own corpus rather than trusting any single public score.
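That last point is easy to operationalize. A rough sketch of corpus-slice benchmarking, assuming you have hand-checked reference Markdown for a small sample of your own documents; the character-level similarity ratio here is a crude illustrative stand-in, not OmniDocBench's actual scoring method:

```python
from difflib import SequenceMatcher

def extraction_score(predicted: str, ground_truth: str) -> float:
    """Character-level similarity in [0, 1] between parser output and a
    hand-checked reference. A crude stand-in for real benchmark scoring."""
    return SequenceMatcher(None, predicted, ground_truth).ratio()

# Hypothetical parser outputs paired with hand-labeled references
# for a representative slice of a corpus.
samples = {
    "invoice_001": ("Total: $1,299.00", "Total: $1,299.00"),
    "paper_017":   ("## 3 Method\nWe train on...", "## 3 Methods\nWe train on..."),
    "scan_042":    ("garbled ~~ output", "Quarterly revenue rose 4%."),
}

scores = {doc: extraction_score(pred, truth) for doc, (pred, truth) in samples.items()}
mean_score = sum(scores.values()) / len(scores)

# Worst documents first: these are the ones worth inspecting by hand.
for doc, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{doc}: {s:.2f}")
print(f"mean: {mean_score:.2f}")
```

Running the same slice through each candidate parser (MinerU, Marker, Docling, Unstructured.io) and comparing the per-document score distributions, rather than a single mean, surfaces exactly which document types each tool struggles with.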