Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Marker is an open-source document conversion engine from Datalab that turns PDFs, images, PPTX, DOCX, XLSX, HTML, and EPUB files into clean Markdown, JSON, chunks, and HTML. With 36,014 GitHub stars under the GPL-3.0 code license, it has become one of the most widely adopted tools for the unglamorous but foundational task of getting messy real-world documents into a structured form that LLMs, RAG pipelines, and data workflows can actually use.

## The Problem It Solves

Most enterprise knowledge still lives in PDFs and office documents that were designed for human eyes, not machines: multi-column layouts, scanned pages, footnotes, equations, nested tables, headers and footers that repeat on every page. Naive text extraction produces a jumble of out-of-order fragments, broken tables, and OCR noise that poisons any downstream retrieval or summarization step. Marker exists to close that gap. It parses document structure rather than just scraping characters, reconstructing reading order, preserving tables and math, and emitting output that maps cleanly onto the chunking and embedding stages of a retrieval pipeline.

## Broad Format and Language Coverage

Marker is not a PDF-only tool. It converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files across all languages, which makes it a single dependency for teams that previously stitched together separate libraries for each file type. It formats tables, forms, equations, inline math, links, references, and code blocks; extracts and saves embedded images; and removes headers, footers, and other layout artifacts that would otherwise contaminate the text. For documents with bad or missing text layers, a `--force_ocr` flag runs OCR across all lines, while `strip_existing_ocr` keeps digital text and discards prior low-quality OCR.
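
To make the link between structure-preserving Markdown and the chunking stage concrete, here is a minimal sketch of a heading-based splitter that a downstream pipeline might run over converted output. It is not Marker's own chunk format or API: the function name `chunk_markdown` and the sample text are hypothetical, and the splitting rule (one chunk per ATX heading, with fenced code left intact) is just one reasonable assumption.

```python
import re
from typing import Dict, List

# ATX headings such as "# Title" or "## Results" (assumed Markdown convention).
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading, keeping tables and code whole."""
    chunks: List[Dict[str, str]] = []
    title = ""
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the current section if it has a heading or any non-blank text.
        if title or any(line.strip() for line in body):
            chunks.append({"heading": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # do not treat "# ..." inside code as a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Title\n\nIntro text.\n\n## Results\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    for chunk in chunk_markdown(sample):
        print(chunk["heading"], "->", len(chunk["text"]), "chars")
```

Because tables and equations arrive as intact blocks under their headings, a splitter like this never has to guess where a table begins or ends, which is the property naive text extraction loses.
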
## Structured Extraction and Hybrid LLM Mode

Beyond plain conversion, Marker supports structured extraction against a user-supplied JSON schema, letting teams pull specific fields out of forms and reports directly into typed data. Its hybrid mode, enabled with `--use_llm`, brings a language model into the loop to handle the hardest cases: merging tables that span page breaks, formatting inline math as LaTeX, and extracting values from forms. The mode works with Gemini or Ollama models and defaults to `gemini-2.0-flash`. Datalab's published benchmarks show this hybrid path delivering higher table accuracy than either Marker alone or a general-purpose multimodal model alone, which is the practical sweet spot for accuracy-sensitive document processing.

## Performance and Hardware Flexibility

Marker runs on GPU, CPU, or Apple Silicon MPS, with the torch device auto-detected and overridable via `TORCH_DEVICE`. Single-page serial throughput already benchmarks favorably against cloud services like LlamaParse and Mathpix as well as other open-source tools, but the real advantage shows in batch mode, where Datalab projects throughput of roughly 25 pages per second on an H100. That batch performance is what makes Marker viable as the ingestion layer for large document corpora rather than just an occasional one-off converter.

## Installation and Usage

The tool installs as a single pip package, `marker-pdf`, with a `[full]` extra for non-PDF formats. A command-line entry point, `marker_single`, converts one file, while a Streamlit-based `marker_gui` offers an interactive way to try options without writing code. This low-friction onboarding (`pip install`, then one command) is a large part of why Marker spread so quickly among developers building RAG systems who needed a dependable preprocessing step without standing up a service.
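
As a rough illustration of that one-command workflow, the sketch below assembles a `marker_single` invocation from Python using only the names this article mentions: the `--force_ocr` and `--use_llm` flags and the `TORCH_DEVICE` environment variable. The helpers `build_marker_command` and `run_marker` are hypothetical, `report.pdf` is a placeholder, and passing the file path as the first positional argument is an assumption about the CLI. Nothing runs unless `run_marker` is called explicitly.

```python
import os
import shlex
import shutil
import subprocess
from typing import Dict, List, Optional, Tuple


def build_marker_command(
    input_path: str,
    force_ocr: bool = False,
    use_llm: bool = False,
    torch_device: Optional[str] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a single-file conversion without executing it."""
    argv = ["marker_single", input_path]
    if force_ocr:
        argv.append("--force_ocr")  # OCR every line, for bad or missing text layers
    if use_llm:
        argv.append("--use_llm")  # hybrid mode with a language model in the loop
    env = dict(os.environ)
    if torch_device is not None:
        env["TORCH_DEVICE"] = torch_device  # override the auto-detected device
    return argv, env


def run_marker(argv: List[str], env: Dict[str, str]) -> int:
    """Run the command if the CLI is installed; return its exit code."""
    if shutil.which(argv[0]) is None:
        print("marker_single not found; try: pip install marker-pdf")
        return 127
    return subprocess.run(argv, env=env, check=False).returncode


if __name__ == "__main__":
    argv, env = build_marker_command("report.pdf", force_ocr=True, torch_device="cpu")
    print(shlex.join(argv))  # prints: marker_single report.pdf --force_ocr
```

Keeping command construction separate from execution makes the options easy to inspect or log before a long batch run starts.
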
## Licensing Considerations

Marker's code is licensed under GPL-3.0, and its model weights use a modified AI Pubs Open RAIL-M license that is free for research, personal use, and startups under \$2M in funding or revenue. Broader commercial use, or removing the GPL obligations, requires a commercial license from Datalab, which also offers a managed platform running their newer Chandra model with zero data retention by default and SOC 2 Type 2 compliance. Teams evaluating Marker should weigh the GPL and RAIL-M terms against their deployment model; for open-source projects, research, and early-stage startups, the free terms are generous, while larger commercial deployments need to factor in licensing.

## Why It Matters

Document ingestion is the quiet bottleneck of most production AI systems: the quality of every answer a RAG system gives is capped by the quality of the text that went into it. With 36,014 stars and 2,488 forks, Marker has become a default answer to that problem in the open-source ecosystem: broad format coverage, structure-aware parsing, optional LLM-assisted accuracy, and throughput high enough for real corpora. In 2026, as more teams build retrieval pipelines over their own documents, Marker occupies the position of dependable, well-benchmarked infrastructure at the very start of that pipeline.