Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
olmOCR is an open-source toolkit from the Allen Institute for AI (Ai2) for converting PDFs and image-based documents into clean, readable plain text and Markdown. Built around a 7B-parameter vision-language model and released under the Apache 2.0 license, it has grown past 18,000 GitHub stars as a go-to solution for preparing document collections for LLM training datasets and RAG pipelines.

## Vision-Language OCR, Not Rule-Based Parsing

Rather than relying on brittle layout heuristics, olmOCR uses a fine-tuned VLM to read pages the way a human would. It handles equations, tables, handwriting, and complex formatting, reconstructs a natural reading order even across multi-column layouts, figures, and insets, and automatically strips headers and footers. Input can be PDF, PNG, or JPEG; output is clean Markdown ready for downstream use.

## Cost-Efficient at Scale

The pipeline is engineered for large corpora: Ai2 reports conversion costs under $200 per million pages using the vLLM-based inference pipeline. Docker images are officially supported, and the toolkit includes batching and retry logic tuned for processing millions of documents. A GPU is required since inference runs a 7B VLM.

## olmOCR 2 and an Open Benchmark

The olmOCR-2 model release, trained with synthetic data and reinforcement learning, lifted accuracy by roughly four points on olmOCR-Bench, the project's own benchmark suite covering more than 7,000 test cases across 1,400 documents. Both the models (published on Hugging Face in FP8) and the full training code are open, so teams can fine-tune their own OCR models rather than treating the system as a black box.

## Considerations

Running a 7B VLM means olmOCR needs GPU hardware, and throughput on a single consumer card is modest compared with cloud OCR APIs; its economics shine at batch scale rather than for one-off documents. Purely digital PDFs with embedded text may not need VLM-based OCR at all.
For research groups and companies building LLM training corpora or document-heavy RAG systems from scanned or complex PDFs, though, olmOCR offers state-of-the-art quality with fully open models, code, and benchmarks.
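
To make the batch-scale economics concrete, the reported ceiling of $200 per million pages can be turned into a rough budget estimate. The sketch below is illustrative only: the function name, its parameters, and the example corpus are hypothetical, and the single figure taken from the text is the upper bound Ai2 reports. Actual costs will depend on hardware, GPU pricing, and document complexity.

```python
# Back-of-the-envelope budget sketch (not part of olmOCR itself).
# The only number taken from the article is the reported upper bound of
# $200 per million pages; everything else here is a hypothetical example.

REPORTED_MAX_USD_PER_MILLION_PAGES = 200.0


def estimate_cost_usd(
    num_documents: int,
    avg_pages_per_document: float,
    usd_per_million_pages: float = REPORTED_MAX_USD_PER_MILLION_PAGES,
) -> float:
    """Return an upper-bound conversion cost in USD for a corpus."""
    if num_documents < 0 or avg_pages_per_document < 0:
        raise ValueError("document and page counts must be non-negative")
    total_pages = num_documents * avg_pages_per_document
    return total_pages / 1_000_000 * usd_per_million_pages


if __name__ == "__main__":
    # Hypothetical corpus: 250,000 PDFs averaging 12 pages each (3M pages).
    cost = estimate_cost_usd(num_documents=250_000, avg_pages_per_document=12)
    print(f"Estimated upper bound: ${cost:,.2f}")
```

At that rate a single page costs at most $0.0002, so the fixed overhead of provisioning a GPU tends to dominate for small jobs, which is why the figure is most meaningful for large batches.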