Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
LiteParse is run-llama's new open-source document parser, built as a fast, local-first alternative to cloud parsing services. Rather than chasing every exotic layout, it focuses on doing standard parsing extremely well: high-quality spatial text extraction with bounding boxes, optional OCR, and clean JSON or text output, all running on the user's own machine with no calls to proprietary LLM APIs. The project crossed 7,000 GitHub stars within weeks of launch and has become a fixture in self-hosted RAG and agent stacks.

## Why a New Parser

The document parsing space is bifurcated. On one end sit minimal libraries that pull text but lose layout. On the other sit cloud services and large vision-language stacks that do beautiful structured extraction but require API keys, network access, and per-page costs. LiteParse occupies the middle, delivering production-quality spatial parsing locally and quickly. The maintainers are explicit about the trade-off: for very complex documents (dense tables, charts, handwritten text), the cloud-based LlamaParse still does better, but for the long tail of normal documents, LiteParse is meant to be the default.

## Rust Core, Multi-Language Bindings

The codebase is 70% Rust, with the core library and CLI built around a Rust workspace that wraps PDFium for rendering and text extraction. From that core, the project publishes first-class bindings for Node.js and TypeScript via @llamaindex/liteparse, Python via pip install liteparse, native Rust via cargo, and a WebAssembly build via @llamaindex/liteparse-wasm that runs directly in browsers. Bindings are generated with napi-rs, PyO3, and wasm-bindgen, so they expose the same fast core to every common runtime.

## Output: Spatial Text and Screenshots

A defining feature is that LiteParse emits spatial text with bounding boxes alongside cleaned plain text. That makes it natural to feed into retrieval pipelines that need page positions, or into vision-language models that need page screenshots.
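
To make the idea of spatial text concrete, here is a minimal Python sketch of how a retrieval pipeline might turn positioned text fragments into line-level chunks that keep their page and bounding box. The `TextItem` fields and the `group_into_lines` helper are hypothetical names invented for this illustration; they are assumptions, not LiteParse's actual output schema or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TextItem:
    """One extracted text fragment with its box (hypothetical shape, not LiteParse's schema)."""
    text: str
    page: int
    x: float       # left edge, origin assumed at the top-left of the page
    y: float       # top edge
    width: float
    height: float


def group_into_lines(items: List[TextItem], y_tolerance: float = 2.0) -> List[Dict]:
    """Group fragments into lines in reading order, keeping a merged bounding box per line."""
    lines: List[List[TextItem]] = []
    # Sort by page, then top edge, then left edge so that neighbours are adjacent.
    for item in sorted(items, key=lambda i: (i.page, i.y, i.x)):
        current = lines[-1] if lines else None
        same_line = (
            current is not None
            and current[0].page == item.page
            and abs(current[0].y - item.y) <= y_tolerance
        )
        if same_line:
            current.append(item)
        else:
            lines.append([item])

    chunks = []
    for line in lines:
        ordered = sorted(line, key=lambda i: i.x)  # left-to-right within the line
        chunks.append({
            "page": ordered[0].page,
            "text": " ".join(i.text for i in ordered),
            # Merged box as (left, top, right, bottom).
            "bbox": (
                min(i.x for i in ordered),
                min(i.y for i in ordered),
                max(i.x + i.width for i in ordered),
                max(i.y + i.height for i in ordered),
            ),
        })
    return chunks
```

Each resulting chunk carries a page number and a box, so a retriever could cite or highlight the exact region a passage came from. The tolerance-based grouping is a deliberately simple heuristic chosen for the sketch; real layouts with multiple columns would need something more careful.
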
The CLI can generate high-DPI page screenshots specifically for LLM agents that want to see what they are reading, not just the OCR.

## Flexible OCR

Tesseract ships bundled with the library so basic OCR works out of the box with zero setup. Beyond that, LiteParse defines an OCR API specification (OCR_API_SPEC.md) that lets users plug in EasyOCR, PaddleOCR, or custom OCR servers via simple HTTP endpoints. This separation is pragmatic: teams that already run a tuned PaddleOCR cluster can keep using it; teams that just want something to work can ignore the layer.

## Format Coverage

Through LibreOffice and ImageMagick, LiteParse parses Word, PowerPoint, and spreadsheet families (.docx, .pptx, .xlsx, .odt, .rtf, .pages, .key, .numbers, .csv) as well as common image formats (JPG, PNG, TIFF, WEBP, SVG). Internally everything routes through PDF before being parsed, which keeps the core pipeline narrow and well tested.

## CLI and Production Features

The CLI supports batch directory processing, page-range targeting, OCR language configuration, screenshot generation with DPI customization, encryption password handling, and tunable concurrent workers. The repository ships Docker images (Docker v2.0.3 landed on 2026-05-28) for teams that want to drop LiteParse into existing data pipelines without compiling Rust.

## Where It Fits

LiteParse is the right pick when you need predictable, fast, local PDF and document parsing for RAG, agents, or back-office automation. The fact that the core is Rust and that the WASM build runs in browsers makes it unusually flexible: you can parse on a beefy server, on a developer laptop, or directly inside a web app without round-tripping documents to a cloud service.

## Limitations

LiteParse is deliberately scoped. It does not attempt to parse dense multi-column scientific layouts, complex tables, charts, or handwritten content as well as VLM-based parsers or LlamaParse.
OCR quality outside the bundled Tesseract path depends on which backend you wire up. And while bindings exist for many languages, the canonical examples and documentation lean toward Python and Node.js first, with Rust users expected to read the workspace source directly.