Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Chandra is an optical character recognition model developed by Datalab that converts images and PDFs into structured HTML, Markdown, and JSON while preserving document layout. It excels at the document elements that traditionally cause the most errors: complex nested tables, mathematical equations, handwritten text, form fields with checkboxes, and mixed-language documents. With support for 90+ languages across Latin, CJK, Arabic, Devanagari, and Cyrillic scripts, Chandra achieves 86.7% overall accuracy on diverse document benchmarks, topping the external olmOCR benchmark. It offers two inference modes: local inference via a HuggingFace backend for privacy-sensitive documents, and remote inference via a vLLM server, which reaches approximately 1.44 pages/second on NVIDIA H100 hardware. Install it with pip install chandra-ocr; optional extras add the HuggingFace backend and a Streamlit web interface. The Chandra 2 release in March 2026 brought significant improvements in table recognition accuracy and processing speed. The code is Apache 2.0 licensed, while the model weights use a modified OpenRAIL-M license that permits research and personal use, with commercial licensing available.
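To put the quoted throughput in practical terms, here is a minimal back-of-the-envelope sketch. It assumes the approximate 1.44 pages/second figure holds constantly on a single H100 via vLLM, which real workloads may not match; the function names are illustrative and are not part of Chandra.

```python
# Rough capacity planning from the throughput figure quoted above.
# Assumption: throughput stays constant at ~1.44 pages/second (one H100, vLLM).
PAGES_PER_SECOND = 1.44


def estimated_seconds(num_pages: int, pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Return a rough wall-clock estimate in seconds for num_pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return num_pages / pages_per_second


def pages_per_hour(pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Return the number of pages processed in one hour at the given rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages_per_second * 3600


if __name__ == "__main__":
    minutes = estimated_seconds(1000) / 60
    print(f"1,000 pages: about {minutes:.1f} minutes")
    print(f"Hourly capacity: about {pages_per_hour():.0f} pages")
```

Under these assumptions, a 1,000-page batch takes roughly 11.6 minutes, and one GPU handles about 5,184 pages per hour.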