Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
PageIndex is an open-source framework by VectifyAI that replaces traditional vector-database RAG with a vectorless, reasoning-based retrieval approach. Instead of chunking documents and performing similarity search, PageIndex converts PDFs and markdown files into hierarchical semantic tree structures and uses LLM reasoning to navigate them, much like an expert scanning a table of contents. The system achieved 98.7% accuracy on FinanceBench through its Mafin 2.5 financial RAG implementation, significantly outperforming conventional vector-based solutions on professional document analysis tasks.

## Why PageIndex Matters

Retrieval-Augmented Generation has become the standard approach for grounding LLM outputs in external documents, but conventional RAG pipelines suffer from well-documented limitations. Chunking strategies fragment context across arbitrary boundaries, embedding models lose nuance in semantic compression, and vector similarity search often retrieves superficially related but contextually irrelevant passages. PageIndex addresses all three problems simultaneously by eliminating the vector pipeline entirely.

The key insight is that documents already have structure. Financial reports have sections, subsections, tables, and appendices organized in a logical hierarchy. Rather than destroying this structure through chunking, PageIndex preserves it and teaches an LLM to navigate it the way a human analyst would, following the document's own organizational logic to locate relevant information.

## Tree-Based Document Indexing

The indexing phase converts each document into a semantic tree structure. This is not a simple table-of-contents extraction: PageIndex analyzes the logical relationships between sections, identifies parent-child hierarchies, and creates navigable tree nodes with summary metadata at each level. The resulting tree captures both the structural organization and the semantic content of the document, enabling precise traversal during retrieval.
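To make the tree concrete, here is a minimal Python sketch of what such a node might carry. The class and field names are hypothetical illustrations of the structure described here, not PageIndex's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical node layout for a PageIndex-style semantic tree; names are
# illustrative, not the framework's real schema.
@dataclass
class TreeNode:
    title: str        # section heading from the document
    summary: str      # summary metadata describing this node's content scope
    page_start: int   # page-level pointers back to the source document
    page_end: int
    children: list["TreeNode"] = field(default_factory=list)

# A miniature tree for an annual report.
report = TreeNode(
    "2023 Annual Report", "Full-year results and outlook.", 1, 120,
    children=[
        TreeNode("Management Discussion", "Narrative analysis of results.", 5, 40),
        TreeNode(
            "Financial Statements", "Balance sheet, income, cash flow.", 41, 120,
            children=[TreeNode("Income Statement", "Revenue and expenses.", 45, 52)],
        ),
    ],
)
```

Because every node keeps page-level pointers, anything retrieved from the tree can be cited back to exact pages of the original document.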
Each node in the tree contains a summary of its content scope, references to child nodes, and page-level pointers back to the original document. This means retrieval results always come with exact page citations, providing full traceability that vector-based systems cannot match.

## LLM-Powered Reasoning Retrieval

During retrieval, PageIndex uses an LLM to reason about which branches of the document tree are most relevant to a given query. Inspired by the Monte Carlo Tree Search behind AlphaGo, the system evaluates potential paths through the document hierarchy, selecting the most promising branches for deeper exploration. This approach is inherently more context-aware than cosine similarity because the LLM can understand the relationship between the query intent and the document's organizational structure.

The reasoning process is fully explainable. Each retrieval decision produces a trace showing which branches were explored, which were pruned, and why specific sections were selected. This level of transparency is critical for professional applications where users need to verify the basis for AI-generated answers.

## 98.7% Accuracy on FinanceBench

The most compelling evidence for PageIndex's effectiveness comes from its performance on FinanceBench, a benchmark designed to evaluate RAG systems on realistic financial document questions. Mafin 2.5, a financial analysis system built on PageIndex, achieved 98.7% accuracy, establishing a new state of the art for document analysis. This result demonstrates that reasoning-based retrieval can dramatically outperform similarity-based approaches when documents have rich structural content.

FinanceBench questions require understanding multi-section financial reports, cross-referencing between tables and narrative text, and precise numerical extraction. These are exactly the tasks where traditional chunking-based RAG struggles most, because relevant information is often distributed across multiple non-contiguous sections.
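The explore-or-prune traversal with a retrieval trace can be sketched as a pruned tree search. In this toy sketch the node layout is hypothetical, and `score_relevance` is a keyword-overlap stand-in for the LLM's relevance judgment, which in the real system is a reasoning step rather than a numeric score:

```python
# Hypothetical sketch of reasoning-guided retrieval over a document tree.
# Nodes are plain dicts; score_relevance is a toy stand-in for an LLM call.

def score_relevance(query: str, summary: str) -> float:
    # Toy keyword overlap between query and node summary.
    q, s = set(query.lower().split()), set(summary.lower().split())
    return len(q & s) / max(len(q), 1)

def retrieve(node, query, threshold=0.3, trace=None):
    """Descend into branches judged relevant, prune the rest, keep a trace."""
    if trace is None:
        trace = []
    score = score_relevance(query, node["summary"])
    decision = "explored" if score >= threshold else "pruned"
    trace.append((node["title"], decision))
    if decision == "pruned":
        return [], trace
    if not node["children"]:  # leaf: emit a page-level citation
        return [(node["title"], node["pages"])], trace
    hits = []
    for child in node["children"]:
        child_hits, _ = retrieve(child, query, threshold, trace)
        hits.extend(child_hits)
    return hits, trace

tree = {
    "title": "Annual Report", "summary": "revenue income statement outlook",
    "pages": (1, 120),
    "children": [
        {"title": "Outlook", "summary": "guidance and outlook",
         "pages": (5, 10), "children": []},
        {"title": "Income Statement", "summary": "revenue and operating income",
         "pages": (45, 52), "children": []},
    ],
}

hits, trace = retrieve(tree, "operating income revenue")
```

Here `hits` carries page-cited results and `trace` records which branches were explored or pruned, mirroring the explainability property described above.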
## Deployment Flexibility

PageIndex supports multiple deployment modes. Self-hosted installation requires only Python 3 and an OpenAI API key, making it accessible to individual developers and small teams. A cloud service provides a ChatGPT-style interface optimized for document analysis, with MCP integration for tool-calling workflows. Enterprise customers can deploy on-premises with private model configurations.

The framework is model-agnostic at the reasoning layer. While the defaults use OpenAI models, the tree-search and reasoning components can work with any LLM that supports structured output, including open-source alternatives.

## Practical Applications

PageIndex excels in domains with long, structured documents: financial analysis of SEC filings and annual reports, legal review of contracts and regulatory filings, academic research across multi-chapter textbooks, and technical documentation spanning hundreds of pages. Any use case where documents exceed LLM context windows and contain hierarchical organization benefits from this approach.

## Limitations

The reasoning-based approach incurs higher per-query LLM costs than vector similarity search, since each retrieval requires multiple LLM calls to traverse the tree. Documents without clear structural organization, such as unformatted text dumps, may not benefit significantly from tree-based indexing. The system currently requires OpenAI-compatible APIs, and the self-hosted setup lacks the convenience of managed vector database services like Pinecone or Weaviate.