Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
PageIndex is an open-source document indexing framework from VectifyAI that abandons vector embeddings entirely and instead uses LLM reasoning to navigate hierarchical tree indices, providing the long-context retrieval backbone that modern AI agents need. With over 30,800 GitHub stars and 2,600 forks, the project has emerged as a leading alternative to traditional vector RAG for agentic workflows that operate over complex professional documents.

## Why Vectorless RAG Matters for Agents

Vector-based RAG pipelines have been the default infrastructure for AI assistants, but they break down on long, structured, domain-specific documents. Embeddings collapse rich hierarchical context into flat similarity scores, artificial chunking destroys document structure, and retrieval is opaque, making it hard for agents to justify their citations. PageIndex was designed specifically to fix these failure modes for agentic systems that need to reason over financial reports, regulatory filings, legal documents, and academic textbooks.

Instead of similarity search, PageIndex builds a semantic tree that resembles a machine-optimized table of contents. Each node carries an LLM-generated summary, page range, and node ID. At query time, an LLM agent traverses the tree by reasoning, picking the most relevant branches based on logical inference rather than vector proximity. The result is explainable retrieval with traceable page references that agents can cite directly.

## Core Architecture

PageIndex operates in two phases. The Index Generation phase parses a document, extracts its natural hierarchy, and builds a semantic tree with summarized nodes preserving metadata. The Tree Search Retrieval phase exposes that tree to an LLM, which navigates it through structured reasoning to surface relevant sections. Because the tree mirrors the document's actual structure, no artificial chunking is required and nothing is lost across chunk boundaries.

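The two phases described above can be illustrated with a minimal sketch. This is not PageIndex's actual API: the `Node` class, the `tree_search` function, and the sample report are hypothetical, and a simple keyword-overlap score stands in for the LLM's relevance judgment so the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node: an ID, a summary, and the pages it covers."""
    node_id: str
    summary: str
    pages: Tuple[int, int]  # inclusive (start_page, end_page)
    children: List["Node"] = field(default_factory=list)


def keyword_overlap(query: str, summary: str) -> int:
    """Toy stand-in for the LLM's judgment: count words shared by both texts."""
    return len(set(query.lower().split()) & set(summary.lower().split()))


def tree_search(root: Node, query: str,
                score: Callable[[str, str], int] = keyword_overlap) -> List[Node]:
    """Descend from the root, following the best-scoring child at each level.

    Returns the full path of visited nodes, so an answer can cite the leaf's
    page range and also show the route taken to reach it.
    """
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child.summary))
        path.append(node)
    return path


# Illustrative index for an invented annual report (assumed data).
report = Node("0000", "Annual report", (1, 120), [
    Node("0001", "Business overview and strategy", (1, 30)),
    Node("0002", "Financial statements", (31, 90), [
        Node("0003", "Income statement revenue and expenses", (31, 50)),
        Node("0004", "Balance sheet assets and liabilities", (51, 70)),
    ]),
    Node("0005", "Risk factors and legal proceedings", (91, 120)),
])

path = tree_search(
    report, "total liabilities on the balance sheet financial statements"
)
leaf = path[-1]
print(" > ".join(n.node_id for n in path), "pages", leaf.pages)
```

In a real system the `score` callable would be replaced by a prompt that shows the model the child summaries and asks which branch to follow. The returned path is what makes the retrieval traceable: every hop is a named node with a page range.
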
## FinanceBench Performance

Mafin 2.5, a reasoning-based RAG system built on PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems on financial document analysis. This result demonstrates that for documents where structure and reasoning matter more than raw semantic similarity, vectorless retrieval can be dramatically more accurate.

## Agent Integration

PageIndex is delivered as a Python library with multiple deployment options: self-hosted open-source, a cloud-based chat platform, an MCP server for agent integration, and a hosted API. The MCP integration is particularly important for agentic workflows because it lets agent frameworks like Claude Code, Cursor, or custom LangGraph pipelines plug into PageIndex as a structured retrieval tool with no embedding infrastructure required.

## Supported Document Types

The framework targets long-form professional documents including financial reports and SEC filings, regulatory documentation, academic textbooks, legal and technical manuals, long-form PDFs exceeding typical LLM context windows, and properly formatted Markdown files. Cloud variants include enhanced OCR for scanned and complex PDFs.

## Limitations

PageIndex assumes documents have meaningful hierarchical structure; flat or poorly formatted documents do not benefit as much. Tree traversal requires multiple LLM calls per query, which is slower and more expensive than a single vector lookup, especially for simple factual queries. The framework is optimized for retrieval quality on complex documents, not for high-QPS consumer search workloads. Building the index for very large corpora still requires significant LLM throughput up front.

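To make the per-query cost of tree traversal concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model that is not taken from PageIndex itself: one LLM call per node whose children must be ranked, with the search keeping a fixed number of branches per level. The function name and parameters are hypothetical.

```python
def estimated_llm_calls(levels: int, branches_followed: int = 1) -> int:
    """Estimate LLM calls for one query under the simplified model.

    levels: number of tree levels at which a branching decision is made.
    branches_followed: how many children are kept at each level (1 = greedy).
    """
    if levels < 0 or branches_followed < 1:
        raise ValueError("levels must be >= 0 and branches_followed >= 1")
    if levels == 0:
        return 0  # a single-node tree needs no navigation
    # One call at the root, then one call per kept branch at each deeper level.
    return 1 + branches_followed * (levels - 1)


for levels in (2, 3, 4):
    print(levels, estimated_llm_calls(levels), estimated_llm_calls(levels, 2))
```

Under this model the call count grows linearly with tree depth, whereas a vector lookup stays at a single query regardless of document structure. That is the trade-off the limitations above describe.
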
OpenClaw is an open-source, local-first AI gateway with 366K GitHub stars that routes AI responses through WhatsApp, Telegram, Slack, Discord, iMessage, Teams, and 15+ other platforms, with zero cloud dependency.
OpenClaw
Open-source personal AI assistant connecting to 13+ messaging platforms with local gateway architecture, voice support, and multi-agent routing.