Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
OpenAI Evals is an open-source framework for evaluating large language models and LLM-based systems, accompanied by a community-driven registry of benchmarks. With 17,900 GitHub stars and 2,900 forks, it has become a foundational tool for anyone building or deploying LLM applications who needs a systematic evaluation methodology. The framework is released under the MIT license.

## Why Evals Matter

As OpenAI states in the project documentation: "Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case." This insight drives the entire framework. When upgrading models, changing prompts, or modifying system architecture, developers need a reliable way to measure whether those changes improve or degrade performance for their specific applications.

Evals addresses this by providing a structured approach to defining test cases, running them across models, and comparing results quantitatively. Rather than relying on ad-hoc manual testing, teams can build repeatable evaluation suites that catch regressions early.

## Framework Architecture

The framework operates through several key components. Eval templates provide ready-made evaluation patterns that require no custom code: users specify their test data and criteria, and the framework handles execution and scoring. For more nuanced evaluation, model-graded evals use an LLM as a judge to assess the quality of another model's outputs, enabling evaluation of subjective qualities like helpfulness, accuracy, and coherence.

For advanced use cases, the Solvers framework (currently in beta) provides a more flexible abstraction for evaluating complex LLM systems beyond simple input-output pairs. This supports evaluation of multi-step agent workflows, tool-using systems, and chain-of-thought reasoning.
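The core pattern behind eval templates — a set of test cases, a model under test, and a grader that scores each completion — can be sketched in a few lines. This is an illustrative harness, not the Evals framework's actual API: the names `EvalCase`, `run_eval`, and the stub model are invented for this example, and a real model-graded eval would replace `exact_match` with a call to a judge LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    ideal: str  # reference answer to grade against


def run_eval(
    cases: List[EvalCase],
    model: Callable[[str], str],
    grader: Callable[[EvalCase, str], float],
) -> float:
    """Run each case through the model, grade the completion, return mean score."""
    scores = [grader(case, model(case.prompt)) for case in cases]
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    # Stand-in for a real LLM call, for illustration only.
    return {"2+2=": "4", "Capital of France?": "Paris"}.get(prompt, "")


def exact_match(case: EvalCase, completion: str) -> float:
    # Simplest grader; a model-graded eval would ask a judge LLM instead.
    return 1.0 if completion.strip() == case.ideal else 0.0


cases = [EvalCase("2+2=", "4"), EvalCase("Capital of France?", "Paris")]
print(run_eval(cases, toy_model, exact_match))  # 1.0
```

Because the grader is just a function of (case, completion), swapping exact match for an LLM judge changes only that one component — which is essentially why the framework can offer both basic templates and model-graded evals over the same test data.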
## Open Benchmark Registry

The project maintains a community-contributed registry of evaluation benchmarks covering diverse capabilities: reasoning, coding, mathematics, factual knowledge, instruction following, safety, and more. This registry serves as both a resource for testing models and a standard reference for the research community. Contributors can submit new evaluations through pull requests, building a shared library of testing methodology.

## Cross-Model Evaluation

While built by OpenAI, the framework supports evaluating any model that exposes a Chat Completions-compatible API. This means teams can benchmark third-party models, locally hosted models via Ollama or vLLM, and fine-tuned variants against the same evaluation suite. This cross-model capability makes the framework valuable for organizations comparing providers or validating that a smaller, cheaper model meets their quality threshold.

## Dashboard Integration

As of 2026, OpenAI has integrated Evals directly into the OpenAI Dashboard, allowing users to configure and run evaluations through a graphical interface without writing code. The hosted evals product provides an API for programmatic access, making it easier to integrate evaluation into CI/CD pipelines and automated testing workflows.

## Getting Started

Installation is via pip (`pip install evals`), with Python 3.9 or later required. Git-LFS is needed for full registry access. The repository includes comprehensive documentation, example evaluations, and a contribution guide for submitting new benchmarks.
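The cross-model capability described above rests on one idea: the same Chat Completions request payload can be pointed at any compatible endpoint by changing only the base URL and model name. A minimal sketch, with no network calls — the provider table is illustrative (Ollama's OpenAI-compatible endpoint is conventionally served at `localhost:11434/v1`, but the model names here are placeholders):

```python
# Illustrative provider table; entries are examples, not an exhaustive list.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"},
    "ollama": {"base_url": "http://localhost:11434/v1", "model": "llama3"},
}


def chat_request(provider: str, prompt: str) -> dict:
    """Build an identical Chat Completions payload for any configured provider."""
    cfg = PROVIDERS[provider]
    return {
        "url": cfg["base_url"] + "/chat/completions",
        "body": {
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
    }


req = chat_request("ollama", "Say hello")
print(req["url"])  # http://localhost:11434/v1/chat/completions
```

Because only the `url` and `model` fields vary, an evaluation suite built against this interface can score a hosted frontier model and a local fine-tune on exactly the same test cases, which is what makes provider comparisons and cost/quality trade-off studies straightforward.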

Shubhamsaboo
Collection of 100+ production-ready LLM apps with AI agents, RAG, voice agents, and MCP using OpenAI, Anthropic, Gemini, and open-source models
infiniflow
Leading open-source RAG engine with deep document understanding, grounded citations, and agent capabilities, with 73K+ GitHub stars.