Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

TensorZero is an open-source stack for building industrial-grade LLM applications, unifying an LLM gateway, observability, optimization, evaluation, and experimentation into a single cohesive system. With over 11,100 GitHub stars, 780+ forks, and an Apache-2.0 license, TensorZero has reached the #1 spot on GitHub's trending repositories. Backed by a $7.3M seed round and used by organizations ranging from frontier AI startups to Fortune 50 companies, TensorZero addresses a critical gap in the LLM infrastructure landscape: the feedback loop between production inference data and model/prompt optimization.

Most LLM applications today operate as open-loop systems: prompts are tuned by hand, and models are selected on public benchmarks rather than production metrics. TensorZero closes this loop by collecting structured inference data and feedback, then using that data to systematically optimize prompts, models, and inference strategies over time.

## Architecture and Design

TensorZero is built as a modular stack where each component can be adopted incrementally:

| Component | Technology | Purpose |
|-----------|-----------|---------|
| Gateway | Rust | Unified LLM API with <1ms p99 latency at 10k+ QPS |
| Observability | ClickHouse | Structured storage of inferences and feedback |
| Optimization | Python/Rust | Prompt tuning, model selection, DICL |
| Evaluation | Python | Heuristic and LLM-judge benchmarking |
| Experimentation | Rust | A/B testing, routing, fallbacks |
| UI Dashboard | TypeScript | Visual inference exploration and analysis |

The gateway is the entry point, written in Rust for performance. It adds less than 1 millisecond of latency at the 99th percentile while handling over 10,000 queries per second. This is not a Python wrapper around API calls; it is a purpose-built high-performance proxy that handles routing, retries, fallbacks, load balancing, and rate limiting.
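To make the gateway's reliability behavior concrete, here is a minimal Python sketch of the general try-retry-fall-back pattern it applies. This is illustrative only, not TensorZero's Rust implementation: the provider functions, `infer_with_fallback` name, and parameters are all hypothetical.

```python
# Illustrative sketch of provider fallback with retries (the real gateway
# implements this in Rust, with load balancing and rate limiting on top).
import time

def infer_with_fallback(providers, prompt, retries=2, backoff=0.0):
    """Try each provider in priority order; retry transient failures before falling back."""
    errors = []
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(prompt)  # provider is any callable returning text
            except Exception as exc:  # in production, catch only transient errors
                errors.append((getattr(provider, "__name__", "provider"), exc))
                time.sleep(backoff * (2 ** attempt))  # exponential backoff between retries
    raise RuntimeError(f"all providers failed: {errors}")

# Hypothetical providers: the first always fails, the second succeeds.
def flaky_provider(prompt):
    raise ConnectionError("upstream timeout")

def stable_provider(prompt):
    return f"summary of: {prompt}"

print(infer_with_fallback([flaky_provider, stable_provider], "an article"))
# -> summary of: an article
```

The point is that the caller sees a single successful response even though the first provider was down, which is the availability property the gateway provides transparently.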
Inferences and feedback flow into ClickHouse, a columnar database optimized for analytical queries. This choice enables fast aggregation over millions of inference records, which is essential for the optimization and evaluation components.

The optimization layer analyzes collected data to improve system performance. The most distinctive feature here is Dynamic In-Context Learning (DICL), which automatically selects relevant historical examples to include in prompts based on the current query, achieving fine-tuning-like improvements without any model training.

## Key Capabilities

**Universal LLM Gateway**: TensorZero supports every major LLM provider through a single API: Anthropic, OpenAI, Google (Gemini and Vertex), AWS Bedrock, Azure, DeepSeek, Fireworks, Groq, Together, Mistral, xAI, and self-hosted models via vLLM, TGI, and llama.cpp. Integration requires a single API call or any OpenAI-compatible SDK.

**Sub-Millisecond Latency**: The Rust gateway consistently delivers under 1ms of overhead at the 99th percentile. For latency-sensitive applications like chatbots and real-time coding assistants, TensorZero adds negligible delay to inference requests.

**Dynamic In-Context Learning (DICL)**: DICL is an inference-time optimization that enhances LLM performance by automatically incorporating relevant historical examples into prompts. Unlike fine-tuning, DICL requires no model training, works across providers, and improves as more production data is collected. It is available out of the box, with no configuration beyond enabling it.

**Structured Inference**: TensorZero supports tool use, structured JSON outputs, batch inference, embeddings, multimodal inputs (images and files), and caching. Prompt templates and schemas enforce a structured interface between applications and LLMs.

**Built-in Experimentation**: A/B testing, routing strategies, and feature flags are first-class concepts.
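The intuition behind DICL can be sketched in a few lines: embed the incoming query, retrieve the most similar historical (input, output) pairs, and prepend them to the prompt. The bag-of-words similarity and in-memory example list below are illustrative stand-ins; TensorZero works with real embeddings over data stored in ClickHouse, and the function names here are hypothetical.

```python
# Illustrative sketch of dynamic in-context learning (DICL): choose the
# historical examples most similar to the current query and include them
# in the prompt. Word-overlap similarity stands in for learned embeddings.
import math
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words vectors (stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_dicl_prompt(query: str, history: list[tuple[str, str]], k: int = 2) -> str:
    """Select the k most similar (input, output) pairs and prepend them as demos."""
    ranked = sorted(history, key=lambda ex: similarity(query, ex[0]), reverse=True)
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in ranked[:k])
    return f"{demos}\nQ: {query}\nA:"

# Hypothetical historical examples collected from production traffic.
history = [
    ("summarize this earnings report", "Revenue grew 12%..."),
    ("translate this sentence to French", "Bonjour..."),
    ("summarize this news article", "The article covers..."),
]
print(build_dicl_prompt("summarize this blog article", history, k=2))
```

Because the retrieved examples come from production data, the prompt improves automatically as more inferences and feedback accumulate, with no training step.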
Developers can test new prompts or models against production traffic with statistical rigor, rather than relying on offline benchmarks.

**High Availability**: Routing, retries, fallbacks, load balancing, and granular timeouts keep the system operational even when individual providers experience outages. Rate limiting with custom scopes prevents cost overruns.

**Cost and Usage Tracking**: Every inference is tagged with cost data, enabling precise billing attribution and budget enforcement across teams, features, or customers.

## Developer Integration

TensorZero provides three integration paths: a Python SDK, any OpenAI-compatible SDK, or direct HTTP API calls. A minimal integration looks like:

```python
from tensorzero import TensorZeroGateway

with TensorZeroGateway("http://localhost:3000") as client:
    response = client.inference(
        function_name="generate_summary",
        input={"messages": [{"role": "user", "content": "Summarize this article..."}]},
    )
```

Configuration is declarative via TOML files that define functions, prompts, models, and routing strategies. This makes the system auditable and version-controllable. Feedback collection is equally simple: after an inference, log a metric (boolean, float, or categorical) tied to the inference ID. The optimization system uses this feedback to improve future inferences.

## Limitations

TensorZero requires running additional infrastructure (the gateway service and ClickHouse). For simple applications with a single model and no optimization needs, this overhead may not be justified. The declarative configuration, while powerful, requires upfront investment in defining functions, prompts, and variants. The optimization features are most valuable with significant inference volume; small-scale applications may not generate enough data for meaningful improvements. Self-hosted model support depends on OpenAI-compatible API endpoints, which not all serving frameworks expose identically.
The DICL feature requires historical data to be effective, meaning it provides limited value at cold start.

## Who Should Use This

TensorZero is built for engineering teams running LLM applications at scale who need more than a simple API wrapper. Organizations spending significant amounts on LLM inference will benefit from the cost tracking and optimization capabilities. Teams operating across multiple LLM providers who need a unified gateway with high availability gain immediate value. Companies requiring systematic experimentation, such as A/B testing prompts in production, will find the built-in experimentation framework essential. ML engineering teams looking to close the feedback loop between production data and model performance should evaluate TensorZero as their inference infrastructure layer.
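As a closing illustration, the feedback loop that runs through this whole piece can be sketched end to end: record an inference, attach a metric to its inference ID, and aggregate metrics per variant to decide what to optimize next. The in-memory store and function names below are a hypothetical stand-in for the gateway-plus-ClickHouse pipeline, not TensorZero's actual API.

```python
# Hypothetical in-memory sketch of the inference -> feedback -> analysis
# loop that TensorZero implements with its gateway and ClickHouse.
import uuid
from collections import defaultdict

inferences = {}               # inference_id -> {"variant": ..., "output": ...}
feedback = defaultdict(list)  # metric name -> [(inference_id, value), ...]

def record_inference(variant: str, output: str) -> str:
    """Store an inference and return the ID that feedback will reference."""
    inference_id = str(uuid.uuid4())
    inferences[inference_id] = {"variant": variant, "output": output}
    return inference_id

def log_feedback(metric: str, inference_id: str, value) -> None:
    """Value may be boolean, float, or categorical, keyed to an inference ID."""
    feedback[metric].append((inference_id, value))

def success_rate_by_variant(metric: str) -> dict[str, float]:
    """Aggregate boolean feedback per variant -- the data optimization runs on."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [successes, count]
    for inference_id, value in feedback[metric]:
        variant = inferences[inference_id]["variant"]
        totals[variant][0] += bool(value)
        totals[variant][1] += 1
    return {v: s / n for v, (s, n) in totals.items()}

# Simulate traffic over two prompt variants.
for variant, ok in [("prompt_v1", True), ("prompt_v1", False), ("prompt_v2", True)]:
    iid = record_inference(variant, "...")
    log_feedback("task_success", iid, ok)

print(success_rate_by_variant("task_success"))
# -> {'prompt_v1': 0.5, 'prompt_v2': 1.0}
```

Once feedback is tied to inference IDs like this, comparing variants, fine-tuning on successful outputs, or seeding DICL examples all become straightforward queries over the same data.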