TensorZero establishes a continuous feedback loop for LLM optimization, converting production data into models that are more intelligent, efficient, and cost-effective. The platform integrates a high-performance model gateway with automated metrics collection to optimize prompts, models, and inference strategies, ensuring LLM performance improves alongside real-world usage.
TensorZero provides the infrastructure needed to build and scale production-grade LLM applications. It unifies gateway logic, observability, optimization, and evaluation into a single "data-and-learning flywheel."
Inference – A unified API compatible with any LLM, maintaining P99 latency overhead under 1ms.
Observability – Automatically persists inference data and feedback into a centralized database.
Optimization – A complete pipeline supporting everything from prompt engineering to supervised fine-tuning (SFT) and reinforcement learning (RL).
Evaluation – Tools to systematically compare prompts, models, and complex inference strategies.
Experiments – Native support for A/B testing, dynamic routing, and automated failover.
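To make the A/B testing idea concrete, here is a minimal sketch of weighted traffic splitting between two variants. It is an illustration of the general technique, not TensorZero's implementation; the variant names, the weights, and the `pick_variant` helper are all hypothetical.

```python
import random

# Hypothetical variant names and traffic split, for illustration only.
VARIANT_WEIGHTS = {"baseline_prompt": 0.9, "candidate_prompt": 0.1}


def pick_variant(weights: dict[str, float], rng: random.Random) -> str:
    """Sample one variant name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded so the sketch is reproducible
counts = {name: 0 for name in VARIANT_WEIGHTS}
for _ in range(10_000):
    counts[pick_variant(VARIANT_WEIGHTS, rng)] += 1

# Roughly 90% of requests should land on the baseline variant.
print(counts)
```

In a real deployment the split would live in configuration rather than in code, so that changing the traffic share is a reviewable config change.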
TensorZero is designed for complex LLM applications that require industrial-strength reliability: low latency, high throughput, type safety, and self-hosting capabilities. By unifying the LLMOps stack, it enables GitOps-driven workflows and produces compound performance gains through structured iteration.
Once integrated, TensorZero serves as a single entry point for every major LLM provider. Native support includes:
• Anthropic, AWS Bedrock, AWS SageMaker
• Azure OpenAI Service, DeepSeek, Fireworks
• GCP Vertex AI (Anthropic and Gemini)
• Google AI Studio (Gemini API), Hyperbolic, Mistral
• OpenAI, Together, vLLM, xAI
Any provider offering an OpenAI-compatible API (such as Ollama) is also supported. The gateway provides several advanced features:
• Automated retries, failover, and inference-time optimization
• Prompt templating, schema enforcement, and batch inference
• A/B testing and configuration-as-code (GitOps)
• Multimodal inference (VLM) and inference caching
• Multi-step LLM workflows (Episodes) and integrated feedback collection
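The retry and failover behavior listed above can be pictured with a small, generic sketch. This is not the gateway's actual logic, which runs inside the Rust service; `call_with_failover` and the two stub providers are hypothetical names used only to show the control flow.

```python
import time
from typing import Callable, Sequence


def call_with_failover(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries_per_provider: int = 2,
    backoff_seconds: float = 0.0,
) -> str:
    """Try each provider in order, retrying a few times before failing over."""
    last_error = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                # Exponential backoff between attempts (zero by default here).
                time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error


# Stub providers standing in for real model endpoints (assumptions).
primary_calls = []


def flaky_primary(prompt: str) -> str:
    primary_calls.append(prompt)
    raise ConnectionError("primary unavailable")


def stable_secondary(prompt: str) -> str:
    return f"secondary: {prompt}"


result = call_with_failover([flaky_primary, stable_secondary], "hi")
print(result)
```

The primary is attempted twice before the call falls through to the secondary, which is the ordering a caller would usually want when the primary is the cheaper or preferred model.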
Written in Rust, the TensorZero gateway is built for high-scale environments, sustaining under 1ms P99 latency overhead at 10,000 QPS. Developers can interact with the gateway through several methods.
Python client (recommended)
pip install tensorzero
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_embedded(clickhouse_url="...", config_file="...") as client:
    response = client.inference(
        model_name="openai::gpt-4o-mini",
        input={
            "messages": [
                {"role": "user", "content": "Write a haiku about artificial intelligence."}
            ]
        },
    )
OpenAI Python client
pip install tensorzero
from openai import OpenAI
from tensorzero import patch_openai_client

client = OpenAI()

patch_openai_client(
    client,
    clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
    config_file="config/tensorzero.toml",
)

response = client.chat.completions.create(
    model="tensorzero::model_name::openai::gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about artificial intelligence."}
    ],
)
JavaScript / TypeScript (Node) OpenAI client
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/openai/v1",
});

const response = await client.chat.completions.create({
  model: "tensorzero::model_name::openai::gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: "Write a haiku about artificial intelligence.",
    },
  ],
});
Other languages and platforms (HTTP API)
curl -X POST "http://localhost:3000/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "openai::gpt-4o-mini",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about artificial intelligence."
        }
      ]
    }
  }'
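For readers who want to call the HTTP API from Python without any client library, the curl request above can be mirrored with the standard library. This is a sketch that only builds the request; the URL and payload are copied from the curl example, while `build_inference_request` is a hypothetical helper name, and sending the request assumes a gateway is running locally.

```python
import json
import urllib.request

# Same endpoint as the curl example; assumes a locally running gateway.
GATEWAY_URL = "http://localhost:3000/inference"


def build_inference_request(model_name: str, prompt: str) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {
        "model_name": model_name,
        "input": {"messages": [{"role": "user", "content": prompt}]},
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_inference_request(
    "openai::gpt-4o-mini", "Write a haiku about artificial intelligence."
)

# Sending is left commented out because it needs a live gateway:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```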
TensorZero allows you to ingest production metrics and human feedback, either through the UI or programmatically, to refine your LLM implementation.
Model Optimization
• Fine-tune closed and open-source models using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
• Access an integrated SFT interface and DPO-ready Jupyter notebooks.
Inference-Time Optimization
Prompt Optimization
• Implement research-backed techniques such as MIPROv2.
• Integrate with automated prompt engineering frameworks like DSPy.
The open-source TensorZero UI provides tools to debug individual API calls or monitor performance trends across different models and prompts.
Evaluate prompts, models, and inference strategies using TensorZero’s suite of testing tools. The platform supports both heuristic-based evaluation and LLM-as-a-judge workflows.
docker compose run --rm evaluations \
  --evaluation-name extract_data \
  --dataset-name hard_test_cases \
  --variant-name gpt_4o \
  --concurrency 5
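As a rough picture of what a heuristic-based evaluation does, the sketch below scores two stand-in variants against a tiny dataset using exact match. It is independent of TensorZero's evaluation tooling; the dataset, the two variant functions, and the `evaluate` helper are hypothetical.

```python
from statistics import mean
from typing import Callable

# Hypothetical (input, reference) pairs for a "extract the first name" task.
DATASET = [
    ("Alice went home", "Alice"),
    ("Bob likes tea", "Bob"),
    ("  Carol sings", "Carol"),
]


def exact_match(output: str, reference: str) -> float:
    """Heuristic metric: 1.0 if the outputs match ignoring case and padding."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def evaluate(variant: Callable[[str], str], dataset) -> float:
    """Average the heuristic score of one variant over the whole dataset."""
    return mean(exact_match(variant(text), ref) for text, ref in dataset)


# Two stand-in "variants"; real ones would call a model.
def variant_whitespace_aware(text: str) -> str:
    return text.split()[0]


def variant_naive(text: str) -> str:
    return text.split(" ")[0]  # breaks on leading spaces


score_a = evaluate(variant_whitespace_aware, DATASET)
score_b = evaluate(variant_naive, DATASET)
print(score_a, score_b)
```

An LLM-as-a-judge workflow has the same shape, with the `exact_match` function replaced by a model call that returns a score.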
The gateway functions as a high-performance entry point with a unified API, managing structured inference via schemas while capturing downstream feedback. All data is stored in a ClickHouse data warehouse under your control. TensorZero "recipes" then allow you to optimize prompts and models based on this structured dataset. Combined with the gateway's experimentation features (such as A/B testing) and GitOps orchestration, this lets teams iterate and deploy improvements rapidly.
The quickstart guide provides a five-minute path from a basic OpenAI wrapper to a production-ready LLM application featuring full observability and fine-tuning. Detailed tutorials are available for building chatbots, email assistants, weather-based RAG systems, and structured data extraction pipelines.
Optimizing Data Extraction (NER)
Learn how to refine a Named Entity Recognition pipeline using fine-tuning and Dynamic In-Context Learning (DICL). This approach can enable a GPT-4o Mini model to outperform GPT-4o on specific tasks while reducing cost and latency.
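The core idea behind Dynamic In-Context Learning is to retrieve the stored examples most similar to the incoming input and place them in the prompt as demonstrations. The sketch below shows that retrieval step under simplifying assumptions: a bag-of-words `embed` function stands in for a real embedding model, and the example pairs and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stored (input, output) demonstrations.
EXAMPLES = [
    ("Alice flew to Paris", "PERSON: Alice; LOC: Paris"),
    ("The invoice is due on Friday", "NONE"),
    ("Bob flew to Rome", "PERSON: Bob; LOC: Rome"),
]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_dicl_messages(query: str, examples, k: int = 2) -> list[dict]:
    """Prepend the k most similar examples as prior user/assistant turns."""
    q = embed(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    messages = []
    for example_input, example_output in ranked[:k]:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


messages = build_dicl_messages("Carol flew to Madrid", EXAMPLES)
for message in messages:
    print(message)
```

Because the demonstrations are chosen per request, the prompt adapts to each input without retraining the model.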
Agentic RAG – Multi-Hop Q&A
Develop a retrieval agent capable of performing multi-hop searches across sources like Wikipedia. The agent autonomously determines when it has gathered sufficient context to resolve complex queries.
Preference-Aligned Content Generation
Fine-tune GPT-4o Mini to generate creative content, such as haikus, that aligns with specific stylistic preferences. This demonstrates the data flywheel in action: superior variants generate better training data, which in turn creates even more refined variants.
Boosting Logic with Best-of-N Sampling
Enhance an LLM's strategic reasoning (for example, its chess performance) by generating multiple move options and selecting the most promising one. This technique can improve output quality without changing the underlying model.
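Best-of-N sampling reduces to "generate several candidates, score each, keep the best". The following sketch shows that loop with stand-in components; the move list, its values, and the `generate` and `score` functions are hypothetical placeholders for a model call and an evaluator.

```python
import random
from typing import Callable

# Hypothetical candidate moves with made-up quality scores.
MOVE_VALUES = {"e4": 0.6, "a3": 0.1, "Nf3": 0.7, "h4": 0.05}

rng = random.Random(0)  # seeded so the sketch is reproducible


def generate(prompt: str) -> str:
    """Stand-in for one sampled model response."""
    return rng.choice(list(MOVE_VALUES))


def score(move: str) -> float:
    """Stand-in for a judge or evaluator that rates a candidate."""
    return MOVE_VALUES[move]


def best_of_n(
    generate_fn: Callable[[str], str],
    score_fn: Callable[[str], float],
    prompt: str,
    n: int = 5,
) -> str:
    """Sample n candidates and return the one with the highest score."""
    candidates = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=score_fn)


best = best_of_n(generate, score, "Suggest an opening move.", n=8)
print(best)
```

The cost of the technique is n generation calls plus scoring per request, so n is a direct trade-off between quality and latency or spend.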
Custom DSPy Optimization Recipes
Use external tools like DSPy to optimize TensorZero functions, allowing for custom optimization workflows tailored to specific mathematical or logical reasoning tasks.
TensorZero was developed by a team with deep expertise in systems and machine learning, including former Rust compiler maintainers, ML researchers, and veteran startup executives. The platform is designed to help engineers build and manage LLM applications that learn and improve from real-world experience.