TensorZero establishes a continuous feedback loop for LLM optimization, converting production data into models that are more intelligent, efficient, and cost-effective. The platform integrates a high-performance model gateway with automated metrics collection to optimize prompts, models, and inference strategies, ensuring LLM performance improves alongside real-world usage.
TensorZero provides the infrastructure needed to build and scale production-grade LLM applications. It unifies gateway logic, observability, optimization, and evaluation into a single "data-and-learning flywheel."
Inference – A unified API compatible with any LLM, maintaining P99 latency overhead under 1ms.
Observability – Automatically persists inference data and feedback into a centralized database.
Optimization – A complete pipeline supporting everything from prompt engineering to supervised fine-tuning (SFT) and reinforcement learning (RL).
Evaluation – Tools to systematically compare prompts, models, and complex inference strategies.
Experiments – Native support for A/B testing, dynamic routing, and automated failover.
TensorZero is designed for complex LLM applications that require industrial-strength reliability: low latency, high throughput, type safety, and self-hosting capabilities. By unifying the LLMOps stack, it enables GitOps-driven workflows and produces compound performance gains through structured iteration.
Once integrated, TensorZero serves as a single entry point for every major LLM provider. Native support includes:
• Anthropic, AWS Bedrock, AWS SageMaker
• Azure OpenAI Service, DeepSeek, Fireworks
• GCP Vertex AI (Anthropic and Gemini)
• Google AI Studio (Gemini API), Hyperbolic, Mistral
• OpenAI, Together, vLLM, xAI
Any provider offering an OpenAI-compatible API (such as Ollama) is also supported. The gateway provides several advanced features:
• Automated retries, failover, and inference-time optimization
• Prompt templating, schema enforcement, and batch inference
• A/B testing and configuration-as-code (GitOps)
• Multimodal inference (VLM) and inference caching
• Multi-step LLM workflows (Episodes) and integrated feedback collection
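To make the retry and failover behavior listed above more concrete, here is a minimal sketch in plain Python. It is not the gateway's implementation: `call_with_failover`, `flaky_primary`, and `healthy_fallback` are hypothetical names, and the provider calls are stubs standing in for real LLM requests.

```python
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider has exhausted its retries."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
    max_retries: int = 2,
    backoff_seconds: float = 0.0,
) -> tuple[str, str]:
    """Try each provider in order, retrying failures before moving on.

    Returns (provider_name, response_text).
    """
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(1, max_retries + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real gateway would retry only transient errors
                errors.append(f"{name} attempt {attempt}: {exc}")
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider calls.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def healthy_fallback(prompt: str) -> str:
    return f"echo: {prompt}"


provider_used, text = call_with_failover(
    [("primary", flaky_primary), ("fallback", healthy_fallback)],
    "Write a haiku about artificial intelligence.",
)
```

Here the primary provider fails on both attempts, so the request is served by the fallback. The caller sees one successful response plus the name of the provider that produced it.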
Written in Rust, the TensorZero gateway is built for high-scale environments, sustaining under 1ms P99 latency overhead at 10,000 QPS. Developers can interact with the gateway through several methods.
Python client (recommended)
pip install tensorzero
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_embedded(clickhouse_url="...", config_file="...") as client:
    response = client.inference(
        model_name="openai::gpt-4o-mini",
        input={
            "messages": [
                {"role": "user", "content": "Write a haiku about artificial intelligence."}
            ]
        },
    )
OpenAI Python client
pip install tensorzero
from openai import OpenAI
from tensorzero import patch_openai_client

client = OpenAI()

patch_openai_client(
    client,
    clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
    config_file="config/tensorzero.toml",
)

response = client.chat.completions.create(
    model="tensorzero::model_name::openai::gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about artificial intelligence."}
    ],
)
JavaScript / TypeScript (Node) OpenAI client
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/openai/v1",
});

const response = await client.chat.completions.create({
  model: "tensorzero::model_name::openai::gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: "Write a haiku about artificial intelligence.",
    },
  ],
});
Other languages and platforms (HTTP API)
curl -X POST "http://localhost:3000/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "openai::gpt-4o-mini",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about artificial intelligence."
        }
      ]
    }
  }'
TensorZero allows you to ingest production metrics and human feedback—either via the UI or programmatically—to refine your LLM implementation.
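As a rough illustration of sending feedback programmatically, the sketch below builds (but does not send) an HTTP request that attaches a metric value to an earlier inference. The `/feedback` path, the `inference_id`, `metric_name`, and `value` fields, and the `haiku_rating` metric are assumptions made for this example and should be checked against the TensorZero API reference. `build_feedback_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_feedback_request(
    gateway_url: str, inference_id: str, metric_name: str, value
) -> urllib.request.Request:
    """Build a POST request carrying one feedback value for one inference."""
    payload = {
        "inference_id": inference_id,  # assumed field name
        "metric_name": metric_name,    # assumed field name
        "value": value,                # e.g. a boolean or a float
    }
    return urllib.request.Request(
        url=f"{gateway_url}/feedback",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_feedback_request(
    "http://localhost:3000",
    inference_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    metric_name="haiku_rating",  # hypothetical metric
    value=True,
)
# urllib.request.urlopen(request) would send it to a running gateway.
```

Keeping the payload construction separate from the network call makes it easy to test the feedback logic without a running gateway.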
Model Optimization
• Fine-tune closed and open-source models using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
• Access an integrated SFT interface and DPO-ready Jupyter notebooks.
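Inference-Time Optimization
• Improve outputs without retraining by using techniques such as Best-of-N sampling and Dynamic In-Context Learning (DICL).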
Prompt Optimization
• Implement research-backed techniques such as MIPROv2.
• Integrate with automated prompt engineering frameworks like DSPy.
The open-source TensorZero UI provides tools to debug individual API calls or monitor performance trends across different models and prompts.
Evaluate prompts, models, and inference strategies using TensorZero’s suite of testing tools. The platform supports both heuristic-based evaluation and LLM-as-a-judge workflows.
docker compose run --rm evaluations \
  --evaluation-name extract_data \
  --dataset-name hard_test_cases \
  --variant-name gpt_4o \
  --concurrency 5
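A heuristic-based evaluation can be as simple as scoring each variant's outputs against reference answers. The sketch below uses a case-insensitive exact-match metric on a tiny made-up dataset. The function names, variant names, and data are illustrative assumptions and are not part of TensorZero's evaluation tooling.

```python
from statistics import mean


def exact_match(prediction: str, reference: str) -> float:
    """Heuristic evaluator: 1.0 if the strings match ignoring case and whitespace."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def evaluate(
    variant_outputs: dict[str, list[str]], references: list[str]
) -> dict[str, float]:
    """Average the heuristic score of every variant over the dataset."""
    return {
        name: mean(exact_match(p, r) for p, r in zip(outputs, references))
        for name, outputs in variant_outputs.items()
    }


# Made-up dataset and variant outputs for illustration.
references = ["Paris", "Berlin", "Madrid", "Rome"]
variant_outputs = {
    "variant_a": ["Paris", "berlin ", "Lisbon", "Rome"],
    "variant_b": ["Paris", "Bonn", "Lisbon", "Rome"],
}

scores = evaluate(variant_outputs, references)
best_variant = max(scores, key=scores.get)
```

An LLM-as-a-judge workflow has the same shape, with `exact_match` replaced by a call to a judging model that returns a score.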
The gateway functions as a high-performance entry point with a unified API, managing structured inference via schemas while capturing downstream feedback. All data is stored in a ClickHouse warehouse under your control. TensorZero "recipes" then allow you to optimize prompts and models based on this structured dataset. By combining built-in experimentation features such as A/B testing with GitOps orchestration, teams can iterate and deploy improvements rapidly.
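A/B testing between variants can be pictured as weighted sampling. The sketch below hashes an episode ID so that the same episode always lands on the same variant. The weights, variant names, and hashing scheme are assumptions chosen for illustration and do not describe how the gateway itself assigns variants.

```python
import hashlib


def pick_variant(weights: dict[str, float], episode_id: str) -> str:
    """Deterministically map an episode ID to a variant, proportionally to weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    # Turn the hash into a stable point in [0, total).
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge


# Hypothetical experiment: 90% of traffic to the baseline, 10% to a candidate.
WEIGHTS = {"baseline": 0.9, "candidate": 0.1}
assignment = pick_variant(WEIGHTS, "episode-42")
```

Hashing instead of calling a random number generator keeps multi-step workflows on a single variant, which makes the collected feedback easier to attribute.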
The quickstart guide provides a five-minute path from a basic OpenAI wrapper to a production-ready LLM application featuring full observability and fine-tuning. Detailed tutorials are available for building chatbots, email assistants, weather-based RAG systems, and structured data extraction pipelines.
Optimizing Data Extraction (NER)
Learn how to refine a Named Entity Recognition pipeline using fine-tuning and Dynamic In-Context Learning (DICL). This approach can enable a GPT-4o Mini model to outperform GPT-4o on specific tasks while reducing cost and latency.
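The core idea of dynamic in-context learning is to retrieve the stored examples most similar to the incoming input and place them in the prompt as demonstrations. The sketch below shows that idea with a toy bag-of-words similarity. A real setup would use an embedding model, and every name and example here is a hypothetical stand-in, not TensorZero's DICL implementation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_examples(query: str, examples: list[tuple[str, str]], k: int):
    """Return the k stored (input, output) pairs most similar to the query."""
    q = embed(query)
    return sorted(examples, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)[:k]


def build_messages(query: str, examples: list[tuple[str, str]], k: int = 2):
    """Prepend the retrieved examples as demonstrations, then add the real query."""
    messages = []
    for text, entities in select_examples(query, examples, k):
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": entities})
    messages.append({"role": "user", "content": query})
    return messages


# Made-up examples of previously successful extractions.
EXAMPLES = [
    ("Alice flew to Paris", '{"person": ["Alice"], "location": ["Paris"]}'),
    ("The stock market fell sharply", '{"person": [], "location": []}'),
    ("Bob flew to Tokyo", '{"person": ["Bob"], "location": ["Tokyo"]}'),
]

messages = build_messages("Carol flew to Berlin", EXAMPLES, k=2)
```

Because the demonstrations are chosen per request, a smaller model sees examples that closely resemble the input it is about to process.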
Agentic RAG – Multi-Hop Q&A
Develop a retrieval agent capable of performing multi-hop searches across sources like Wikipedia. The agent autonomously determines when it has gathered sufficient context to resolve complex queries.
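The control flow of such an agent can be sketched as a loop that alternates between searching and deciding whether to answer. In the sketch below, `search` looks up a tiny fictional corpus and `decide` is a hard-coded policy. Both are hypothetical stand-ins for a real retrieval tool and an LLM call.

```python
from typing import Callable, Optional

# Fictional corpus standing in for a real search backend.
CORPUS = {
    "capital of Freedonia": "Sylvania City",
    "mayor of Sylvania City": "Rufus Firefly",
}


def search(query: str) -> str:
    return CORPUS.get(query, "")


def decide(question: str, context: list[str]) -> dict:
    """Toy policy: hop once more after the first result, then answer."""
    if len(context) < 2:
        return {"type": "search", "query": f"mayor of {context[-1]}"}
    return {"type": "answer", "text": context[-1]}


def multi_hop_answer(
    question: str,
    first_query: str,
    search: Callable[[str], str],
    decide: Callable[[str, list[str]], dict],
    max_hops: int = 4,
) -> tuple[Optional[str], list[str]]:
    """Search, let the agent decide, and repeat until it answers or hops run out."""
    context: list[str] = []
    query = first_query
    for _ in range(max_hops):
        context.append(search(query))
        action = decide(question, context)
        if action["type"] == "answer":
            return action["text"], context
        query = action["query"]
    return None, context


answer, hops = multi_hop_answer(
    "Who is the mayor of the capital of Freedonia?",
    "capital of Freedonia",
    search,
    decide,
)
```

The `max_hops` limit is a safeguard so that an agent that never decides to answer still terminates.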
Preference-Aligned Content Generation
Fine-tune GPT-4o Mini to generate creative content, such as haikus, that aligns with specific stylistic preferences. This demonstrates the data flywheel in action: superior variants generate better training data, which in turn creates even more refined variants.
Boosting Logic with Best-of-N Sampling
Enhance an LLM's strategic reasoning (for example, its chess performance) by generating multiple move options and selecting the most promising one. This technique can significantly improve output quality without changing the underlying model.
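Best-of-N sampling boils down to generating several candidates and keeping the one a scorer prefers. The sketch below uses a fixed list of moves and a toy scoring rule. In practice `generate_move` would sample from an LLM and `score_move` would typically be a judging model, so both are hypothetical stand-ins.

```python
from typing import Callable


def best_of_n(
    generate: Callable[[int], str], score: Callable[[str], float], n: int
) -> str:
    """Generate n candidates and return the highest-scoring one."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-ins for LLM sampling and for a judge.
MOVES = ["a3", "Nf3", "Qxf7#", "e4"]


def generate_move(sample_index: int) -> str:
    return MOVES[sample_index % len(MOVES)]


def score_move(move: str) -> float:
    """Toy heuristic: checkmate beats a capture, which beats a quiet move."""
    if move.endswith("#"):
        return 2.0
    if "x" in move:
        return 1.0
    return 0.0


single_sample = best_of_n(generate_move, score_move, n=1)
best_of_four = best_of_n(generate_move, score_move, n=4)
```

With one sample the caller is stuck with whatever came first. With four, the scorer can surface the strongest candidate, at the cost of extra inference calls.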
Custom DSPy Optimization Recipes
Use external tools like DSPy to optimize TensorZero functions, allowing for custom optimization workflows tailored to specific mathematical or logical reasoning tasks.
TensorZero was developed by a team with deep expertise in systems and machine learning, including former Rust compiler maintainers, ML researchers, and veteran startup executives. The platform is designed to help engineers build and manage LLM applications that learn and improve from real-world experience.