TensorZero: Optimize LLM Applications with Production Feedback

June 13. Published in AI Models

TensorZero establishes a continuous feedback loop for LLM optimization, converting production data into models that are more intelligent, efficient, and cost-effective. The platform integrates a high-performance model gateway with automated metrics collection to optimize prompts, models, and inference strategies, ensuring LLM performance improves alongside real-world usage.

TensorZero provides the infrastructure needed to build and scale production-grade LLM applications. It unifies gateway logic, observability, optimization, and evaluation into a single "data-and-learning flywheel."

Inference – A unified API compatible with any LLM, maintaining P99 latency overhead under 1ms.

Observability – Automatically persists inference data and feedback into a centralized database.

Optimization – A complete pipeline supporting everything from prompt engineering to supervised fine-tuning (SFT) and reinforcement learning (RL).

Evaluation – Tools to systematically compare prompts, models, and complex inference strategies.

Experiments – Native support for A/B testing, dynamic routing, and automated failover.

TensorZero is designed for complex LLM applications that require industrial-strength reliability: low latency, high throughput, type safety, and self-hosting capabilities. By unifying the LLMOps stack, it enables GitOps-driven workflows and produces compound performance gains through structured iteration.

LLM Gateway

Once integrated, TensorZero serves as a single entry point for every major LLM provider. Native support includes:

  • Anthropic, AWS Bedrock, AWS SageMaker
  • Azure OpenAI Service, DeepSeek, Fireworks
  • GCP Vertex AI (Anthropic and Gemini)
  • Google AI Studio (Gemini API), Hyperbolic, Mistral
  • OpenAI, Together, vLLM, xAI

Any provider offering an OpenAI-compatible API (such as Ollama) is also supported. The gateway provides several advanced features:

  • Automated retries, failover, and inference-time optimization
  • Prompt templating, schema enforcement, and batch inference
  • A/B testing and configuration-as-code (GitOps)
  • Multimodal inference (VLM) and inference caching
  • Multi-step LLM workflows (Episodes) and integrated feedback collection
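
To make the retry and failover behavior concrete, here is a minimal Python sketch of the general pattern. It is not TensorZero's implementation (the gateway handles this internally and is configured declaratively); `call_with_failover` and the two provider functions are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_failover(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_retries: int = 2,
    backoff_seconds: float = 0.0,
) -> str:
    """Try each provider in order, retrying each one before moving to the next."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(max_retries + 1):
            try:
                return provider(prompt)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                # Exponential backoff between attempts (no wait when set to 0).
                time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error


# Hypothetical stand-ins for real provider clients.
def flaky_provider(prompt: str) -> str:
    raise ConnectionError("provider unavailable")


def healthy_provider(prompt: str) -> str:
    return f"echo: {prompt}"


result = call_with_failover([flaky_provider, healthy_provider], "hello")
print(result)
```

The first provider is retried and then abandoned, and the request falls through to the second one, so the caller sees a single successful response.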

Written in Rust, the TensorZero gateway is built for high-scale environments, sustaining under 1ms P99 latency overhead at 10,000 QPS. Developers can interact with the gateway through several methods.

Python client (recommended)

pip install tensorzero

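from tensorzero import TensorZeroGateway

# Run the gateway embedded in this Python process. The "..." values are
# placeholders for your ClickHouse URL and your TensorZero config file path.
with TensorZeroGateway.build_embedded(clickhouse_url="...", config_file="...") as client:
    response = client.inference(
        model_name="openai::gpt-4o-mini",  # provider and model, joined by "::"
        input={
            "messages": [
                {"role": "user", "content": "Write a haiku about artificial intelligence."}
            ]
        },
    )

print(response)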

OpenAI Python client

pip install tensorzero

from openai import OpenAI
from tensorzero import patch_openai_client

client = OpenAI()
patch_openai_client(
    client,
    clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
    config_file="config/tensorzero.toml",
)

response = client.chat.completions.create(
    model="tensorzero::model_name::openai::gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about artificial intelligence."}
    ],
)

JavaScript / TypeScript (Node) OpenAI client

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/openai/v1",
});

const response = await client.chat.completions.create({
  model: "tensorzero::model_name::openai::gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: "Write a haiku about artificial intelligence.",
    },
  ],
});

Other languages and platforms (HTTP API)

curl -X POST "http://localhost:3000/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "openai::gpt-4o-mini",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about artificial intelligence."
        }
      ]
    }
  }'

LLM Optimization

TensorZero lets you send production metrics and human feedback, either through the UI or programmatically, and use them to refine your LLM implementation.
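
As a rough illustration of programmatic feedback, the sketch below builds an HTTP request that attaches a metric value to an earlier inference. It assumes the gateway exposes a `/feedback` endpoint accepting `inference_id`, `metric_name`, and `value` fields; the metric name `haiku_rating` and the all-zero ID are hypothetical placeholders, so check the API reference before relying on the exact shape.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:3000"  # same gateway address as the examples above


def build_feedback_request(inference_id: str, metric_name: str, value) -> urllib.request.Request:
    """Build (but do not send) a POST request reporting feedback for one inference."""
    payload = {
        "inference_id": inference_id,  # ID returned by the earlier inference call
        "metric_name": metric_name,    # assumed to be a metric defined in your config
        "value": value,
    }
    return urllib.request.Request(
        url=f"{GATEWAY_URL}/feedback",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_feedback_request(
    inference_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    metric_name="haiku_rating",  # hypothetical metric name
    value=True,
)
# With a gateway running, urllib.request.urlopen(request) would send it.
print(request.get_method(), request.full_url)
```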

Model Optimization

  • Fine-tune closed and open-source models using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
  • Access an integrated SFT interface and DPO-ready Jupyter notebooks.

Inference-Time Optimization

  • Best-of-N Sampling: Generates N candidate responses and selects the optimal result using a designated evaluator.
  • Mixture-of-N Sampling: Aggregates multiple candidates into a single, refined response.
  • Dynamic In-Context Learning (DICL): Automatically injects relevant historical examples into the prompt at query time.
  • Chain of Thought (CoT): Structures prompts to encourage intermediate reasoning for complex problem-solving.
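
Best-of-N sampling, the first strategy listed above, can be sketched in a few lines of Python. This is a generic illustration of the technique rather than TensorZero's implementation; `best_of_n`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the toy evaluator simply prefers shorter answers.

```python
import itertools
from typing import Callable, List


def best_of_n(
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    prompt: str,
    n: int = 4,
) -> str:
    """Generate n candidates and return the one the evaluator scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda candidate: evaluate(prompt, candidate))


# Hypothetical stand-in for an LLM: cycles through canned answers.
_canned = itertools.cycle(
    ["a long rambling answer", "short answer", "medium length answer"]
)


def toy_generate(prompt: str) -> str:
    return next(_canned)


# Hypothetical stand-in for an evaluator (in practice, often another LLM call).
def toy_evaluate(prompt: str, candidate: str) -> float:
    return -float(len(candidate))


best = best_of_n(toy_generate, toy_evaluate, "Summarize the report.", n=3)
print(best)
```

The cost of this approach grows linearly with N, since every candidate is a full model call plus an evaluation, which is the trade-off against fine-tuning a better single-shot model.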

Prompt Optimization

  • Implement research-backed techniques such as MIPROv2.
  • Integrate with automated prompt engineering frameworks like DSPy.

LLM Observability

The open-source TensorZero UI provides tools to debug individual API calls or monitor performance trends across different models and prompts.

  • Inference Observability: Inspect inputs, outputs, and processing times for every request.
  • Function Observability: Monitor success rates and performance metrics for specific model variants over time.

LLM Evaluation

Evaluate prompts, models, and inference strategies using TensorZero’s suite of testing tools. The platform supports both heuristic-based evaluation and LLM-as-a-judge workflows.

  • UI Evaluations: Configure test suites and analyze results through a visual dashboard.
  • CLI Evaluations: Execute high-volume batch evaluations from the terminal.

docker compose run --rm evaluations \
  --evaluation-name extract_data \
  --dataset-name hard_test_cases \
  --variant-name gpt_4o \
  --concurrency 5
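
To show what a heuristic evaluation computes, here is a small Python sketch that scores two variants on a dataset by exact match. It is a conceptual illustration only and does not use the TensorZero evaluation tooling; the dataset, the variant names, and both helper functions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (input, expected output) pairs


def exact_match_rate(variant: Callable[[str], str], dataset: Dataset) -> float:
    """Fraction of datapoints where the variant's output equals the expected output."""
    if not dataset:
        return 0.0
    hits = sum(1 for text, expected in dataset if variant(text).strip() == expected)
    return hits / len(dataset)


def compare_variants(
    variants: Dict[str, Callable[[str], str]], dataset: Dataset
) -> Dict[str, float]:
    """Score every variant on the same dataset so the results are comparable."""
    return {name: exact_match_rate(fn, dataset) for name, fn in variants.items()}


# Hypothetical dataset and variants standing in for real model calls.
dataset: Dataset = [("2+2", "4"), ("3+3", "6"), ("10-7", "3"), ("5*5", "25")]
known_answers = {"2+2": "4", "3+3": "6", "10-7": "3"}
variants = {
    "baseline": lambda text: "4",
    "candidate": lambda text: known_answers.get(text, "unknown"),
}

scores = compare_variants(variants, dataset)
print(scores)
```

An LLM-as-a-judge evaluator would keep the same structure and replace the equality check with a model call that returns a score.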

The TensorZero Flywheel

The gateway functions as a high-performance entry point with a unified API, managing structured inference via schemas while capturing downstream feedback. All data is stored in a ClickHouse warehouse under your control. TensorZero "recipes" then allow you to optimize prompts and models based on this structured dataset. By combining these experimentation features with GitOps orchestration, teams can iterate and deploy improvements rapidly.
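
The experimentation side of the flywheel relies on splitting traffic between variants. The Python sketch below shows one common way to do weighted A/B assignment; it is an assumption-laden illustration, not TensorZero's sampling logic. The variant names and `choose_variant` are hypothetical, and hashing the episode ID is a design choice made here so that every step of a multi-step workflow stays on the same variant.

```python
import hashlib
from typing import Dict


def choose_variant(weights: Dict[str, float], episode_id: str) -> str:
    """Pick a variant with probability proportional to its weight, stable per episode."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    # Map the episode ID to a point that is approximately uniform in [0, total).
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge


# Hypothetical variant names and traffic split.
weights = {"baseline_prompt": 0.9, "fine_tuned_model": 0.1}
assignments = [choose_variant(weights, f"episode-{i}") for i in range(1000)]
baseline_share = assignments.count("baseline_prompt") / len(assignments)
print(f"baseline share: {baseline_share:.2f}")
```

In a GitOps workflow, the weights would live in version-controlled configuration, so changing the traffic split becomes a reviewed commit rather than an ad hoc code change.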

Getting Started

The quickstart guide provides a five-minute path from a basic OpenAI wrapper to a production-ready LLM application featuring full observability and fine-tuning. Detailed tutorials are available for building chatbots, email assistants, weather-based RAG systems, and structured data extraction pipelines.

Example Use Cases

Optimizing Data Extraction (NER)

Learn how to refine a Named Entity Recognition pipeline using fine-tuning and Dynamic In-Context Learning (DICL). This approach can enable a GPT-4o Mini model to outperform GPT-4o on specific tasks while reducing cost and latency.
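
The idea behind DICL can be sketched as follows: retrieve the historical examples most similar to the incoming query and place them in the prompt as prior turns. This Python sketch is a simplified illustration, not the TensorZero implementation. It assumes a toy bag-of-words embedding in place of a real embedding model, and the function names and example data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_dicl_messages(
    query: str, examples: List[Tuple[str, str]], k: int = 2
) -> List[Dict[str, str]]:
    """Prepend the k most similar (input, output) examples to the query as chat turns."""
    query_vec = embed(query)
    ranked = sorted(
        examples, key=lambda ex: cosine(query_vec, embed(ex[0])), reverse=True
    )
    messages: List[Dict[str, str]] = []
    for example_input, example_output in ranked[:k]:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


# Hypothetical historical examples for an entity-extraction task.
history = [
    ("Ada Lovelace worked in London", '{"person": ["Ada Lovelace"], "location": ["London"]}'),
    ("The weather is nice today", '{"person": [], "location": []}'),
    ("Alan Turing worked in Manchester", '{"person": ["Alan Turing"], "location": ["Manchester"]}'),
]

messages = build_dicl_messages("Grace Hopper worked in Arlington", history, k=2)
for message in messages:
    print(message["role"], ":", message["content"])
```

The two examples that share wording with the query are selected and the unrelated one is left out, which is how a smaller model can be steered by good past outputs at query time.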

Agentic RAG – Multi-Hop Q&A

Develop a retrieval agent capable of performing multi-hop searches across sources like Wikipedia. The agent autonomously determines when it has gathered sufficient context to resolve complex queries.

Preference-Aligned Content Generation

Fine-tune GPT-4o Mini to generate creative content, such as haikus, that aligns with specific stylistic preferences. This demonstrates the data flywheel in action: superior variants generate better training data, which in turn creates even more refined variants.

Boosting Logic with Best-of-N Sampling

Enhance an LLM's strategic reasoning, such as its chess play, by generating multiple candidate moves and selecting the most promising one. This technique can improve output quality without changing the underlying model.

Custom DSPy Optimization Recipes

Use external tools like DSPy to optimize TensorZero functions, allowing for custom optimization workflows tailored to specific mathematical or logical reasoning tasks.

TensorZero was developed by a team with deep expertise in systems and machine learning, including former Rust compiler maintainers, ML researchers, and veteran startup executives. The platform is designed to help engineers build and manage LLM applications that learn and improve from real-world experience.