Helicone AI Gateway: A High-Performance Rust-Powered LLM Proxy

Published July 4 in API Tools

The Helicone AI Gateway is a high-performance, lightweight AI proxy developed by the Helicone team and released as an open-source project. Its minimal footprint and straightforward configuration allow it to manage heavy production throughput with ease.

A single endpoint provides access to more than 100 different models. Built with Rust for maximum efficiency, the gateway remains responsive even when processing millions of LLM requests. It serves a similar role to NGINX, but is engineered specifically for the modern AI infrastructure stack.

Installation & Configuration

  1. Define Environment Variables
    Add your provider credentials to your .env file:
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
  2. Launch Locally
    Execute the following command in your terminal:
npx @helicone/ai-gateway@latest
  3. Execute a Request
    You can use any OpenAI-compatible SDK. Here is an example using Python:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/ai",
    api_key="placeholder-api-key"  # The gateway manages the actual keys securely
)

# Use one interface for any LLM provider. The gateway handles the routing.
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet",  # Compatible with 100+ other models
    messages=[{"role": "user", "content": "Hello from Helicone AI Gateway!"}]
)

This setup requires no new SDK and no additional integration work: any OpenAI-compatible client works once its base URL points at the gateway. The gateway itself is open source.

Why Run Helicone AI Gateway?

Unified Interface
Maintain your existing OpenAI syntax while calling Anthropic, Google, AWS Bedrock, or any of the 20+ supported providers. You can switch models without rewriting your integration logic.

Intelligent Provider Selection
Optimize your requests for speed, cost, or reliability. The gateway supports several load-balancing strategies, including latency-aware P2C (power of two choices) with PeakEWMA, weighted distribution, and cost-based routing. It also tracks provider health and rate limits in real time.
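
To make the latency-aware strategy concrete, here is a minimal sketch of power-of-two-choices selection over a peak-sensitive moving average. It is an illustration only, not the gateway's implementation: the class name, the `decay_seconds` knob, and the load formula are assumptions chosen for clarity.

```python
import math
import random
import time


class PeakEwma:
    """Latency estimate that jumps to new peaks at once and decays slowly."""

    def __init__(self, decay_seconds=10.0, clock=time.monotonic):
        self.decay_seconds = decay_seconds  # hypothetical tuning knob
        self.clock = clock
        self.estimate = 0.0   # smoothed latency, in seconds
        self.in_flight = 0    # requests currently outstanding
        self.last_update = clock()

    def observe(self, latency):
        now = self.clock()
        elapsed = max(now - self.last_update, 0.0)
        self.last_update = now
        if latency > self.estimate:
            # A new peak replaces the estimate immediately.
            self.estimate = latency
        else:
            # Otherwise the old estimate decays toward the new sample.
            weight = math.exp(-elapsed / self.decay_seconds)
            self.estimate = self.estimate * weight + latency * (1.0 - weight)

    def load(self):
        # Penalize providers that already have requests outstanding.
        return self.estimate * (self.in_flight + 1)


def pick_provider(stats, rng=random):
    """Power of two choices: sample two providers, keep the less loaded one."""
    first, second = rng.sample(sorted(stats), 2)
    return first if stats[first].load() <= stats[second].load() else second


stats = {"openai": PeakEwma(), "anthropic": PeakEwma()}
stats["openai"].observe(0.80)     # one slow response
stats["anthropic"].observe(0.25)  # one fast response
print(pick_provider(stats))       # prints "anthropic"
```

Sampling only two candidates keeps each decision cheap while still steering traffic away from slow providers, and reacting to peaks immediately means a single slow response lowers a provider's priority right away.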

Cost Guardrails
Prevent budget overruns and usage abuse with robust rate limiting. You can define caps for specific users, teams, or the entire organization based on request volume, token consumption, or total spend.

Performance Gains
Response caching can reduce latency and API costs by up to 95%. The system supports Redis or S3 backends and includes built-in smart invalidation logic.
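
As a rough illustration of how response caching saves a provider call, the sketch below keys an in-memory cache on a hash of the request body and expires entries after a `max_age` window. The class, the key scheme, and the expiry rule are assumptions for demonstration; the gateway's own key derivation and invalidation logic may differ.

```python
import hashlib
import json
import time


def cache_key(request_body):
    """Hash the canonical JSON form so identical requests share one entry."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    """Hypothetical in-memory cache with a fixed freshness window."""

    def __init__(self, max_age=3600, clock=time.monotonic):
        self.max_age = max_age  # seconds an entry stays fresh
        self.clock = clock
        self.entries = {}       # key -> (stored_at, response)

    def get(self, request_body):
        key = cache_key(request_body)
        hit = self.entries.get(key)
        if hit is None:
            return None
        stored_at, response = hit
        if self.clock() - stored_at > self.max_age:
            del self.entries[key]  # expired: drop it and report a miss
            return None
        return response

    def put(self, request_body, response):
        self.entries[cache_key(request_body)] = (self.clock(), response)
```

A caller would check `get` before forwarding a request and call `put` after a successful provider response, so repeated identical prompts within the freshness window never reach the provider.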

Simplified Observability
The gateway comes pre-configured for Helicone and offers full OpenTelemetry support. This provides immediate access to logs, metrics, and traces for debugging and performance monitoring.

Rapid Deployment
Run the gateway as a single binary or a Docker container on your own infrastructure. You can be live in seconds by following the standard deployment guide.

Production-Ready Throughput

Metric                Helicone AI Gateway    Typical Setup
P95 Latency           <10ms                  ~60-100ms
Memory Usage          ~64MB                  ~512MB
Requests per Second   ~2,000                 ~500
Binary Size           ~15MB                  ~200MB
Cold Start Time       ~100ms                 ~2s

Note: These are preliminary figures. Full benchmark methodology and detailed results are available in benchmarks/README.md.

How It Works

┌────────────┐    ┌─────────────┐    ┌─────────────────┐
│  Your App  │───▶│ Helicone AI │───▶│ LLM Providers   │
│            │    │ Gateway     │    │                 │
│ OpenAI SDK │    │             │    │ • OpenAI        │
│ (any lang) │    │ • Load Bal. │    │ • Anthropic     │
│            │    │ • Rate Limit│    │ • AWS Bedrock   │
│            │    │ • Caching   │    │ • Google Vertex │
│            │    │ • Tracing   │    │ • 20+ more      │
└────────────┘    └─────────────┘    └─────────────────┘
                           │
                           ▼
                   ┌─────────────────┐
                   │ Helicone        │
                   │ Observability   │
                   │                 │
                   │ • Dashboards    │
                   │ • Metrics       │
                   │ • Monitoring    │
                   │ • Debugging     │
                   └─────────────────┘

Custom Configuration

  1. Environment Variables
    Store your provider keys in the .env file:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
HELICONE_API_KEY=sk-...
  2. Configuration File Example
    Below is a sample config.yaml. For a complete list of options, refer to the configuration guide and the supported provider list.
helicone: # Define HELICONE_API_KEY in your .env
  observability: true
  authentication: true

cache-store:
  in-memory: {}

global: # Applied across all routers
  cache:
    directive: "max-age=3600, max-stale=1800"

routers:
  your-router-name: # Specific settings for this router
    load-balance:
      chat:
        strategy: latency
        targets:
          - openai
          - anthropic

    rate-limit:
      per-api-key:
        capacity: 1000
        refill-frequency: 1m # Allows 1000 requests per minute
  3. Launch with Custom Configuration
npx @helicone/ai-gateway@latest --config config.yaml
  4. Execute a Request via the Router
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/router/your-router-name",
    api_key="placeholder-api-key"  # The gateway handles the actual provider keys
)

response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet",  # Or any other supported model
    messages=[{"role": "user", "content": "Hello from Helicone AI Gateway!"}]
)
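
The `capacity` and `refill-frequency` settings in the sample config resemble the parameters of a token bucket. The sketch below shows that idea for a per-API-key limit of 1000 requests per minute. It is an assumption about the general technique, not a description of the gateway's internals, and the `TokenBucket` and `allow_request` names are hypothetical.

```python
import time


class TokenBucket:
    """Bucket that refills continuously up to a fixed capacity."""

    def __init__(self, capacity=1000, refill_seconds=60.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = capacity / refill_seconds  # tokens per second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets = {}  # API key -> TokenBucket


def allow_request(api_key, capacity=1000, refill_seconds=60.0):
    """Return True if this API key still has budget for one more request."""
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(capacity, refill_seconds)
    return buckets[api_key].allow()
```

Under this model a client can burst up to `capacity` requests at once, after which requests are admitted only as fast as the bucket refills.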

Migration Guide

From OpenAI (Python)

from openai import OpenAI

client = OpenAI(
-   api_key=os.getenv("OPENAI_API_KEY")
+   api_key="placeholder-api-key",  # Handled by the gateway
+   base_url="http://localhost:8080/router/your-router-name"
)

# The rest of your code remains exactly the same.
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

From OpenAI (TypeScript)

import { OpenAI } from "openai";

const client = new OpenAI({
-   apiKey: process.env.OPENAI_API_KEY,
+   apiKey: "placeholder-api-key",  // Handled by the gateway
+   baseURL: "http://localhost:8080/router/your-router-name",
});

const response = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Hello from Helicone AI Gateway!" }],
});