OpenAI’s New Open-Weight Models: gpt-oss-120b & 20b

August 7 · Published in AI Models

OpenAI has released two new open-weight language models: gpt-oss-120b and gpt-oss-20b. These models are engineered for complex reasoning, agentic workflows, and a broad spectrum of developer needs.

Model Versions

  • gpt-oss-120b: A production-grade, general-purpose model designed for high-reasoning demands. It is optimized to run on a single H100 GPU, utilizing a Mixture-of-Experts (MoE) architecture with 117B total parameters and 5.1B active parameters.
  • gpt-oss-20b: Optimized for low latency, this model is ideal for local deployments or dedicated hardware setups. It features 21B total parameters with 3.6B active parameters.

Both models were trained using the Harmony response format. Adherence to this specific format is required for the models to function correctly.

Core Features

  • Apache 2.0 License: Provides freedom for experimentation, customization, and commercial use without copyleft restrictions or patent risks.
  • Adjustable Reasoning Effort: Users can toggle between low, medium, or high reasoning settings to balance depth against latency requirements.
  • Full Chain-of-Thought: Access the model’s entire reasoning process to improve debugging and build user trust.
  • Fine-tunable: The architecture supports adaptation to niche use cases and specialized datasets.
  • Agentic Capabilities: Native support for function calling, web browsing, Python code execution, and structured outputs.
  • Native MXFP4 Quantization: MoE layers utilize MXFP4 precision. This efficiency allows the 120b model to fit on a single H100, while the 20b model can operate within 16GB of VRAM.
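A back-of-the-envelope calculation shows why 4-bit MoE weights matter for these memory targets. The sketch below is only an estimate under a simplifying assumption: it treats every parameter as MXFP4 at 4.25 bits per value (a 4-bit element plus one 8-bit scale shared by a 32-element block, as in the OCP Microscaling format). Only the MoE weights are actually quantized, so real footprints are somewhat higher, and activations and the KV cache need memory on top of the weights.

```python
# Rough weight-memory estimate. Assumption: ALL parameters are MXFP4, which
# slightly understates the real footprint (non-MoE tensors stay in BF16).

MXFP4_BITS = 4 + 8 / 32   # 4-bit value + 8-bit scale shared by 32 values
BF16_BITS = 16


def weight_gb(total_params: float, bits_per_param: float) -> float:
    """Return weight storage in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9


for name, params in [("gpt-oss-120b", 117e9), ("gpt-oss-20b", 21e9)]:
    print(f"{name}: ~{weight_gb(params, MXFP4_BITS):.1f} GB in MXFP4, "
          f"~{weight_gb(params, BF16_BITS):.0f} GB in BF16")
```

Under these assumptions the 120b weights come to roughly 62 GB instead of 234 GB in BF16, and the 20b weights to roughly 11 GB instead of 42 GB, which is consistent with the single-80GB-GPU and 16GB figures quoted above.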

Inference Examples

Transformers

These models are compatible with the Hugging Face Transformers library. The included chat template applies the Harmony format automatically. If you invoke model.generate directly, you must manually apply the Harmony format through the chat template or the openai-harmony package.

from transformers import pipeline
import torch

model_id = "openai/gpt-oss-120b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])
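If you prefer to call model.generate yourself, one way to apply the Harmony format is to render the messages through the tokenizer's chat template first. The following is a minimal sketch rather than an official recipe: the function name generate_reply and its defaults are illustrative, and the 20b checkpoint is used only to keep hardware requirements lower.

```python
def generate_reply(messages, model_id="openai/gpt-oss-20b", max_new_tokens=256):
    """Apply the chat template explicitly, then call model.generate."""
    # Imported lazily so this file can be loaded without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # The chat template renders the messages in the Harmony format and
    # appends the prompt for the assistant's turn.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens)
```

Calling generate_reply([{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}]) downloads the weights on first use. Because special tokens are not stripped, the decoded text may include Harmony markup around the reasoning and the final answer.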

vLLM

vLLM recommends uv for Python dependency management. The following commands install vLLM with gpt-oss support and launch an OpenAI-compatible server:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b
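Once the server is running, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library and assumes vLLM's default address (http://localhost:8000) and that the model is served under the name passed to vllm serve. The helper names build_chat_request and ask are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="openai/gpt-oss-20b"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, print(ask("Explain quantum mechanics clearly and concisely.")) prints the reply. If you started the server on a different host or port, pass a matching base_url to build_chat_request.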

PyTorch / Triton / Metal

These specific implementations are provided primarily for educational reference and architectural transparency. They are not recommended for production environments.

Ollama

To run gpt-oss on consumer-grade hardware via Ollama:

# gpt-oss-20b
ollama pull gpt-oss:20b
ollama run gpt-oss:20b

# gpt-oss-120b
ollama pull gpt-oss:120b
ollama run gpt-oss:120b

LM Studio

Models can be retrieved directly within LM Studio:

# gpt-oss-20b
lms get openai/gpt-oss-20b
# gpt-oss-120b
lms get openai/gpt-oss-120b

About This Repository

The repository includes several reference implementations to assist with deployment and development:

Inference Options

  • torch: An unoptimized PyTorch implementation intended for educational purposes. Due to a lack of optimization, it requires at least four H100 GPUs.
  • triton: A more efficient implementation utilizing PyTorch and Triton, featuring CUDA graphs and basic KV caching.
  • metal: A specialized implementation optimized for Apple Silicon hardware.

Tools

  • browser: A reference for the web-browsing tool used during model training.
  • python: A stateless reference implementation for the Python execution tool.

Client Examples

  • chat: A basic terminal-based chat application. It supports PyTorch, Triton, or vLLM backends and integrates the Python and browser tools.
  • responses_api: An example server demonstrating the Responses API, including browser tool integration and other compatible features.

Setup

Requirements

  • Python: version 3.12 or higher
  • macOS: Requires Xcode CLI tools (xcode-select --install)
  • Linux: CUDA is required for GPU acceleration
  • Windows: Not officially tested. Users on Windows are encouraged to use Ollama for local execution.

Installation

Install via PyPI based on your hardware needs:

# Tools only
pip install gpt-oss
# Including Torch implementation (quoted so shells such as zsh do not expand the brackets)
pip install "gpt-oss[torch]"
# Including Triton implementation
pip install "gpt-oss[triton]"

To customize the code or utilize the Metal backend, clone the repository:

git clone https://github.com/openai/gpt-oss.git
cd gpt-oss
GPTOSS_BUILD_METAL=1 pip install -e ".[metal]"

Downloading Models

Use the Hugging Face CLI to retrieve the weights:

# gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir gpt-oss-120b/

# gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/

Reference Implementations

PyTorch Implementation

Located in gpt_oss/torch/model.py, this is a standard reference implementation. It uses basic PyTorch operations to illustrate the model architecture, incorporating light tensor parallelism in the MoE layers to allow the 120b model to run on 4xH100 or 2xH200 configurations. All weights are upcast to BF16.

To run:

pip install -e ".[torch]"
# Example using 4xH100:
torchrun --nproc-per-node=4 -m gpt_oss.generate gpt-oss-120b/original/

Triton Implementation (Single GPU)

This optimized version uses custom Triton MoE kernels with MXFP4 support and attention optimizations to reduce memory overhead. It requires nightly versions of Torch and Triton. This setup enables the gpt-oss-120b model to run on a single 80GB GPU.

Installation:

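# Build Triton from source
git clone https://github.com/triton-lang/triton
cd triton/
pip install -r python/requirements.txt
pip install -e . --verbose --no-build-isolation

# Then, from the root of the gpt-oss repository, install the Triton implementation
pip install -e ".[triton]"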

Execution:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python -m gpt_oss.generate --backend triton gpt-oss-120b/original/

Note: If you hit Out-of-Memory (OOM) errors while loading the checkpoint, make sure the expandable allocator is enabled through the PYTORCH_CUDA_ALLOC_CONF setting shown above.

Metal Implementation

Designed for Apple Silicon, this implementation matches PyTorch’s precision but is not intended for production use.

Installing the [metal] extra on an Apple Silicon device compiles the implementation automatically. Before running inference, convert the SafeTensors weights:

python gpt_oss/metal/scripts/create-local-model.py -s <model_dir> -d <output_file>

Alternatively, download pre-converted weights:

huggingface-cli download openai/gpt-oss-120b --include "metal/*" --local-dir gpt-oss-120b/metal/
huggingface-cli download openai/gpt-oss-20b --include "metal/*" --local-dir gpt-oss-20b/metal/

Testing the setup:

python gpt_oss/metal/examples/generate.py gpt-oss-20b/metal/model.bin -p "Why did the chicken cross the road?"

Harmony Format and Tools

OpenAI has released harmony, a chat format library, alongside these models to handle prompt rendering and response parsing. The repository also includes two system tools for the models: web browsing and a Python execution container.

Clients

Terminal Chat

This application demonstrates the Harmony format across PyTorch, Triton, and vLLM backends. It supports optional Python and browser tool integration.

usage: python -m gpt_oss.chat [-h] [-r REASONING_EFFORT] [-a] [-b] [--show-browser-results] [-p] [--developer-message DEVELOPER_MESSAGE] [-c CONTEXT] [--raw] [--backend {triton,torch,vllm}] FILE

Note: The Torch and Triton backends require raw checkpoints in the /original/ directory, whereas vLLM uses the Hugging Face checkpoints in the model root.

Responses API

A sample Responses API server is provided as a foundation for custom implementations. While it does not cover every event type, it serves as a functional starting point. Several inference partners also provide their own Responses API implementations.

Launch the server by specifying the desired backend:

python -m gpt_oss.responses_api.serve [--checkpoint FILE] [--port PORT] [--inference-backend BACKEND]

Codex

To configure Codex as a gpt-oss client (using the 20b model as an example), modify ~/.codex/config.toml:

disable_response_storage = true
show_reasoning_content = true

[model_providers.local]
name = "local"
base_url = "http://localhost:11434/v1"

[profiles.oss]
model = "gpt-oss:20b"
model_provider = "local"

Then execute:

ollama run gpt-oss:20b
codex -p oss

Tools

Browser

Both models are trained to interact with a browser tool featuring search, open, and find methods. This is enabled by adding tool definitions to the system message of the Harmony prompt. You can use with_browser() for the standard interface or with_tools() for custom configurations.

The implementation (e.g., SimpleBrowserTool with an Exa backend) manages context size via a scrollable text window. It also caches fetched pages so the model can revisit a different part of a page without reloading it. Because of that cache, create a new browser instance for each request.
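The scrollable-window idea can be illustrated with a toy class. This is not the SimpleBrowserTool interface: PageWindow and its methods are hypothetical names, and the sketch only shows how exposing a few numbered lines at a time keeps a long page from flooding the context.

```python
class PageWindow:
    """Toy illustration: expose a long page a few lines at a time."""

    def __init__(self, text: str, window: int = 3):
        self.lines = text.splitlines()
        self.window = window
        self.top = 0  # index of the first visible line

    def view(self) -> str:
        """Return the visible lines, each prefixed with its line number."""
        visible = self.lines[self.top:self.top + self.window]
        return "\n".join(
            f"{self.top + offset}: {line}" for offset, line in enumerate(visible)
        )

    def scroll(self, delta: int) -> str:
        """Move the window by delta lines, clamped to the page bounds."""
        last_top = max(0, len(self.lines) - self.window)
        self.top = min(max(0, self.top + delta), last_top)
        return self.view()

    def find(self, pattern: str) -> list[int]:
        """Return the numbers of lines containing pattern (case-insensitive)."""
        needle = pattern.lower()
        return [n for n, line in enumerate(self.lines) if needle in line.lower()]
```

A model would see only the output of view() or scroll(), and could use find() to decide where to scroll next instead of reading the whole page.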

Python

The models can utilize a Python tool for computational tasks within the chain-of-thought process. While the training environment was stateful, this reference implementation is stateless and runs within a Docker container. Developers should be mindful of prompt injection risks and implement strict container restrictions for production use. Enable this via with_python() or with_tools().
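The practical consequence of a stateless tool is that nothing defined in one call survives into the next. The sketch below demonstrates this with a fresh interpreter process per call. It is an illustration only: the reference implementation uses a Docker container, a plain subprocess is not a sandbox, and run_stateless is a hypothetical helper.

```python
import subprocess
import sys


def run_stateless(code: str, timeout: float = 10.0) -> str:
    """Run code in a fresh interpreter; nothing persists between calls.

    WARNING: a subprocess offers no isolation. Never run untrusted,
    model-generated code this way outside a restricted container.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Return stdout on success, otherwise the error text the model would see.
    return result.stdout if result.returncode == 0 else result.stderr


print(run_stateless("x = 41\nprint(x + 1)"))  # first call defines x
print(run_stateless("print(x + 1)"))          # x is gone: NameError
```

A model trained in a stateful environment may assume that earlier variables still exist, so each call has to carry all the code it depends on.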

Apply Patch

The apply_patch function allows the model to perform file operations, including creating, updating, or deleting local files.

Other Details

Precision and Quantization

The models support native quantization. Specifically, the linear projection weights in the MoE layers use MXFP4. These tensors consist of two components:

  • tensor.blocks: Packed fp4 values (two per uint8).
  • tensor.scales: Block-level scales along the final dimension.

Remaining tensors use BF16, which is also the recommended format for activation precision.
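To make the blocks/scales layout concrete, here is a minimal pure-Python sketch of decoding one block. It rests on two assumptions that you should verify against the reference code: the low nibble of each byte holds the first value of the pair, and each scale byte is a power-of-two exponent with a bias of 127. A real MXFP4 block covers 32 values (16 bytes); the example uses a shorter one for readability.

```python
import math

# The 16 values representable by a 4-bit E2M1 float, indexed by bit pattern.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
              -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def dequantize_block(blocks: bytes, scale: int) -> list[float]:
    """Decode one block: each packed nibble times 2**(scale - 127).

    Assumptions: low nibble first within each byte, and an 8-bit
    exponent-only scale with bias 127.
    """
    exponent = scale - 127
    values = []
    for byte in blocks:
        for nibble in (byte & 0x0F, byte >> 4):  # low nibble, then high nibble
            values.append(math.ldexp(FP4_VALUES[nibble], exponent))
    return values


# Two bytes hold four values; scale 128 multiplies each by 2**(128 - 127) = 2.
print(dequantize_block(bytes([0x21, 0xF7]), scale=128))
# prints [1.0, 2.0, 12.0, -12.0]
```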

Recommended Sampling Parameters

The recommended sampling settings are a temperature of 1.0 and a top_p of 1.0.
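These two values amount to sampling from the model's distribution as is: a temperature of 1.0 leaves the logits unscaled, and a top_p of 1.0 disables nucleus truncation. The sketch below is a generic illustration of temperature and top-p sampling, not the code of any particular inference backend. The function names nucleus and sample are illustrative.

```python
import math
import random


def nucleus(logits, temperature=1.0, top_p=1.0):
    """Return (probability, index) pairs that survive top-p filtering."""
    scaled = [x / temperature for x in logits]      # 1.0 leaves logits unchanged
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # stable softmax numerator
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    return kept


def sample(logits, temperature=1.0, top_p=1.0):
    """Draw one token index from the filtered distribution."""
    kept = nucleus(logits, temperature, top_p)
    return random.choices([i for _, i in kept], weights=[p for p, _ in kept])[0]
```

For logits [1.0, 2.0, 3.0], the defaults keep all three candidates, while top_p=0.5 keeps only the most likely one.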