OpenAI has released two new open-weight language models: gpt-oss-120b and gpt-oss-20b. These models are engineered for complex reasoning, agentic workflows, and a broad spectrum of developer needs.
Model Versions
Both models were trained using the Harmony response format. Adherence to this specific format is required for the models to function correctly.
Core Features
Inference Examples
These models are compatible with the Hugging Face Transformers library. The included chat template applies the Harmony format automatically. If you invoke model.generate directly, you must manually apply the Harmony format through the chat template or the openai-harmony package.
from transformers import pipeline
import torch
model_id = "openai/gpt-oss-120b"
pipe = pipeline(
"text-generation",
model=model_id,
torch_dtype="auto",
device_map="auto",
)
messages = [
{"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]
outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])
vLLM recommends uv for Python dependency management. The following commands install the gpt-oss build of vLLM and launch an OpenAI-compatible server:
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
vllm serve openai/gpt-oss-20b
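Once the server is running, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library and assumes vLLM's default port (8000) and the standard /v1/chat/completions route; the helper names build_chat_request and send are hypothetical and not part of vLLM or gpt-oss.

```python
import json
import urllib.request

# Assumption: vLLM's default listen address; adjust if the server was started with --port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str,
                       model: str = "openai/gpt-oss-20b",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Usage, with the server from the command above running:
#   print(send(build_chat_request("Explain quantum mechanics clearly and concisely.")))
```

Building the request separately from sending it makes it easy to inspect the payload before any network traffic happens.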
The PyTorch, Triton, and Metal implementations in this repository are provided primarily for educational reference and architectural transparency. They are not recommended for production environments.
To run gpt-oss on consumer-grade hardware via Ollama:
# gpt-oss-20b
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
# gpt-oss-120b
ollama pull gpt-oss:120b
ollama run gpt-oss:120b
Models can be retrieved directly within LM Studio:
# gpt-oss-20b
lms get openai/gpt-oss-20b
# gpt-oss-120b
lms get openai/gpt-oss-120b
About This Repository
The repository includes several reference implementations to assist with deployment and development:
torch: An unoptimized PyTorch implementation intended for educational purposes. Due to a lack of optimization, it requires at least four H100 GPUs.
triton: A more efficient implementation utilizing PyTorch and Triton, featuring CUDA graphs and basic KV caching.
metal: A specialized implementation optimized for Apple Silicon hardware.
browser: A reference for the web-browsing tool used during model training.
python: A stateless reference implementation for the Python execution tool.
chat: A basic terminal-based chat application. It supports PyTorch, Triton, or vLLM backends and integrates the Python and browser tools.
responses_api: An example server demonstrating the Responses API, including browser tool integration and other compatible features.
Setup
On macOS, install the Xcode command-line tools first (xcode-select --install).
Install via PyPI based on your hardware needs:
# Tools only
pip install gpt-oss
# Including Torch implementation (quoted so shells such as zsh do not expand the brackets)
pip install "gpt-oss[torch]"
# Including Triton implementation
pip install "gpt-oss[triton]"
To customize the code or utilize the Metal backend, clone the repository:
git clone https://github.com/openai/gpt-oss.git
GPTOSS_BUILD_METAL=1 pip install -e ".[metal]"
Use the Hugging Face CLI to retrieve the weights:
# gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir gpt-oss-120b/
# gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
Reference Implementations
The PyTorch implementation, located in gpt_oss/torch/model.py, is a plain reference implementation. It uses basic PyTorch operations to illustrate the model architecture and adds basic tensor parallelism in the MoE layers so that the 120b model can run on 4xH100 or 2xH200 configurations. All weights are upcast to BF16.
To run:
pip install -e ".[torch]"
# Example using 4xH100:
torchrun --nproc-per-node=4 -m gpt_oss.generate gpt-oss-120b/original/
The Triton implementation is the optimized version. It uses custom Triton MoE kernels with MXFP4 support and attention optimizations to reduce memory overhead, which lets gpt-oss-120b run on a single 80GB GPU. It requires nightly versions of Torch and Triton.
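A back-of-the-envelope calculation shows why the weight format makes the difference between four GPUs and one. The sketch below is only an approximation under stated assumptions: it uses the nominal 120e9 parameter count implied by the model name, treats every weight as stored in a single format, assumes an MXFP4 block of 32 values sharing one 8-bit scale, and ignores activations and the KV cache.

```python
def weight_gigabytes(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 120e9        # assumption: nominal count taken from the model name
BF16_BITS = 16                # BF16 stores each weight in 16 bits
MXFP4_BITS = 4 + 8 / 32       # assumption: 4-bit value plus one 8-bit scale per 32 values

bf16_gb = weight_gigabytes(NOMINAL_PARAMS, BF16_BITS)
mxfp4_gb = weight_gigabytes(NOMINAL_PARAMS, MXFP4_BITS)

print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
print(f"MXFP4 weights: ~{mxfp4_gb:.2f} GB")
```

Under these assumptions the BF16 figure exceeds the memory of any single 80GB card, while the MXFP4 figure fits below it. The real footprint differs because only the MoE projection weights use MXFP4 and the remaining tensors stay in BF16.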
Installation:
# Build Triton from source
git clone https://github.com/triton-lang/triton
cd triton/
pip install -r python/requirements.txt
pip install -e . --verbose --no-build-isolation
# Then, from the gpt-oss repository root, install the Triton extras
pip install -e ".[triton]"
Execution:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python -m gpt_oss.generate --backend triton gpt-oss-120b/original/
Note: If you encounter Out-of-Memory (OOM) errors, verify that expandable_segments is enabled.
The Metal implementation is designed for Apple Silicon. It matches the PyTorch implementation's precision but is not intended for production use.
Installing with the [metal] extra compiles the implementation automatically. Before running it, convert the SafeTensors weights to the Metal format:
python gpt_oss/metal/scripts/create-local-model.py -s <model_dir> -d <output_file>
Alternatively, download pre-converted weights:
huggingface-cli download openai/gpt-oss-120b --include "metal/*" --local-dir gpt-oss-120b/metal/
huggingface-cli download openai/gpt-oss-20b --include "metal/*" --local-dir gpt-oss-20b/metal/
Testing the setup:
python gpt_oss/metal/examples/generate.py gpt-oss-20b/metal/model.bin -p "Why did the chicken cross the road?"
Harmony Format and Tools
OpenAI has released the harmony chat library alongside these models to facilitate interaction. This includes system tools for web browsing and a Python execution container.
Clients
The terminal chat application (gpt_oss.chat) demonstrates the Harmony format across the PyTorch, Triton, and vLLM backends. It supports optional Python and browser tool integration.
usage: python -m gpt_oss.chat [-h] [-r REASONING_EFFORT] [-a] [-b] [--show-browser-results] [-p] [--developer-message DEVELOPER_MESSAGE] [-c CONTEXT] [--raw] [--backend {triton,torch,vllm}] FILE
Note: The Torch and Triton backends require raw checkpoints in the /original/ directory, whereas vLLM uses the Hugging Face checkpoints in the model root.
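The checkpoint rule in the note above can be captured in a small helper. This is an illustrative sketch only: checkpoint_path is a hypothetical function, not part of gpt_oss, and it assumes the weights were downloaded into a local directory such as gpt-oss-20b/ as in the download commands earlier.

```python
from pathlib import Path


def checkpoint_path(model_dir: str, backend: str) -> Path:
    """Return the directory to pass as FILE for a given chat backend."""
    root = Path(model_dir)
    if backend in ("torch", "triton"):
        # Raw checkpoints live in the original/ subdirectory.
        return root / "original"
    if backend == "vllm":
        # vLLM reads the Hugging Face checkpoint from the model root.
        return root
    raise ValueError(f"unknown backend: {backend!r}")


print(checkpoint_path("gpt-oss-20b", "triton"))
print(checkpoint_path("gpt-oss-20b", "vllm"))
```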
A sample Responses API server is provided as a foundation for custom implementations. While it does not cover every event type, it serves as a functional starting point. Several inference partners also provide their own Responses API implementations.
Launch the server by specifying the desired backend:
python -m gpt_oss.responses_api.serve [--checkpoint FILE] [--port PORT] [--inference-backend BACKEND]
To configure Codex as a gpt-oss client (using the 20b model as an example), modify ~/.codex/config.toml:
disable_response_storage = true
show_reasoning_content = true
[model_providers.local]
name = "local"
base_url = "http://localhost:11434/v1"
[profiles.oss]
model = "gpt-oss:20b"
model_provider = "local"
Then execute:
ollama run gpt-oss:20b
codex -p oss
Tools
Both models are trained to interact with a browser tool featuring search, open, and find methods. This is enabled by adding tool definitions to the system message of the Harmony prompt. You can use with_browser() for the standard interface or with_tools() for custom configurations.
The implementation (e.g., SimpleBrowserTool with an Exa backend) manages context size via a scrollable text window. Because it caches requests, a new browser instance should be initialized for each unique request.
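To make the scrollable-window idea concrete, here is a minimal sketch of how a long page can be exposed a few lines at a time so that only the visible slice enters the model's context. The ScrollableText class and its methods are hypothetical and are not the SimpleBrowserTool API; the line-prefix format is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ScrollableText:
    """Hypothetical fixed-size window over the lines of a fetched page."""
    lines: list[str]
    window: int = 3
    offset: int = 0

    def view(self) -> str:
        """Return only the currently visible lines, prefixed with line numbers."""
        visible = self.lines[self.offset:self.offset + self.window]
        return "\n".join(f"L{self.offset + i}: {line}" for i, line in enumerate(visible))

    def scroll(self, delta: int) -> None:
        """Move the window by delta lines, clamped to the page bounds."""
        last_start = max(0, len(self.lines) - self.window)
        self.offset = min(max(0, self.offset + delta), last_start)

    def find(self, pattern: str) -> list[int]:
        """Return indices of lines containing pattern (case-insensitive)."""
        needle = pattern.lower()
        return [i for i, line in enumerate(self.lines) if needle in line.lower()]


page = ScrollableText([f"line {i}" for i in range(10)], window=3)
print(page.view())
page.scroll(4)
print(page.view())
```

Because the window holds state (the current offset), sharing one instance across unrelated requests would mix their positions, which is consistent with the advice to create a fresh browser instance per request.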
The models can utilize a Python tool for computational tasks within the chain-of-thought process. While the training environment was stateful, this reference implementation is stateless and runs within a Docker container. Developers should be mindful of prompt injection risks and implement strict container restrictions for production use. Enable this via with_python() or with_tools().
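The difference between stateful and stateless execution can be illustrated without Docker. In the sketch below, each call runs in a fresh interpreter process, so variables defined in one call are gone in the next. This is an assumption-laden illustration, not the reference tool: run_stateless is a hypothetical helper, and a plain subprocess provides no sandboxing, so it must not be used on untrusted model output.

```python
import subprocess
import sys


def run_stateless(code: str, timeout_s: float = 10.0) -> str:
    """Run code in a fresh Python process and return its stdout (or stderr on failure).

    No isolation is provided here; the reference tool uses a Docker container instead.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout if result.returncode == 0 else result.stderr


# First call defines and prints x.
print(run_stateless("x = 41 + 1\nprint(x)"))
# Second call cannot see x, because no state carries over between calls.
print(run_stateless("print(x)"))
```

In a stateless tool the model has to resend any definitions it needs with every call, which is the practical consequence of the design described above.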
The apply_patch function allows the model to perform file operations, including creating, updating, or deleting local files.
Other Details
The models support native quantization. Specifically, the linear projection weights in the MoE layers use MXFP4. These tensors consist of two components:
tensor.blocks: Packed fp4 values (two per uint8).
tensor.scales: Block-level scales along the final dimension.
Remaining tensors use BF16, which is also the recommended format for activation precision.
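The two-component layout can be sketched in plain Python. The value table below is the standard 4-bit E2M1 set; the other details are assumptions made for illustration and should be checked against the repository's weight-loading code: the low nibble is taken as the first value of each byte, and each scale is treated as a power-of-two exponent with a bias of 127. The function name dequantize_block is hypothetical.

```python
# Standard FP4 (E2M1) values, indexed by the 4-bit code; the top bit is the sign.
FP4_VALUES = [
    0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
    -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
]
SCALE_BIAS = 127  # assumption: scales are stored as biased power-of-two exponents


def dequantize_block(blocks: bytes, scale: int) -> list[float]:
    """Unpack two fp4 values per byte and apply the shared block scale."""
    factor = 2.0 ** (scale - SCALE_BIAS)
    values = []
    for byte in blocks:
        values.append(FP4_VALUES[byte & 0x0F] * factor)  # assumption: low nibble first
        values.append(FP4_VALUES[byte >> 4] * factor)    # then high nibble
    return values


# One packed byte holding codes 1 (0.5) and 2 (1.0), with a scale exponent of +1.
print(dequantize_block(bytes([0x21]), scale=128))
```

Each byte yields two weights, so a block of n packed bytes expands to 2n values that all share one scale.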
The recommended sampling settings are temperature=1.0 and top_p=1.0.
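With both values at 1.0, sampling draws from the model's unmodified output distribution: the logits are not rescaled and no tokens are cut from the tail. The following sketch of temperature scaling followed by nucleus (top-p) filtering illustrates this; it is a generic textbook formulation, not the sampling code of any particular backend, and sampling_distribution is a hypothetical name.

```python
import math


def sampling_distribution(logits: list[float],
                          temperature: float = 1.0,
                          top_p: float = 1.0) -> list[float]:
    """Return token probabilities after temperature scaling and top-p filtering."""
    # Temperature divides the logits before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Zero out the rest and renormalize.
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    norm = sum(filtered)
    return [p / norm for p in filtered]


logits = [2.0, 1.0, 0.0]
print(sampling_distribution(logits))             # plain softmax of the logits
print(sampling_distribution(logits, top_p=0.5))  # only the most likely token survives
```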