Transformers Library: Installation, Pipeline API, and Model Examples

Published July 7 in Machine Learning

Transformers is a model-definition framework for state-of-the-art machine learning models. It supports text, computer vision, audio, video, and multimodal models for both inference and training.

The library centralizes model definitions so that a single definition is agreed upon across the machine learning ecosystem. Because Transformers acts as the pivot between frameworks, a model it supports is compatible with most training frameworks (such as Axolotl, Unsloth, DeepSpeed, FSDP, and PyTorch-Lightning), inference engines (such as vLLM, SGLang, and TGI), and adjacent modeling libraries (such as llama.cpp and mlx), all of which build on the model definition from Transformers.

We are committed to supporting new state-of-the-art models as they emerge. By prioritizing clear, customizable, and efficient model definitions, we make these advanced tools accessible to a wider audience of developers and researchers.

The Hugging Face Hub currently hosts over 1 million Transformers model checkpoints. You can explore the Hub to find the right model for your specific needs and start building immediately.

Installation

Transformers supports Python 3.9+, PyTorch 2.1+, TensorFlow 2.6+, and Flax 0.4.1+.

To begin, create and activate a virtual environment using venv or uv (a high-performance Rust-based Python package manager).

# venv
python -m venv .my-env
source .my-env/bin/activate

# uv
uv venv .my-env
source .my-env/bin/activate

Install Transformers within your virtual environment:

# pip
pip install "transformers[torch]"

# uv
uv pip install "transformers[torch]"
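A quick way to confirm that the install succeeded is to query the installed package versions. The sketch below uses only the standard library; the helper name `installed_version` is ours and not part of Transformers, and the check for `torch` assumes you installed the `[torch]` extra as shown above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this guide relies on.
for name in ("transformers", "torch"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If either line prints "not installed", check that the virtual environment is activated before rerunning the install command.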

If you wish to contribute to the library or access the latest experimental features, install it from source. Note that the development version may be unstable; please file an issue if you encounter bugs.

git clone https://github.com/huggingface/transformers.git
cd transformers

# pip
pip install ".[torch]"

# uv
uv pip install ".[torch]"

Quick Start

The Pipeline API is the fastest way to get started. Pipeline is a high-level inference class for text, audio, vision, and multimodal tasks, and the pipeline() function builds one for a given task and model. It handles input preprocessing automatically and returns structured results.
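To make the idea of automatic preprocessing and structured results concrete, here is a toy sketch of the three stages a pipeline typically chains: preprocess, forward, and postprocess. Everything in it (the `ToyPipeline` class, the word lists, the scoring rule) is hypothetical and stands in for a real tokenizer and model; it is not the library's implementation.

```python
from typing import Dict, List

# Hypothetical word lists standing in for what a trained model has learned.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


class ToyPipeline:
    """Illustrative only: mimics a preprocess -> forward -> postprocess flow."""

    def preprocess(self, text: str) -> List[str]:
        # Real pipelines tokenize into tensors; here we just lowercase and split.
        return text.lower().split()

    def forward(self, tokens: List[str]) -> int:
        # Real pipelines run a neural network; here we count sentiment words.
        return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

    def postprocess(self, raw_score: int) -> List[Dict[str, object]]:
        # Return a list of dictionaries, a shape pipelines commonly use for results.
        label = "POSITIVE" if raw_score >= 0 else "NEGATIVE"
        return [{"label": label, "raw_score": raw_score}]

    def __call__(self, text: str) -> List[Dict[str, object]]:
        return self.postprocess(self.forward(self.preprocess(text)))


toy = ToyPipeline()
print(toy("I love this great library"))  # [{'label': 'POSITIVE', 'raw_score': 2}]
```

The caller only ever sees the raw input and the final list of dictionaries; the intermediate token and score representations stay inside the object.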

Text Generation

To generate text, instantiate a pipeline and specify a model. The library downloads and caches the model locally for future use. Once initialized, simply pass your text as a prompt.

from transformers import pipeline

# Use a distinct name so the imported pipeline() function is not shadowed.
generator = pipeline(task="text-generation", model="Qwen/Qwen2.5-1.5B")
generator("the secret to baking a really good cake is ")

Output:

[{'generated_text': 'the secret to baking a really good cake is 1) to use the right ingredients and 2) to follow the recipe exactly. the recipe for the cake is as follows: 1 cup of sugar, 1 cup of flour, 1 cup of milk, 1 cup of butter, 1 cup of eggs, 1 cup of chocolate chips. if you want to make 2 cakes, how much sugar do you need? To make 2 cakes, you will need 2 cups of sugar.'}]
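The result is a list with one dictionary per generated sequence, and, as the output above shows, `generated_text` starts with the prompt itself. If only the continuation is wanted, the prompt can be sliced off. A minimal sketch, assuming the output shape shown above; `continuation_only` is a hypothetical helper, and the sample result is abbreviated by hand rather than produced by a model.

```python
from typing import Dict, List


def continuation_only(prompt: str, results: List[Dict[str, str]]) -> List[str]:
    """Strip the leading prompt from each generated sequence, if present."""
    continuations = []
    for item in results:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text)
    return continuations


prompt = "the secret to baking a really good cake is "
# Hand-abbreviated stand-in for the pipeline output shown above.
results = [{"generated_text": prompt + "1) to use the right ingredients"}]
print(continuation_only(prompt, results))  # ['1) to use the right ingredients']
```

The text-generation pipeline also accepts `return_full_text=False`, which should give a similar result without any post-processing.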

Chat with a Model

The workflow for chat models is similar: build a conversation history as a list of messages, each with a role (system, user, or assistant) and content, then pass it to the pipeline.

You can also chat with a model directly from your terminal:

transformers chat Qwen/Qwen2.5-0.5B-Instruct

To chat from Python instead:

import torch
from transformers import pipeline

chat = [
    {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
    {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

# Load the weights in bfloat16 and let device_map="auto" place them on the available device(s).
chatbot = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
response = chatbot(chat, max_new_tokens=512)
# The last message in the returned conversation is the model's reply.
print(response[0]["generated_text"][-1]["content"])
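When the input is a list of messages, `generated_text` comes back as the conversation with the model's reply appended, which is why the example indexes `[-1]["content"]`. Multi-turn chat then becomes a matter of carrying the returned history forward. A minimal sketch, assuming that output shape; `last_reply` and `add_user_turn` are hypothetical helpers, and `fake_response` is a hand-written stand-in so the snippet runs without downloading a model.

```python
from typing import Dict, List

Message = Dict[str, str]


def last_reply(response: List[Dict[str, List[Message]]]) -> str:
    """Return the content of the final message in the first generated conversation."""
    return response[0]["generated_text"][-1]["content"]


def add_user_turn(response: List[Dict[str, List[Message]]], text: str) -> List[Message]:
    """Build the next pipeline input: the returned history plus a new user message."""
    history = list(response[0]["generated_text"])  # copy so the response is not mutated
    history.append({"role": "user", "content": text})
    return history


# Hand-written stand-in for a real pipeline response.
fake_response = [{
    "generated_text": [
        {"role": "user", "content": "Any fun things to do in New York?"},
        {"role": "assistant", "content": "Walk the High Line."},
    ]
}]

print(last_reply(fake_response))  # Walk the High Line.
next_chat = add_user_turn(fake_response, "What about food?")
print(len(next_chat), next_chat[-1]["role"])  # 3 user
```

Passing `next_chat` back to the pipeline would then produce the following assistant turn, with the earlier exchange available as context.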

Other Modality Examples

Automatic Speech Recognition

from transformers import pipeline

# Use a distinct name so the imported pipeline() function is not shadowed.
transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-large-v3")
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")

Output:

{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

Image Classification

from transformers import pipeline

# Use a distinct name so the imported pipeline() function is not shadowed.
classifier = pipeline(task="image-classification", model="facebook/dinov2-small-imagenet1k-1-layer")
classifier("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")

Output:

[{'label': 'macaw', 'score': 0.997848391532898},
 {'label': 'sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita',
  'score': 0.0016551691805943847},
 {'label': 'lorikeet', 'score': 0.00018523589824326336},
 {'label': 'African grey, African gray, Psittacus erithacus',
  'score': 7.85409429227002e-05},
 {'label': 'quail', 'score': 5.502637941390276e-05}]
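Classification results arrive as a list of label/score dictionaries, listed here from most to least likely. A small sketch of filtering such a list by confidence; `confident_labels` and the 0.5 threshold are illustrative choices, and the scores are copied (rounded) from the output above.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def confident_labels(predictions: List[Prediction], min_score: float = 0.5) -> List[str]:
    """Keep labels whose score meets the threshold, highest score first."""
    kept = [p for p in predictions if p["score"] >= min_score]
    kept.sort(key=lambda p: p["score"], reverse=True)
    return [str(p["label"]) for p in kept]


predictions = [
    {"label": "macaw", "score": 0.9978},
    {"label": "lorikeet", "score": 0.0002},
    {"label": "quail", "score": 0.0001},
]
print(confident_labels(predictions))       # ['macaw']
print(confident_labels(predictions, 0.0))  # all three labels, in score order
```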

Visual Question Answering

from transformers import pipeline

# Use a distinct name so the imported pipeline() function is not shadowed.
vqa = pipeline(task="visual-question-answering", model="Salesforce/blip-vqa-base")
vqa(
    image="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg",
    question="What is in the image?",
)

Output:

[{'answer': 'statue of liberty'}]

Why Choose Transformers?

State-of-the-art models made accessible

  • Delivers top-tier performance across natural language understanding, computer vision, audio, and multimodal tasks.
  • Lowers the barrier to entry for researchers, engineers, and developers.
  • Requires learning only three core classes, keeping user-side abstractions to a minimum.
  • Provides a unified API for all available pretrained models.
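The three-class point can be illustrated with the library's `Auto*` classes. The sketch below assumes the three classes in question are a configuration, a model, and a preprocessor (here a tokenizer); the function name `generate_text` and its defaults are ours. Imports are deferred into the function so that defining it does not load the library, and calling it downloads the checkpoint on first use.

```python
def generate_text(prompt: str, model_id: str = "Qwen/Qwen2.5-1.5B", max_new_tokens: int = 30) -> str:
    """Load config, tokenizer, and model explicitly, then generate a continuation."""
    # Deferred imports: these need `pip install "transformers[torch]"`.
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    # 1. Configuration: the architecture's hyperparameters.
    config = AutoConfig.from_pretrained(model_id)
    # 2. Preprocessor: turns text into token ids and back.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # 3. Model: the pretrained weights plus the forward pass.
    model = AutoModelForCausalLM.from_pretrained(model_id, config=config)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("the secret to baking a really good cake is ")` should return the prompt followed by the model's continuation. The same three calls work for other checkpoints by changing `model_id`, which is the sense in which the API is unified.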

Reduced compute costs and environmental impact

  • Leverages shared pretrained models to eliminate the need for training from scratch.
  • Minimizes compute time and production expenses.
  • Offers access to dozens of architectures and over 1 million pretrained checkpoints across all modalities.

Framework flexibility for every lifecycle stage

  • Train high-performance models with just a few lines of code.
  • Migrate a single model between PyTorch, JAX, and TensorFlow 2.0 without friction.
  • Select the best framework for training, evaluation, or production independently.

Simple customization for specific needs

  • Provides example scripts for every architecture to reproduce original published results.
  • Exposes model internals in a consistent, transparent manner.
  • Keeps model files independent of the rest of the library, allowing for quick, isolated experiments.

Why Not Choose Transformers?

Transformers is not designed to be a modular, general-purpose toolbox for building neural networks. To help researchers iterate quickly, model files are intentionally written without heavy refactoring or excessive abstraction layers. This allows users to modify architectures directly without navigating complex file hierarchies.

The training API is specifically optimized for PyTorch models within the Transformers ecosystem. For more general machine learning loops, consider libraries like Accelerate.

Additionally, the provided example scripts are intended as templates. They may require modification to suit specific, real-world use cases and are not guaranteed to work out-of-the-box for every scenario.

100 Projects Built with Transformers

Transformers is more than just a library; it is a thriving community of projects centered around the Hugging Face Hub. Our goal is to provide the foundation for developers, researchers, and students to bring their ideas to life.

To celebrate reaching 100,000 GitHub stars, we launched the "awesome-transformers" page. This curated list spotlights 100 exceptional projects built using the library. If you own or use a project that belongs on that list, we encourage you to submit a pull request.

Example Models

Most models can be tested instantly through their respective pages on the Hugging Face Hub.

Audio

  • Audio classification with Whisper
  • Automatic speech recognition with Moonshine
  • Keyword spotting with Wav2Vec2
  • Speech-to-speech generation with Moshi
  • Text-to-audio with MusicGen
  • Text-to-speech with Bark

Computer Vision

  • Automatic mask generation with SAM
  • Depth estimation with DepthPro
  • Image classification with DINOv2
  • Keypoint detection and matching with SuperGlue
  • Object detection with RT-DETRv2
  • Pose estimation with VitPose
  • Universal segmentation with OneFormer
  • Video classification with VideoMAE

Multimodal

  • Audio-or-text to text with Qwen2-Audio
  • Document question answering with LayoutLMv3
  • Image-or-text to text with Qwen-VL
  • Image captioning with BLIP-2
  • OCR-based document understanding with GOT-OCR2
  • Table question answering with TAPAS
  • Unified multimodal understanding and generation with Emu3
  • Vision-to-text with Llava-OneVision
  • Visual question answering with Llava
  • Visual referring expression segmentation with Kosmos-2

NLP

  • Masked word completion with ModernBERT
  • Named entity recognition with Gemma
  • Question answering with Mixtral
  • Summarization with BART
  • Translation with T5
  • Text generation with Llama
  • Text classification with Qwen