IndexTTS2 Zero-Shot Voice Cloning Beats Benchmarks for Accuracy and Emotion

Published September 12 in Text-to-Speech Tools

IndexTTS2 addresses a persistent issue in modern speech synthesis: the difficulty of controlling exactly how long the generated speech runs. Instead of relying on duration estimates, it uses a token-counting method that fixes the speech length while preserving the original speaker's rhythm and intonation. The output reaches the target duration without distorting the natural cadence of the prompt.
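As a rough illustration of the token-counting idea, a target duration can be turned into a token budget by multiplying it by the token rate. This is only a sketch: the function name and the rate of 25 tokens per second are hypothetical values chosen for the example, not figures taken from IndexTTS2.

```python
def tokens_for_duration(seconds: float, tokens_per_second: float) -> int:
    """Return how many speech tokens cover `seconds` of audio at a fixed token rate."""
    if seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("seconds and tokens_per_second must be positive")
    return round(seconds * tokens_per_second)


# Hypothetical token rate, used only for illustration.
ASSUMED_TOKENS_PER_SECOND = 25.0

# A 2-second target at the assumed rate needs a budget of 50 tokens.
budget = tokens_for_duration(2.0, ASSUMED_TOKENS_PER_SECOND)
print(budget)
```

Fixing the number of generated tokens fixes the length of the clip, which is why no separate duration predictor is needed in this scheme.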

The model also decouples speaker identity from emotional tone, so each can be adjusted independently. In zero-shot mode, it reconstructs a target timbre and applies a specific emotional delivery (such as anger, sadness, or excitement) without the two attributes bleeding into each other. To keep speech clear in highly emotional clips, the system incorporates implicit GPT representations and a three-stage training regimen. A soft-prompt mechanism, fine-tuned from Qwen3, simplifies emotional control: users describe the desired emotion in a basic text prompt, and the model maps that description to the emotional delivery.

In the reported evaluations, IndexTTS2 outperforms existing zero-shot TTS systems, with lower word error rates, higher speaker similarity, and better emotional fidelity. The release includes a web-based demo and a Python API. Emotional inputs are flexible: the model accepts reference clips, standalone emotional samples, 8-dimensional vectors, or plain text.
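To make the 8-dimensional vector input more concrete, the sketch below builds such a vector from named weights. The text above only states that the vector has eight dimensions; the emotion labels, their order, and the 0 to 1 range used here are assumptions for illustration, and the helper name is hypothetical.

```python
from typing import Dict, List

# Assumed labels and ordering for the eight dimensions (not confirmed by the text above).
ASSUMED_EMOTIONS = [
    "happy", "angry", "sad", "afraid",
    "disgusted", "melancholic", "surprised", "calm",
]


def emotion_vector(weights: Dict[str, float]) -> List[float]:
    """Build an 8-dimensional vector from named weights, clamping each value to [0, 1]."""
    unknown = set(weights) - set(ASSUMED_EMOTIONS)
    if unknown:
        raise KeyError(f"unknown emotion(s): {sorted(unknown)}")
    return [
        min(1.0, max(0.0, float(weights.get(name, 0.0))))
        for name in ASSUMED_EMOTIONS
    ]


# Mostly angry with a little surprise; unspecified dimensions stay at 0.
vec = emotion_vector({"angry": 0.7, "surprised": 0.2})
print(vec)
```

Keeping the emotion in its own vector, separate from the speaker reference, mirrors the decoupling of timbre and emotion described earlier.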

Technical Architecture

The pipeline integrates several components: a 45E condition vector, a text-speech language model, a BigVGAN2 decoder, speaker vectors, text tokens, and both 5-dimensional and 8-dimensional latent acoustic tokens. Together, these elements synthesize the final waveform.

Key design features include:

  1. Character-Pinyin Hybrid for Chinese: This modeling approach allows mispronunciations to be identified and corrected in real time.
  2. Conformer + BigVGAN2 Backbone: This combination stabilizes the training process and improves voice similarity and audio fidelity.
  3. Open Evaluation Suites: All evaluation sets, including polysyllabic word tests and subjective/objective metrics, are published to allow for transparent replication and research.
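The character-pinyin hybrid in the first item can be pictured as input text in which selected characters are swapped for an explicit pinyin reading. The sketch below shows one way to prepare such text. The annotation format (uppercase pinyin followed by a tone digit) and the helper name are assumptions for illustration, not a documented input specification.

```python
from typing import Dict


def annotate_pinyin(text: str, overrides: Dict[int, str]) -> str:
    """Replace the characters at the given indices with explicit pinyin readings."""
    for index in overrides:
        if not 0 <= index < len(text):
            raise IndexError(f"index {index} is outside the text")
    return "".join(overrides.get(i, ch) for i, ch in enumerate(text))


# Force the second character to be read as "hang" with the second tone.
# The "HANG2" notation is an assumed format.
fixed = annotate_pinyin("银行", {1: "HANG2"})
print(fixed)
```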

Performance Metrics

Word Error Rate, % (Seed-Test)

| Model | test_zh | test_en | test_hard |
|---|---|---|---|
| Human | 1.26 | 2.14 | - |
| SeedTTS | 1.002 | 1.945 | 6.243 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| F5TTS | 1.56 | 1.83 | 8.67 |
| FireRedTTS | 1.51 | 3.82 | 17.45 |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| Spark-TTS | 1.2 | 1.98 | - |
| MegaTTS 3 | 1.36 | 1.82 | - |
| IndexTTS | 0.937 | 1.936 | 6.831 |
| IndexTTS-1.5 | 0.821 | 1.606 | 6.565 |
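For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation script behind the table; Chinese test sets are often scored per character rather than per whitespace-separated word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```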

Word Error Rate, % (Public Benchmarks)

| Model | aishell1 | cv_zh | cv_en | librispeech | Avg |
|---|---|---|---|---|---|
| Human | 2.0 | 9.5 | 10.0 | 2.4 | 5.1 |
| CosyVoice 2 | 1.8 | 9.1 | 7.3 | 4.9 | 5.9 |
| F5TTS | 3.9 | 11.7 | 5.4 | 7.8 | 8.2 |
| Fishspeech | 2.4 | 11.4 | 8.8 | 8.0 | 8.3 |
| FireRedTTS | 2.2 | 11.0 | 16.3 | 5.7 | 7.7 |
| XTTS | 3.0 | 11.4 | 7.1 | 3.5 | 6.0 |
| IndexTTS | 1.3 | 7.0 | 5.3 | 2.1 | 3.7 |
| IndexTTS-1.5 | 1.2 | 6.8 | 3.9 | 1.7 | 3.1 |

Speaker Similarity

| Model | aishell1 | cv_zh | cv_en | librispeech | Avg |
|---|---|---|---|---|---|
| Human | 0.846 | 0.809 | 0.820 | 0.858 | 0.836 |
| CosyVoice 2 | 0.796 | 0.743 | 0.742 | 0.837 | 0.788 |
| F5TTS | 0.743 | 0.747 | 0.746 | 0.828 | 0.779 |
| Fishspeech | 0.488 | 0.552 | 0.622 | 0.701 | 0.612 |
| FireRedTTS | 0.579 | 0.593 | 0.587 | 0.698 | 0.631 |
| XTTS | 0.573 | 0.586 | 0.648 | 0.761 | 0.663 |
| IndexTTS | 0.744 | 0.742 | 0.758 | 0.823 | 0.776 |
| IndexTTS-1.5 | 0.741 | 0.722 | 0.753 | 0.819 | 0.771 |
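The article does not say how these similarity scores were computed. A common approach, assumed here, is the cosine similarity between speaker embeddings of the reference clip and the synthesized clip. The sketch below shows only that final step, with toy vectors standing in for embeddings from a speaker-verification model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings"; real speaker embeddings have far more dimensions.
reference_embedding = [3.0, 4.0]
synthesized_embedding = [4.0, 3.0]
print(cosine_similarity(reference_embedding, synthesized_embedding))
```

Under this reading, values closer to 1 mean the cloned voice is closer to the reference speaker.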

Zero-Shot Voice Cloning (MOS Scores)

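| Model | Prosody | Timbre | Quality | AVG |
|---|---|---|---|---|
| CosyVoice 2 | 3.67 | 4.05 | 3.73 | 3.81 |
| F5TTS | 3.56 | 3.88 | 3.56 | 3.66 |
| Fishspeech | 3.40 | 3.63 | 3.69 | 3.57 |
| FireRedTTS | 3.79 | 3.72 | 3.60 | 3.70 |
| XTTS | 3.23 | 2.99 | 3.10 | 3.11 |
| IndexTTS | 3.79 | 4.20 | 4.05 | 4.01 |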

Quick Start: IndexTTS2 Setup

Environment

Clone the repository and create a new Conda environment:

git clone https://github.com/index-tts/index-tts.git
conda create -n index-tts python=3.10
conda activate index-tts
apt-get install ffmpeg
# or: conda install -c conda-forge ffmpeg

Install PyTorch (example for CUDA 11.8):

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

Windows users encountering pynini build errors should install it via Conda:

conda install -c conda-forge pynini==2.1.6
pip install WeTextProcessing --no-deps

Install IndexTTS as an editable package:

cd index-tts
pip install -e .

Download Models

Use the huggingface-cli:

huggingface-cli download IndexTeam/IndexTTS-1.5 \
  config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
  --local-dir checkpoints

For faster downloads in China, set the mirror endpoint before running the download command:

export HF_ENDPOINT="https://hf-mirror.com"
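After the download finishes, it can help to confirm that every file named in the command above actually landed in the checkpoints folder. The helper below is a small sketch for that check; the function name is hypothetical and the file list is copied from the download command.

```python
from pathlib import Path
from typing import List

# File names taken from the huggingface-cli command above.
EXPECTED_FILES = [
    "config.yaml",
    "bigvgan_discriminator.pth",
    "bigvgan_generator.pth",
    "bpe.model",
    "dvae.pth",
    "gpt.pth",
    "unigram_12000.vocab",
]


def missing_checkpoints(model_dir: str = "checkpoints") -> List[str]:
    """Return the expected files that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all files present" if not missing else f"missing: {missing}")
```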

Inference

Place a reference audio file named input.wav in the test_data folder, then run:

python indextts/infer.py

Command-line usage:

indextts "Your text here." --voice reference_voice.wav --model_dir checkpoints --config checkpoints/config.yaml --output output.wav

Web Demo

pip install -e ".[webui]" --no-build-isolation
python webui.py

Access the interface at http://127.0.0.1:7860.

Python API Example

from indextts.infer import IndexTTS

tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")
voice = "reference_voice.wav"
text = "Your long-form text goes here."

tts.infer(voice, text, "output.wav")
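For several sentences in the same voice, the call above can be wrapped in a loop. The helper below is a sketch that relies only on the infer(voice, text, output_path) call shown in the example; the helper name, the output folder, and the file naming scheme are illustrative choices.

```python
from pathlib import Path
from typing import Iterable, List


def synthesize_batch(tts, voice: str, texts: Iterable[str], out_dir: str = "outputs") -> List[str]:
    """Synthesize each text with the same reference voice and return the output paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts, start=1):
        path = str(out / f"utt_{i:03d}.wav")
        tts.infer(voice, text, path)  # same argument order as the example above
        paths.append(path)
    return paths


# Usage with the model loaded as in the example above:
# from indextts.infer import IndexTTS
# tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")
# synthesize_batch(tts, "reference_voice.wav", ["First sentence.", "Second sentence."])
```

Loading the model once and reusing the tts object avoids paying the checkpoint loading cost for every sentence.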