IndexTTS2 addresses a persistent problem in modern speech synthesis: controlling exactly how long the generated speech lasts. Instead of relying on duration estimates, it uses a token-counting method that fixes the speech length while preserving the original speaker's rhythm and intonation. The output therefore hits a target duration without distorting the natural cadence of the prompt.
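
The link between a target duration and a token count is simple arithmetic, sketched below. This is an illustration only: it assumes a constant token rate, and `TOKENS_PER_SECOND`, `tokens_for_duration`, and `duration_for_tokens` are hypothetical names. The rate of 25 tokens per second is a placeholder, not a documented IndexTTS2 value.

```python
import math

# Hypothetical token rate, assumed for illustration; not taken from IndexTTS2.
TOKENS_PER_SECOND = 25.0


def tokens_for_duration(seconds: float, tokens_per_second: float = TOKENS_PER_SECOND) -> int:
    """Return the number of speech tokens needed to cover `seconds` of audio."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Round first so floating-point noise does not push the ceiling up by one.
    return math.ceil(round(seconds * tokens_per_second, 6))


def duration_for_tokens(num_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Return the audio length in seconds that `num_tokens` tokens would cover."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    budget = tokens_for_duration(2.0)
    print(budget, duration_for_tokens(budget))
```

Under this assumption, fixing the token budget fixes the clip length, and the model decides how to distribute rhythm and pauses within it.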
The model also decouples speaker identity from emotional tone, so each can be adjusted independently. In zero-shot mode, it reconstructs a target timbre and applies a specified emotional delivery (such as anger, sadness, or excitement) without the two attributes bleeding into each other. To keep speech clear in highly emotional clips, the system incorporates implicit GPT representations and a three-stage training regimen. A soft-prompt mechanism, fine-tuned from Qwen3, simplifies emotional control: users describe the desired emotion in plain text, and the model handles the complex mapping.
According to the reported results, IndexTTS2 outperforms existing zero-shot TTS systems, with lower word error rates, higher speaker similarity, and better emotional fidelity. The release includes a web-based demo and a Python API. Emotional inputs are flexible: the model accepts reference clips, standalone emotional samples, 8-dimensional emotion vectors, or plain text.
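
To make the 8-dimensional vector input concrete, here is a minimal sketch of how such a vector might be assembled from named weights. The label names, their order, the cap on the total weight, and the helper `make_emotion_vector` are all assumptions made for illustration; check the release for the actual convention.

```python
from typing import Dict, List

# Assumed label order, for illustration only; the real order may differ.
EMOTIONS: List[str] = [
    "happy", "angry", "sad", "afraid",
    "disgusted", "melancholic", "surprised", "calm",
]


def make_emotion_vector(weights: Dict[str, float], max_total: float = 1.0) -> List[float]:
    """Build an 8-dimensional vector from {label: weight}, capping the total weight."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotion labels: {sorted(unknown)}")
    # Missing labels default to 0; negative weights are clipped to 0.
    vec = [max(0.0, float(weights.get(name, 0.0))) for name in EMOTIONS]
    total = sum(vec)
    if total > max_total:
        # Scale down proportionally so the weights sum to max_total.
        vec = [v * max_total / total for v in vec]
    return vec


if __name__ == "__main__":
    print(make_emotion_vector({"angry": 0.6, "surprised": 0.2}))
```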
The pipeline integrates several components: a 45E condition vector, a text-speech language model, a BigVGAN2 decoder, speaker vectors, text tokens, and both 5-dimensional and 8-dimensional latent acoustic tokens. Together, these elements synthesize the final waveform.
The tables below report benchmark results for the earlier IndexTTS and IndexTTS-1.5 releases.
Word Error Rate on Seed-Test (%, lower is better)
| Model | test_zh | test_en | test_hard |
|---|---|---|---|
| Human | 1.26 | 2.14 | - |
| SeedTTS | 1.002 | 1.945 | 6.243 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| F5TTS | 1.56 | 1.83 | 8.67 |
| FireRedTTS | 1.51 | 3.82 | 17.45 |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| Spark-TTS | 1.2 | 1.98 | - |
| MegaTTS 3 | 1.36 | 1.82 | - |
| IndexTTS | 0.937 | 1.936 | 6.831 |
| IndexTTS-1.5 | 0.821 | 1.606 | 6.565 |
Word Error Rate on Public Benchmarks (%, lower is better)
| Model | aishell1 | cv_zh | cv_en | librispeech | Avg |
|---|---|---|---|---|---|
| Human | 2.0 | 9.5 | 10.0 | 2.4 | 5.1 |
| CosyVoice 2 | 1.8 | 9.1 | 7.3 | 4.9 | 5.9 |
| F5TTS | 3.9 | 11.7 | 5.4 | 7.8 | 8.2 |
| Fishspeech | 2.4 | 11.4 | 8.8 | 8.0 | 8.3 |
| FireRedTTS | 2.2 | 11.0 | 16.3 | 5.7 | 7.7 |
| XTTS | 3.0 | 11.4 | 7.1 | 3.5 | 6.0 |
| IndexTTS | 1.3 | 7.0 | 5.3 | 2.1 | 3.7 |
| IndexTTS-1.5 | 1.2 | 6.8 | 3.9 | 1.7 | 3.1 |
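
The gaps in the Avg column are easier to compare as relative reductions. The short script below is a sketch that copies the Avg values from the table above and computes how much lower the IndexTTS-1.5 average is than each baseline; `relative_reduction` is a helper name introduced here.

```python
# Avg WER (%) copied from the public-benchmark table above.
avg_wer = {
    "CosyVoice 2": 5.9,
    "F5TTS": 8.2,
    "Fishspeech": 8.3,
    "FireRedTTS": 7.7,
    "XTTS": 6.0,
    "IndexTTS": 3.7,
    "IndexTTS-1.5": 3.1,
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Return the relative WER reduction of `candidate` versus `baseline`, in percent."""
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    target = avg_wer["IndexTTS-1.5"]
    for name, wer in avg_wer.items():
        if name != "IndexTTS-1.5":
            print(f"{name}: {relative_reduction(wer, target):.1f}% lower avg WER")
```
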
Speaker Similarity (higher is better)
| Model | aishell1 | cv_zh | cv_en | librispeech | Avg |
|---|---|---|---|---|---|
| Human | 0.846 | 0.809 | 0.820 | 0.858 | 0.836 |
| CosyVoice 2 | 0.796 | 0.743 | 0.742 | 0.837 | 0.788 |
| F5TTS | 0.743 | 0.747 | 0.746 | 0.828 | 0.779 |
| Fishspeech | 0.488 | 0.552 | 0.622 | 0.701 | 0.612 |
| FireRedTTS | 0.579 | 0.593 | 0.587 | 0.698 | 0.631 |
| XTTS | 0.573 | 0.586 | 0.648 | 0.761 | 0.663 |
| IndexTTS | 0.744 | 0.742 | 0.758 | 0.823 | 0.776 |
| IndexTTS-1.5 | 0.741 | 0.722 | 0.753 | 0.819 | 0.771 |
Zero-Shot Voice Cloning (MOS Scores)
| Model | Prosody | Timbre | Quality | AVG |
|---|---|---|---|---|
| CosyVoice 2 | 3.67 | 4.05 | 3.73 | 3.81 |
| F5TTS | 3.56 | 3.88 | 3.56 | 3.66 |
| Fishspeech | 3.40 | 3.63 | 3.69 | 3.57 |
| FireRedTTS | 3.79 | 3.72 | 3.60 | 3.70 |
| XTTS | 3.23 | 2.99 | 3.10 | 3.11 |
| IndexTTS | 3.79 | 4.20 | 4.05 | 4.01 |
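
As a quick sanity check on the table above, the AVG column can be recomputed from the three component scores. This sketch copies the table values and assumes AVG is the plain mean of Prosody, Timbre, and Quality, with a tolerance of 0.01 to allow for rounding.

```python
# (Prosody, Timbre, Quality, AVG) copied from the MOS table above.
mos = {
    "CosyVoice 2": (3.67, 4.05, 3.73, 3.81),
    "F5TTS": (3.56, 3.88, 3.56, 3.66),
    "Fishspeech": (3.40, 3.63, 3.69, 3.57),
    "FireRedTTS": (3.79, 3.72, 3.60, 3.70),
    "XTTS": (3.23, 2.99, 3.10, 3.11),
    "IndexTTS": (3.79, 4.20, 4.05, 4.01),
}


def avg_is_consistent(row, tol: float = 0.01) -> bool:
    """Check that the reported AVG matches the mean of the three scores within `tol`."""
    prosody, timbre, quality, reported = row
    return abs((prosody + timbre + quality) / 3.0 - reported) <= tol


if __name__ == "__main__":
    for name, row in mos.items():
        print(name, "ok" if avg_is_consistent(row) else "mismatch")
```
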
Clone the repository and create a new Conda environment:
git clone https://github.com/index-tts/index-tts.git
conda create -n index-tts python=3.10
conda activate index-tts
apt-get install ffmpeg
# or: conda install -c conda-forge ffmpeg
Install PyTorch (example for CUDA 11.8):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
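
After installing, it can help to confirm that PyTorch imports and sees the GPU. The sketch below uses only `torch.__version__` and `torch.cuda.is_available()`; the helper name `describe_torch` is introduced here for illustration.

```python
import importlib.util


def describe_torch() -> str:
    """Report whether torch is installed and whether it can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch  # imported lazily so the check also runs without torch

    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


if __name__ == "__main__":
    print(describe_torch())
```

If CUDA is reported as unavailable on a GPU machine, the installed wheel may not match the local CUDA driver.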
Windows users encountering pynini build errors should install it via Conda:
conda install -c conda-forge pynini==2.1.6
pip install WeTextProcessing --no-deps
Install IndexTTS as an editable package:
cd index-tts
pip install -e .
Download the model checkpoints with huggingface-cli:
huggingface-cli download IndexTeam/IndexTTS-1.5 \
config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
--local-dir checkpoints
For faster downloads in China, use the mirror endpoint:
export HF_ENDPOINT="https://hf-mirror.com"
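
Before running inference, it is worth checking that every file from the download command actually arrived. The sketch below uses only the standard library; the file list is copied from the command above, and `missing_checkpoints` is a helper name introduced here.

```python
from pathlib import Path
from typing import List

# File names copied from the huggingface-cli command above.
REQUIRED_FILES: List[str] = [
    "config.yaml",
    "bigvgan_discriminator.pth",
    "bigvgan_generator.pth",
    "bpe.model",
    "dvae.pth",
    "gpt.pth",
    "unigram_12000.vocab",
]


def missing_checkpoints(model_dir: str = "checkpoints") -> List[str]:
    """Return the required files that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all files present" if not missing else f"missing: {missing}")
```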
Place a reference audio file named input.wav in the test_data folder, then run:
python indextts/infer.py
Command-line usage:
indextts "Your text here." --voice reference_voice.wav --model_dir checkpoints --config checkpoints/config.yaml --output output.wav
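
When the command-line tool is driven from a script, building the argument list in Python avoids shell-quoting problems in the text. This is a sketch that reuses only the flags shown above; `build_indextts_command` and `run_indextts` are helper names introduced here, and it assumes the `indextts` executable is on the PATH.

```python
import shlex
import subprocess
from typing import List


def build_indextts_command(text: str, voice: str, output: str,
                           model_dir: str = "checkpoints") -> List[str]:
    """Build the argument list for the indextts CLI using the flags shown above."""
    return [
        "indextts", text,
        "--voice", voice,
        "--model_dir", model_dir,
        "--config", f"{model_dir}/config.yaml",
        "--output", output,
    ]


def run_indextts(text: str, voice: str, output: str) -> None:
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_indextts_command(text, voice, output), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_indextts(...) to execute it.
    print(shlex.join(build_indextts_command("Your text here.", "reference_voice.wav", "output.wav")))
```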
To launch the web demo, install the optional web UI dependencies and start the server:
pip install -e ".[webui]" --no-build-isolation
python webui.py
Access the interface at http://127.0.0.1:7860.
Python API usage:
from indextts.infer import IndexTTS
tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")
voice = "reference_voice.wav"
text = "Your long-form text goes here."
tts.infer(voice, text, "output.wav")
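
For several utterances in the same voice, the `tts.infer(voice, text, output_path)` call shown above can be wrapped in a loop. The sketch below assumes only that call signature; `synthesize_batch`, the `outputs` directory, and the `utt_NNN.wav` naming are choices made here for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def synthesize_batch(tts, voice: str, texts: Iterable[str], out_dir: str = "outputs") -> List[str]:
    """Synthesize each non-empty text with `tts.infer` and return the output paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[str] = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip blank entries
        path = str(out / f"utt_{len(paths) + 1:03d}.wav")
        tts.infer(voice, text, path)  # same call as in the single-utterance example
        paths.append(path)
    return paths
```

With a loaded model, `synthesize_batch(tts, "reference_voice.wav", ["First sentence.", "Second sentence."])` would write `outputs/utt_001.wav` and `outputs/utt_002.wav`. Loading the model once and reusing it keeps the checkpoint load out of the loop.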