Index-TTS-LoRA is a fine-tuning toolkit for Bilibili's Index-TTS model. It supports LoRA (Low-Rank Adaptation) fine-tuning in both single-speaker and multi-speaker configurations, which improves the naturalness and prosody of the synthesized speech. The pipeline covers audio token extraction, speaker condition extraction, training, and inference.
First, extract the audio tokens and speaker conditions:
python tools/extract_codec.py --audio_list ${audio_list} --extract_condition
The audio_list file should contain pairs of audio file paths and their corresponding transcriptions, separated by a tab (\t). For example:
/path/to/audio.wav	小朋友们,大家好,我是凯叔,今天我们讲一个龟兔赛跑的故事。
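Because a missing tab is easy to overlook, it can help to check the list before running extraction. The following is a minimal sketch, not part of the toolkit: the `parse_audio_list` helper is a hypothetical name, and it assumes only the format described above (one `path<TAB>transcription` pair per line).

```python
def parse_audio_list(text):
    """Parse tab-separated `path<TAB>transcription` lines into (path, text) pairs.

    Hypothetical helper for illustration; it is not part of Index-TTS-LoRA.
    """
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the transcription survive.
        parts = line.split("\t", 1)
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(
                f"line {line_no}: expected '<audio path><TAB><transcription>'"
            )
        pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


sample = "/path/to/audio.wav\t小朋友们,大家好,我是凯叔,今天我们讲一个龟兔赛跑的故事。\n"
pairs = parse_audio_list(sample)
print(len(pairs), pairs[0][0])
```

A line that uses spaces instead of a tab raises `ValueError` with its line number, which makes formatting mistakes visible before the extraction step starts.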
Once the extraction is complete, the processed files and speaker_info.json will be saved in finetune_data/processed_data/. Below is a sample speaker_info.json file:
[
  {
    "speaker": "kaishu_30min",
    "avg_duration": 6.6729,
    "sample_num": 270,
    "total_duration_in_seconds": 1801.696,
    "total_duration_in_minutes": 30.028,
    "total_duration_in_hours": 0.500,
    "train_jsonl": "/path/to/kaishu_30min/metadata_train.jsonl",
    "valid_jsonl": "/path/to/kaishu_30min/metadata_valid.jsonl",
    "medoid_condition": "/path/to/kaishu_30min/medoid_condition.npy"
  }
]
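The duration fields in this file are redundant with one another, so a quick consistency check can catch a truncated or hand-edited file. The sketch below is illustrative only: the `check_durations` helper and the 1% tolerance are assumptions, and the JSON is inlined from the sample above rather than read from disk.

```python
import json
import math

# Inlined copy of the sample above; in practice the file would be read from
# finetune_data/processed_data/speaker_info.json.
SPEAKER_INFO = """
[
  {
    "speaker": "kaishu_30min",
    "avg_duration": 6.6729,
    "sample_num": 270,
    "total_duration_in_seconds": 1801.696,
    "total_duration_in_minutes": 30.028,
    "total_duration_in_hours": 0.500
  }
]
"""


def check_durations(entry, rel_tol=0.01):
    """Return True if the duration fields agree within rel_tol (assumed 1%)."""
    total_s = entry["total_duration_in_seconds"]
    checks = [
        # average clip length times clip count should give the total
        math.isclose(entry["avg_duration"] * entry["sample_num"], total_s, rel_tol=rel_tol),
        math.isclose(total_s / 60, entry["total_duration_in_minutes"], rel_tol=rel_tol),
        math.isclose(total_s / 3600, entry["total_duration_in_hours"], rel_tol=rel_tol),
    ]
    return all(checks)


speakers = json.loads(SPEAKER_INFO)
for entry in speakers:
    print(entry["speaker"], check_durations(entry))
```

For the sample entry, 270 clips at an average of 6.6729 seconds come to roughly 1801.7 seconds, or about 30 minutes, which matches the reported totals.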
To start the fine-tuning process, run:
python train.py
Generate speech using the following command:
python indextts/infer.py
| Text | Audio File |
|---|---|
| The old house clock stopped at 3 a.m. Dust drifted as a trail of strange footprints appeared. The detective knelt down and spotted a blood-stained ring hidden between floorboards. | kaishu_cn_1.wav |
| Under moonlight, a pumpkin sprouted a grinning face. Vines twisted, pushing open the garden gate. A little girl tiptoed closer, hearing mushrooms hum an ancient lullaby. | kaishu_cn_2.wav |
| For intermediate Java learning, you'll need to cover database management and external front-end application development, including JavaScript and dynamic websites. | kaishu_cn_en_mix_1.wav |
| This financial report breaks down the company's revenue performance and expenditure trends over the past quarter. | kaishu_cn_en_mix_2.wav |
| Up the hill, down the hill, three miles and three meters, climbed a high mountain at 330 meters above sea level. At the top, I shouted: "I'm three feet three taller than this mountain!" | kaishu_raokouling.wav |
| A thin man lies against the side of the street with his shirt and a shoe off and bags nearby. | kaishu_en_1.wav |
| As research continued, the protective effect of fluoride against dental decay was demonstrated. | kaishu_en_2.wav |
For background on the evaluation metrics, see "2025 Benchmark of Mainstream TTS Models: Who Is the Best Voice Synthesis Solution?". The table below reports the word error rate (WER) on each test set; lower is better.
| Model | test_zh (WER) | test_zh_en_mix (WER) | seed_test_zh (WER) | seed_test_en (WER) | seed_test_hard (WER) |
|---|---|---|---|---|---|
| IndexTTS-1.5 base model (zero-shot) | 2.46 | 3.56 | 1.28 | 2.08 | 6.27 |
| IndexTTS-1.5 Kaishu voice LoRA model | 2.55 | 3.76 | 1.67 | 2.52 | 8.89 |
Note: The zero-shot results for the base model were obtained using prompts taken directly from the test set.
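For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the number of reference words. The sketch below shows the standard definition only; it is not the evaluation script behind the table, and the whitespace tokenization is an assumption (Chinese test sets are typically scored per character instead).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Tokenizes on whitespace, which is an assumption made for this illustration.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{score * 100:.2f}%")
```

Multiplying the returned fraction by 100 gives a percentage.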