Index-TTS-LoRA: Fine-Tuning Voice Models for Natural Speech Synthesis

September 30. Published in Voice & Speech Tools

Index-TTS-LoRA is a fine-tuning toolkit for Bilibili's Index-TTS model. It supports LoRA (Low-Rank Adaptation) fine-tuning in both single-speaker and multi-speaker configurations, with the aim of producing synthesized speech that sounds more natural and has more fluid prosody. The pipeline covers audio token extraction, speaker conditioning, training, and inference.

Extract Audio Tokens & Speaker Conditions

To begin the process, execute the following command:

python tools/extract_codec.py --audio_list ${audio_list} --extract_condition

The audio_list file should contain pairs of audio file paths and their corresponding transcriptions, separated by a tab (\t). For example:

/path/to/audio.wav  小朋友们,大家好,我是凯叔,今天我们讲一个龟兔赛跑的故事。
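Before running the extraction, it can help to check that every line of the list really is tab-separated. The following is a minimal sketch, not part of the toolkit: parse_audio_list is a hypothetical helper, and it assumes one path/transcription pair per line.

```python
from typing import List, Tuple


def parse_audio_list(text: str) -> List[Tuple[str, str]]:
    """Parse 'audio_path<TAB>transcription' lines (hypothetical helper).

    Assumes one pair per line; blank lines are skipped.
    """
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        # Split on the first tab only, so the transcription may contain tabs.
        path, sep, transcript = line.partition("\t")
        if not sep or not path.strip() or not transcript.strip():
            raise ValueError(
                f"line {line_no}: expected '<audio path>\\t<transcription>'"
            )
        pairs.append((path.strip(), transcript.strip()))
    return pairs


sample = "/path/to/audio.wav\t小朋友们,大家好,我是凯叔,今天我们讲一个龟兔赛跑的故事。\n"
for path, transcript in parse_audio_list(sample):
    print(path, "->", transcript)
```

A line that uses spaces instead of a tab raises ValueError with the offending line number, which is easier to act on than a failure partway through extraction.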

Once the extraction is complete, the processed files and speaker_info.json will be saved in finetune_data/processed_data/. Below is a sample speaker_info.json file:

[
    {
        "speaker": "kaishu_30min",
        "avg_duration": 6.6729,
        "sample_num": 270,
        "total_duration_in_seconds": 1801.696,
        "total_duration_in_minutes": 30.028,
        "total_duration_in_hours": 0.500,
        "train_jsonl": "/path/to/kaishu_30min/metadata_train.jsonl",
        "valid_jsonl": "/path/to/kaishu_30min/metadata_valid.jsonl",
        "medoid_condition": "/path/to/kaishu_30min/medoid_condition.npy"
    }
]
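The fields in speaker_info.json are easy to cross-check against each other. The sketch below embeds the sample above as a string and recomputes the minutes and average clip length from the total duration; summarize_speakers is a hypothetical helper, and in practice the JSON would be read from finetune_data/processed_data/ rather than from a literal.

```python
import json

# The sample entry from above, embedded so the example runs on its own.
SAMPLE = """
[
    {
        "speaker": "kaishu_30min",
        "avg_duration": 6.6729,
        "sample_num": 270,
        "total_duration_in_seconds": 1801.696,
        "total_duration_in_minutes": 30.028,
        "total_duration_in_hours": 0.500,
        "train_jsonl": "/path/to/kaishu_30min/metadata_train.jsonl",
        "valid_jsonl": "/path/to/kaishu_30min/metadata_valid.jsonl",
        "medoid_condition": "/path/to/kaishu_30min/medoid_condition.npy"
    }
]
"""


def summarize_speakers(entries):
    """Recompute derived duration fields for each speaker (hypothetical helper)."""
    rows = []
    for entry in entries:
        seconds = entry["total_duration_in_seconds"]
        rows.append({
            "speaker": entry["speaker"],
            "minutes": round(seconds / 60, 3),
            "avg_from_total": round(seconds / entry["sample_num"], 4),
        })
    return rows


for row in summarize_speakers(json.loads(SAMPLE)):
    print(row)
```

For the sample entry, the recomputed values agree with the stored total_duration_in_minutes and avg_duration fields.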

Training

To start the fine-tuning process, run:

python train.py

Inference

Generate speech using the following command:

python indextts/infer.py

Speech Synthesis Examples

Text | Audio File
The old house clock stopped at 3 a.m. Dust drifted as a trail of strange footprints appeared. The detective knelt down and spotted a blood-stained ring hidden between floorboards. | kaishu_cn_1.wav
Under moonlight, a pumpkin sprouted a grinning face. Vines twisted, pushing open the garden gate. A little girl tiptoed closer, hearing mushrooms hum an ancient lullaby. | kaishu_cn_2.wav
For intermediate Java learning, you'll need to cover database management and external front-end application development, including JavaScript and dynamic websites. | kaishu_cn_en_mix_1.wav
This financial report breaks down the company's revenue performance and expenditure trends over the past quarter. | kaishu_cn_en_mix_2.wav
Up the hill, down the hill, three miles and three meters, climbed a high mountain at 330 meters above sea level. At the top, I shouted: "I'm three feet three taller than this mountain!" | kaishu_raokouling.wav
A thin man lies against the side of the street with his shirt and a shoe off and bags nearby. | kaishu_en_1.wav
As research continued, the protective effect of fluoride against dental decay was demonstrated. | kaishu_en_2.wav

Model Evaluation

For more detail on the evaluation setup, see "2025 Benchmark of Mainstream TTS Models: Who Is the Best Voice Synthesis Solution?". The table below reports the Word Error Rate (WER) on each test set.

Model                                | test_zh (WER) | test_zh_en_mix (WER) | seed_test_zh (WER) | seed_test_en (WER) | seed_test_hard (WER)
IndexTTS-1.5 base model (zero-shot)  | 2.46          | 3.56                 | 1.28               | 2.08               | 6.27
IndexTTS-1.5 Kaishu voice LoRA model | 2.55          | 3.76                 | 1.67               | 2.52               | 8.89

Note: The zero-shot results for the base model were obtained using prompts taken directly from the test set.
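
For readers unfamiliar with the metric, WER is conventionally the edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the reference length. The sketch below shows that standard definition only; it is not the evaluation script behind the table, and the tokenization (words here, often characters for Chinese) is an assumption left to the caller.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / len(reference).

    Both arguments are token sequences; how text is tokenized is the
    caller's choice and is an assumption of this sketch.
    """
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    # Levenshtein distance, keeping only the previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on mat".split()
hypothesis = "the cat sit on".split()
print(wer(reference, hypothesis))  # one substitution + one deletion over 5 words
```

Lower values are better, and a value of 0 means the recognized transcript matches the reference exactly.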