ChatTTS: A Text-to-Speech Model Optimized for Dialogue

May 5 · Published in Voice & Speech Tools

ChatTTS is a speech synthesis model designed for conversational scenarios, which makes it a good fit for LLM-based assistants. It supports both English and Chinese and was trained on more than 100,000 hours of speech across the two languages.

Conversational Speech Synthesis

ChatTTS is optimized for the nuances of dialogue. It produces lifelike, expressive speech and features multi-speaker capabilities, which simplifies the process of creating interactive, human-like conversations.

Fine-Grained Control

The model allows users to predict and control subtle prosodic features, including laughter, strategic pauses, and common filler words such as “uh” or “um.”

Superior Prosody

In prosody (the rhythm and intonation of speech), ChatTTS surpasses most open-source TTS models currently available.

How to Use ChatTTS

Basic Usage

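from ChatTTS import Chat
from IPython.display import Audio

# Load the pretrained models
chat = Chat()
chat.load_models()

# infer() takes a list of texts and returns one waveform per text
texts = ["<Your text here>"]
wavs = chat.infer(texts, use_decoder=True)

# Play the first waveform in a notebook at 24 kHz
Audio(wavs[0], rate=24_000, autoplay=True)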
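
Outside a notebook, the waveform can be written to disk instead of played inline. The following is a minimal sketch using only the standard library. It assumes the audio is a flat sequence of floats in [-1, 1] sampled at 24 kHz; `save_wav` is a hypothetical helper, not part of ChatTTS, and a synthetic tone stands in for model output. An array returned by the model may need flattening into such a sequence first.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # same rate as the playback call above


def save_wav(samples, path, rate=SAMPLE_RATE):
    """Write a flat sequence of floats in [-1, 1] as 16-bit mono PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))        # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)     # mono
        f.setsampwidth(2)     # 2 bytes = 16 bits per sample
        f.setframerate(rate)
        f.writeframes(bytes(frames))


# Stand-in for model output: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "chattts_demo.wav")
save_wav(tone, out_path)
```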

Advanced: Sampling a Speaker from a Gaussian Distribution

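import torch

# spk_stat.pt holds the speaker-embedding statistics: std first, then mean
std, mean = torch.load('ChatTTS/asset/spk_stat.pt').chunk(2)
# Draw a random 768-dimensional speaker embedding from N(mean, std^2)
rand_spk = torch.randn(768) * std + mean

params_infer_code = {
    'spk_emb': rand_spk,  # Your sampled speaker embedding
    'temperature': 0.3,   # lower values make sampling less random
    'top_P': 0.7,         # nucleus (top-p) sampling threshold
    'top_K': 20           # sample only from the 20 most likely tokens
}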
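
Because the speaker is drawn at random, a voice cannot be recovered later unless the draw is reproducible. The sketch below shows the same "standard normal times std plus mean" sampling in plain Python with a fixed seed. The statistics are placeholder values, not the contents of spk_stat.pt, and `sample_speaker` is a hypothetical helper introduced for illustration.

```python
import random


def sample_speaker(mean, std, seed):
    """Draw one embedding: each dimension is mean[i] + std[i] * z, z ~ N(0, 1)."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the draw
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


# Placeholder statistics (hypothetical, NOT the real spk_stat.pt values).
DIM = 768
mean = [0.0] * DIM
std = [1.0] * DIM

voice_a = sample_speaker(mean, std, seed=42)
voice_a_again = sample_speaker(mean, std, seed=42)  # same seed, same voice
voice_b = sample_speaker(mean, std, seed=7)         # different seed, different voice
```

With PyTorch, calling `torch.manual_seed(seed)` before `torch.randn(768)` should give the same kind of reproducibility.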

Sentence-Level Manual Control

params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_6]'  # Incorporating special tokens into your text
}

wav = chat.infer("<Your text here>",
                 params_refine_text=params_refine_text,
                 params_infer_code=params_infer_code)

Word-Level Manual Control

text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wav = chat.infer(text, skip_refine_text=True,
                 params_infer_code=params_infer_code)
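
When control tokens are written by hand, it helps to list them or strip them out, for example to check for typos or to log the plain sentence. The sketch below assumes every control token is a bracketed run of lowercase letters, digits, and underscores, which matches the tokens shown in this article; both helper names are hypothetical.

```python
import re

# Matches bracketed tokens such as [uv_break], [laugh], [lbreak].
CONTROL_TOKEN = re.compile(r"\[[a-z_0-9]+\]")


def control_tokens(text):
    """Return the control tokens in order of appearance."""
    return CONTROL_TOKEN.findall(text)


def strip_control_tokens(text):
    """Return the text with all control tokens removed."""
    return CONTROL_TOKEN.sub("", text)


text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
print(control_tokens(text))        # ['[uv_break]', '[laugh]', '[lbreak]']
print(strip_control_tokens(text))  # What is your favorite english food?
```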

Example: Self-Introduction

inputs_en = """chat T T S is a text to speech model designed for dialogue applications.
[uv_break]it supports mixed language input
and offers multi speaker capabilities with precise control over prosodic elements
[laugh]like like
laughter[laugh],
[uv_break]pauses,
[uv_break]and intonation.
[uv_break]it delivers natural and expressive speech,
so please
use the project responsibly at your own risk.[uv_break]""".replace('\n', ' ')  # join the lines into one string

params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_4]'
}

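# inputs_cn (a Chinese counterpart of inputs_en) is not defined in this article;
# define it before uncommenting the next line.
# audio_array_cn = chat.infer(inputs_cn, params_refine_text=params_refine_text)
audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)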

Hardware & Performance

Generating a 30-second audio clip requires a minimum of 4 GB of GPU memory. On an NVIDIA RTX 4090D, the model generates audio corresponding to roughly 7 semantic tokens per second. The Real-Time Factor (RTF) is approximately 0.65.
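
To put the RTF figure in concrete terms: taking RTF as synthesis time divided by audio duration (the usual definition, assumed here), the expected wait for a clip is a single multiplication. The function name below is hypothetical.

```python
def synthesis_seconds(audio_seconds, rtf):
    """Estimated wall-clock time to synthesize a clip of the given length."""
    return audio_seconds * rtf


clip_seconds = 30.0
rtf = 0.65  # figure quoted above for the RTX 4090D
estimate = synthesis_seconds(clip_seconds, rtf)
print(f"{estimate:.1f} s")  # 19.5 s
```

An RTF below 1 means the clip is produced faster than it plays back.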

Note that the model is still in development and may exhibit occasional instability. Users might encounter issues such as unexpected speaker switching or fluctuations in audio quality. These behaviors are common in autoregressive models like Bark or VALL-E and are difficult to eliminate entirely. For best results, it is often helpful to run several samples to find the highest-quality output.
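
One way to organize that sampling is a best-of-N loop: generate one candidate per seed, rate each one, and keep the best. The sketch below is generic and deliberately leaves out the model. `generate` and `score` are hypothetical callables supplied by the user; in practice the rating may simply be a human listening to each sample.

```python
def best_of_n(generate, score, seeds):
    """Generate one candidate per seed and return (best_seed, best_candidate)."""
    best = None
    for seed in seeds:
        candidate = generate(seed)
        s = score(candidate)
        if best is None or s > best[0]:   # ties keep the earlier seed
            best = (s, seed, candidate)
    if best is None:
        raise ValueError("seeds must not be empty")
    return best[1], best[2]


# Toy stand-ins so the sketch runs without the model.
def fake_generate(seed):
    return [seed % 5] * 4


def fake_score(candidate):
    return sum(candidate)


best_seed, best_audio = best_of_n(fake_generate, fake_score, seeds=range(8))
```

Returning the seed along with the audio makes it possible to regenerate the winning sample later.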