ChatTTS is a speech synthesis model designed for conversational scenarios such as LLM-based assistants. It supports both English and Chinese and was trained on more than 100,000 hours of data across the two languages.
Conversational Speech Synthesis
ChatTTS is optimized for the nuances of dialogue. It produces lifelike, expressive speech and features multi-speaker capabilities, which simplifies the process of creating interactive, human-like conversations.
Fine-Grained Control
The model can predict fine-grained prosodic features, including laughter, pauses, and filler words such as “uh” or “um,” and lets users control them directly.
Superior Prosody
In terms of prosody—the rhythm and intonation of language—ChatTTS consistently outperforms the majority of open-source TTS models currently available.
Basic Usage
from ChatTTS import Chat
from IPython.display import Audio
chat = Chat()
chat.load_models()  # Load the model weights
texts = ["<Your text here>"]
wavs = chat.infer(texts, use_decoder=True)  # One waveform per input text
Audio(wavs[0], rate=24_000, autoplay=True)  # Play the first result at 24 kHz
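Outside a notebook there is no Audio widget, so it can help to write the result to disk. The following is a minimal sketch using only the Python standard library. The helper name write_wav is hypothetical and not part of ChatTTS, and it assumes the waveform is a flat sequence of floats in the range [-1.0, 1.0] sampled at the same 24 kHz used above.

```python
import array
import sys
import wave

SAMPLE_RATE = 24_000  # same rate passed to Audio(...) in the basic example


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM.

    Hypothetical helper, not part of ChatTTS. Returns the number of frames written.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid integer overflow
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

With the wavs list from the basic example, a call might look like write_wav("output.wav", wavs[0].flatten()), assuming each entry is a NumPy array of floats.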
Advanced: Sampling a Speaker from a Gaussian Distribution
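import torch
# spk_stat.pt holds the standard deviation and mean of the speaker embeddings
std, mean = torch.load('ChatTTS/asset/spk_stat.pt').chunk(2)
# Draw a random 768-dimensional speaker embedding from that Gaussian
rand_spk = torch.randn(768) * std + mean
params_infer_code = {
    'spk_emb': rand_spk,  # Your sampled speaker embedding
    'temperature': .3,
    'top_P': 0.7,
    'top_K': 20
}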
Sentence-Level Manual Control
params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_6]'  # Incorporating special tokens into your text
}
wav = chat.infer("<Your text here>",
                 params_refine_text=params_refine_text,
                 params_infer_code=params_infer_code)
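The prompt string above is three bracketed tokens, each a name followed by an integer level. As a small sketch, it can be assembled from numbers instead of typed by hand. The helper build_refine_prompt is hypothetical and not part of ChatTTS, and it only checks that each level is a non-negative integer. The range the model actually accepts for each token is not covered here.

```python
def build_refine_prompt(oral, laugh, pause):
    """Assemble a prompt such as '[oral_2][laugh_0][break_6]'.

    Hypothetical helper, not part of ChatTTS. `pause` maps to the
    'break' token ('break' itself is a reserved word in Python).
    """
    parts = []
    for name, level in (("oral", oral), ("laugh", laugh), ("break", pause)):
        if not isinstance(level, int) or level < 0:
            raise ValueError(
                f"{name} level must be a non-negative integer, got {level!r}"
            )
        parts.append(f"[{name}_{level}]")
    return "".join(parts)


# Reproduces the prompt used in the sentence-level example
params_refine_text = {"prompt": build_refine_prompt(oral=2, laugh=0, pause=6)}
```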
Word-Level Manual Control
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
# skip_refine_text=True passes the hand-placed tokens through as written
wav = chat.infer(text, skip_refine_text=True,
                 params_infer_code=params_infer_code)
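Because word-level tokens are typed by hand, a stray or misspelled bracket is easy to miss. The sketch below scans a string before it is sent to the model. The function find_token_problems and the KNOWN_TOKENS set are hypothetical. The set contains only the tokens that appear in this article, so the full list ChatTTS accepts may be larger.

```python
import re

# Word-level tokens that appear in this article (assumption: not exhaustive)
KNOWN_TOKENS = {"uv_break", "laugh", "lbreak"}

_TOKEN_RE = re.compile(r"\[([^\[\]]*)\]")


def find_token_problems(text, known=KNOWN_TOKENS):
    """Return a list of human-readable problems found in `text`."""
    problems = []
    # Well-formed [name] tokens whose name is not in the known set
    for match in _TOKEN_RE.finditer(text):
        if match.group(1) not in known:
            problems.append(f"unknown token [{match.group(1)}] at {match.start()}")
    # Blank out well-formed tokens; any bracket left over has no partner
    blanked = _TOKEN_RE.sub(lambda m: " " * len(m.group(0)), text)
    for match in re.finditer(r"[\[\]]", blanked):
        problems.append(f"unmatched '{match.group(0)}' at {match.start()}")
    return problems
```

An empty list means every bracketed token is balanced and recognized. Anything else is worth a look before calling chat.infer with skip_refine_text=True.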
Example: Self-Introduction
inputs_en = """chat T T S is a text to speech model designed for dialogue applications.
[uv_break]it supports mixed language input
and offers multi speaker capabilities with precise control over prosodic elements
[laugh]like like
laughter[laugh],
[uv_break]pauses,
[uv_break]and intonation.
[uv_break]it delivers natural and expressive speech,
so please
use the project responsibly at your own risk.[uv_break]""".replace('\n', ' ')  # Join the lines into one string
params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_4]'
}
# A Chinese input string (for example, inputs_cn) can be passed the same way.
audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)
Hardware & Performance
Generating a 30-second audio clip requires at least 4 GB of GPU memory. On an NVIDIA RTX 4090D, the model generates roughly 7 semantic tokens per second, and the Real-Time Factor (RTF) is approximately 0.65.
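RTF is conventionally defined as synthesis time divided by the duration of the audio produced, so a value below 1 means generation runs faster than playback. As a quick worked example using the figures above, assuming that definition applies here (the function name is illustrative):

```python
def synthesis_time(audio_seconds, rtf):
    """Estimated wall-clock seconds needed to generate `audio_seconds` of audio."""
    return audio_seconds * rtf


clip_seconds = 30.0  # clip length quoted above
rtf = 0.65           # Real-Time Factor quoted above

estimate = synthesis_time(clip_seconds, rtf)
faster_than_real_time = rtf < 1.0

print(f"{clip_seconds:.0f} s of audio at RTF {rtf} takes about {estimate:.1f} s to generate")
```

By this estimate, the 30-second clip would take a little under 20 seconds to generate on the hardware described.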
Note that the model is still in development and may exhibit occasional instability. Users might encounter issues such as unexpected speaker switching or fluctuations in audio quality. These behaviors are common in autoregressive models like Bark or VALL-E and are difficult to eliminate entirely. For best results, it is often helpful to run several samples to find the highest-quality output.