ChatTTS is a speech synthesis model designed specifically for conversational scenarios such as LLM-based assistants. It supports both English and Chinese and was trained on more than 100,000 hours of data spanning the two languages.
Conversational Speech Synthesis
ChatTTS is optimized for the nuances of dialogue. It produces lifelike, expressive speech and features multi-speaker capabilities, which simplifies the process of creating interactive, human-like conversations.
Fine-Grained Control
The model allows users to predict and control subtle prosodic features, including laughter, strategic pauses, and common filler words such as “uh” or “um.”
Superior Prosody
In terms of prosody (the rhythm and intonation of speech), ChatTTS is reported to surpass most open-source TTS models currently available.
Basic Usage
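from ChatTTS import Chat
from IPython.display import Audio  # inline playback in a Jupyter notebook
chat = Chat()
chat.load_models()  # load the model weights
texts = ["<Your text here>"]
wavs = chat.infer(texts, use_decoder=True)  # one waveform per input text
# Play the first result at the 24 kHz output sample rate
Audio(wavs[0], rate=24_000, autoplay=True)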
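Outside a notebook, the waveform can be written to disk instead of played inline. The following is a minimal sketch using only Python's standard library. It assumes the waveform is a flat sequence of floats in the range [-1, 1] sampled at 24 kHz (the rate used above), which may need adapting to the actual array returned by chat.infer. The helper name save_wav is hypothetical, and a synthetic tone stands in for model output.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # matches the rate passed to Audio(...) above


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Hypothetical helper; assumes `samples` is a flat sequence of floats.
    """
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))            # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz tone instead of model output
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
save_wav("demo_tone.wav", tone)
```

With real model output, a call along the lines of save_wav("output.wav", wavs[0].flatten()) would be the equivalent, assuming wavs[0] is a NumPy array of floats.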
Advanced: Sampling a Speaker from a Gaussian Distribution
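import torch
# spk_stat.pt holds the standard deviation and mean of the speaker embeddings
std, mean = torch.load('ChatTTS/asset/spk_stat.pt').chunk(2)
# Draw a random 768-dimensional speaker embedding from N(mean, std^2)
rand_spk = torch.randn(768) * std + mean
params_infer_code = {
    'spk_emb': rand_spk,  # your sampled speaker embedding
    'temperature': 0.3,
    'top_P': 0.7,
    'top_K': 20,
}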
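Because torch.randn draws a fresh vector on every call, the sampled voice changes between runs. The sketch below illustrates the same "noise * std + mean" arithmetic in plain Python with a fixed seed, so that a speaker can be reproduced later. The statistics are made-up placeholders rather than the contents of spk_stat.pt, and sample_speaker is a hypothetical helper. With PyTorch, calling torch.manual_seed(seed) before torch.randn(768) serves the same purpose.

```python
import random

EMB_DIM = 768  # embedding size used in the snippet above


def sample_speaker(mean, std, seed):
    """Draw one embedding with element i distributed as N(mean[i], std[i]**2)."""
    rng = random.Random(seed)  # private generator, global random state untouched
    return [rng.gauss(0.0, 1.0) * s + m for m, s in zip(mean, std)]


# Placeholder statistics (assumption): the real values come from spk_stat.pt
mean = [0.0] * EMB_DIM
std = [1.0] * EMB_DIM

spk_a = sample_speaker(mean, std, seed=42)
spk_b = sample_speaker(mean, std, seed=42)
spk_c = sample_speaker(mean, std, seed=7)

print(spk_a == spk_b)  # same seed reproduces the same speaker
print(spk_a == spk_c)  # a different seed gives a different speaker
```

Recording the seed (or saving the embedding itself) next to a generated clip is one way to return to a voice that sounded right.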
Sentence-Level Manual Control
params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_6]'  # Incorporating special tokens into your text
}
wav = chat.infer("<Your text here>",
                 params_refine_text=params_refine_text,
                 params_infer_code=params_infer_code)
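The refine prompt is a plain concatenation of bracketed tokens, so it can be assembled from numeric levels rather than typed by hand. A small sketch follows; build_refine_prompt is a hypothetical helper and not part of ChatTTS, and it does not check which level ranges the model actually accepts.

```python
def build_refine_prompt(oral, laugh, break_level):
    """Return a prompt such as '[oral_2][laugh_0][break_6]'.

    Hypothetical helper; valid ranges for each level are not checked here.
    """
    levels = {"oral": oral, "laugh": laugh, "break": break_level}
    for name, value in levels.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(
                f"{name} level must be a non-negative integer, got {value!r}")
    return "".join(f"[{name}_{value}]" for name, value in levels.items())


params_refine_text = {
    "prompt": build_refine_prompt(oral=2, laugh=0, break_level=6)
}
print(params_refine_text["prompt"])  # [oral_2][laugh_0][break_6]
```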
Word-Level Manual Control
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wav = chat.infer(text, skip_refine_text=True,
                 params_infer_code=params_infer_code)
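Because word-level control relies on tokens typed directly into the text, a typo or a missing bracket is easy to overlook. The sketch below lists the bracketed tokens in a string and flags any that are not among the ones used in this article. The KNOWN_TOKENS set is an assumption drawn only from the examples here, not a complete list of what the model supports, and check_tokens is a hypothetical helper.

```python
import re

# Tokens that appear in this article's examples (assumption: not exhaustive)
KNOWN_TOKENS = {"[uv_break]", "[laugh]", "[lbreak]"}

TOKEN_PATTERN = re.compile(r"\[[^\[\]]*\]")  # any text enclosed in square brackets


def check_tokens(text):
    """Return (found, unknown): all bracketed tokens and those not in KNOWN_TOKENS."""
    found = TOKEN_PATTERN.findall(text)
    unknown = [tok for tok in found if tok not in KNOWN_TOKENS]
    return found, unknown


text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
found, unknown = check_tokens(text)
print(found)    # ['[uv_break]', '[laugh]', '[lbreak]']
print(unknown)  # []
```

Running such a check before calling chat.infer with skip_refine_text=True gives an early warning about misspelled tokens, which would otherwise be passed to the model as ordinary text.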
Example: Self-Introduction
inputs_en = """chat T T S is a text to speech model designed for dialogue applications.
[uv_break]it supports mixed language input
and offers multi speaker capabilities with precise control over prosodic elements
[laugh]like like
laughter[laugh],
[uv_break]pauses,
[uv_break]and intonation.
[uv_break]it delivers natural and expressive speech,
so please
use the project responsibly at your own risk.[uv_break]""".replace('\n', ' ')  # join the lines into a single string
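params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_4]'
}
# inputs_cn (a Chinese counterpart of inputs_en) is not defined in this example;
# define it before enabling the next line.
# audio_array_cn = chat.infer(inputs_cn, params_refine_text=params_refine_text)
audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)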
Hardware & Performance
Generating a 30-second audio clip requires at least 4 GB of GPU memory. On an NVIDIA RTX 4090D, the model generates roughly 7 semantic tokens per second, and the Real-Time Factor (RTF) is approximately 0.65.
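The real-time factor is conventionally the ratio of synthesis time to the duration of the audio produced, so a value below 1 means generation is faster than playback. Under that definition, the quoted figure translates into a rough wall-clock estimate as in this small sketch. The function name is illustrative, and actual timings depend on hardware and settings.

```python
def estimated_synthesis_seconds(audio_seconds, rtf):
    """Estimate wall-clock synthesis time from audio length and real-time factor."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


# A 30-second clip at the quoted RTF of roughly 0.65
print(estimated_synthesis_seconds(30, 0.65))  # 19.5
```

In other words, a 30-second clip would take on the order of 20 seconds to synthesize on comparable hardware.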
Note that the model is still in development and may exhibit occasional instability. Users might encounter issues such as unexpected speaker switching or fluctuations in audio quality. These behaviors are common in autoregressive models like Bark or VALL-E and are difficult to eliminate entirely. For best results, it is often helpful to run several samples to find the highest-quality output.