VibeVoice is a text-to-speech (TTS) framework designed to generate expressive, long-form dialogue involving multiple speakers—similar in quality and structure to a podcast. The model addresses three critical limitations found in traditional TTS systems: scalability, speaker consistency, and the natural rhythm of human conversation.
The system relies on two core technical innovations: an ultra-low-frame-rate continuous speech tokenizer that processes both acoustic and semantic information, paired with a next-token diffusion framework. This architecture leverages a large language model to interpret context and maintain dialogue flow. Consequently, the model can produce up to 90 minutes of continuous synthetic speech featuring as many as four distinct voices.
1p_EN2CH.mp4 – A single speaker transitioning from English to Chinese.
2p_see_u_again.mp4 – Two speakers performing a spontaneous song.
4p_climate_45min.mp4 – A four-speaker conversation regarding climate change lasting 45 minutes.
For managing CUDA dependencies, we recommend using NVIDIA’s deep learning containers.
Launch a Docker container. Select version 24.07, 24.10, or 24.12 of the NVIDIA PyTorch container (newer versions are likely compatible). Execute the following:
sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3
Install Flash Attention. If Flash Attention is not included in your environment, install it manually. Refer to the [flash-attention](https://github.com/Dao-AILab/flash-attention) repository for details, then run:
pip install flash-attn --no-build-isolation
Clone and install the repository:
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
pip install -e .
Launch the Gradio interface:
Update the package list and install ffmpeg (apt update && apt install ffmpeg -y), then run:
python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
Execute inference via a text file:
Reference scripts are located in demo/text_examples/.
For a single speaker:
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice
For multiple speakers:
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_zh.txt --speaker_names Alice Yunfan
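When running many scripts in a batch, it can help to assemble the commands above programmatically. The following is a minimal sketch, not part of the VibeVoice repository: the helper name `build_inference_command` is hypothetical, and it uses only the three flags shown above (`--model_path`, `--txt_path`, `--speaker_names`). The four-speaker check reflects the limit stated earlier in this article; whether the script itself enforces it is not confirmed here.

```python
import shlex
import subprocess

# Hypothetical helper (not part of VibeVoice): builds the argument list
# for demo/inference_from_file.py using only the flags shown in this article.
def build_inference_command(txt_path, speaker_names,
                            model_path="microsoft/VibeVoice-1.5B"):
    if not speaker_names:
        raise ValueError("at least one speaker name is required")
    if len(speaker_names) > 4:
        # The article states the model handles up to four distinct voices.
        raise ValueError("at most four speakers are supported")
    return [
        "python", "demo/inference_from_file.py",
        "--model_path", model_path,
        "--txt_path", txt_path,
        "--speaker_names", *speaker_names,
    ]

if __name__ == "__main__":
    cmd = build_inference_command(
        "demo/text_examples/2p_zh.txt", ["Alice", "Yunfan"]
    )
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))
    # To actually run it from the repository root, uncomment:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a file path contains spaces.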
Deepfake and Misinformation Risks: High-quality synthetic speech can be misused. Users are responsible for verifying the accuracy of transcripts and for avoiding deceptive applications. Comply with all local regulations and clearly disclose the use of AI when sharing generated content.
Language Support: The model currently supports English and Chinese only. Attempting to use other languages may yield unpredictable or low-quality results.
Non-Speech Audio: VibeVoice is designed specifically for vocal synthesis. It does not generate background noise, music, or sound effects.
Overlapping Speech: The current iteration of the model does not support or generate overlapping dialogue between speakers.
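Because unsupported languages can yield unpredictable output, a quick script check before synthesis may save a long run. The sketch below is a rough heuristic and an assumption of this article, not a VibeVoice feature: the function name `unsupported_chars` is hypothetical, and it only flags letters whose Unicode names are neither Latin nor CJK. It cannot tell English apart from other Latin-script languages such as French or German.

```python
import unicodedata

# Hypothetical pre-flight check (not part of VibeVoice): collects letters
# that belong to neither the Latin nor the CJK Unicode blocks.
def unsupported_chars(text):
    flagged = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if not (name.startswith("LATIN") or name.startswith("CJK")):
            flagged.add(ch)
    return flagged

if __name__ == "__main__":
    script = "Hello, 你好! Привет"
    bad = unsupported_chars(script)
    if bad:
        print("Characters outside Latin/CJK:", "".join(sorted(bad)))
```

An empty result does not guarantee good output; it only rules out scripts that are clearly outside English and Chinese.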
VibeVoice is intended strictly for research purposes. Do not use the model for commercial applications or production environments without extensive further testing and development. Please use this technology responsibly.