VibeVoice is a text-to-speech (TTS) framework for generating expressive, long-form dialogue with multiple speakers, such as a podcast. It targets three problems that traditional TTS systems handle poorly: scalability, speaker consistency, and the natural rhythm of human conversation.
The system relies on two core technical innovations: an ultra-low-frame-rate continuous speech tokenizer that processes both acoustic and semantic information, paired with a next-token diffusion framework. This architecture leverages a large language model to interpret context and maintain dialogue flow. Consequently, the model can produce up to 90 minutes of continuous synthetic speech featuring as many as four distinct voices.
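To see why a low frame rate matters for long-form output, here is a rough back-of-the-envelope sketch. The 7.5 Hz rate is the figure reported for VibeVoice's tokenizers but is not stated in this article, and the 50 Hz rate is purely an illustrative comparison point; treat both as assumptions.

```python
def speech_frames(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Assumption: 7.5 Hz is the reported VibeVoice tokenizer rate (not given in this article).
LOW_RATE_HZ = 7.5
# Assumption: 50 Hz is a hypothetical higher-rate tokenizer, used only for comparison.
HIGH_RATE_HZ = 50.0

low = speech_frames(90, LOW_RATE_HZ)
high = speech_frames(90, HIGH_RATE_HZ)
print(f"90 min at {LOW_RATE_HZ} Hz: {low} frames")
print(f"90 min at {HIGH_RATE_HZ} Hz: {high} frames")
```

Under these assumptions, a 90-minute session needs about 40,500 frames rather than 270,000, which keeps the sequence the language model must track far shorter.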
1p_EN2CH.mp4: A single speaker transitioning from English to Chinese.
2p_see_u_again.mp4: Two speakers performing a spontaneous song.
4p_climate_45min.mp4: A four-speaker conversation regarding climate change lasting 45 minutes.

For managing CUDA dependencies, we recommend using NVIDIA’s deep learning containers.
Launch a Docker container. Select version 24.07, 24.10, or 24.12 of the NVIDIA PyTorch container (newer versions are likely compatible). Execute the following:
sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3
Install Flash Attention. If Flash Attention is not included in your environment, install it manually. Refer to the [flash-attention](https://github.com/Dao-AILab/flash-attention) repository for details, then run:
pip install flash-attn --no-build-isolation
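One way to tell whether the manual install is needed is to ask Python whether the package is already importable. This is a minimal sketch; `has_module` is a hypothetical helper name, and `flash_attn` is the import name of the `flash-attn` package.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# `flash_attn` is the import name that the flash-attn package installs.
if has_module("flash_attn"):
    print("flash_attn found; no manual install needed")
else:
    print("flash_attn missing; run: pip install flash-attn --no-build-isolation")
```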
Clone and install the repository:
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
pip install -e .
Launch the Gradio interface:
Update the package list and install ffmpeg (apt update && apt install ffmpeg -y), then run:
python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
Execute inference via a text file:
Reference scripts are located in demo/text_examples/.
For a single speaker:
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice
For multiple speakers:
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_zh.txt --speaker_names Alice Yunfan
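When several scripts need to be rendered, the command lines above can be assembled programmatically. The sketch below only builds the argument list; `build_inference_command` is a hypothetical helper, and the flags are the ones shown in the commands above.

```python
import shlex


def build_inference_command(txt_path, speaker_names,
                            model_path="microsoft/VibeVoice-1.5B"):
    """Build the argv list for demo/inference_from_file.py (hypothetical helper)."""
    if not speaker_names:
        raise ValueError("at least one speaker name is required")
    return [
        "python", "demo/inference_from_file.py",
        "--model_path", model_path,
        "--txt_path", txt_path,
        "--speaker_names", *speaker_names,
    ]


cmd = build_inference_command("demo/text_examples/2p_zh.txt", ["Alice", "Yunfan"])
# shlex.join renders the list as a shell-safe string for inspection.
print(shlex.join(cmd))
```

To actually launch inference, the list could be passed to `subprocess.run(cmd, check=True)` from the repository root.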
Deepfake and Misinformation Risks: High-quality synthetic speech can be misused, for example to impersonate people or spread misinformation. Users are responsible for verifying the accuracy of transcripts and avoiding deceptive applications. Comply with all local regulations and clearly disclose the use of AI when sharing generated content.
Language Support: The model currently supports English and Chinese only. Attempting to use other languages may yield unpredictable or low-quality results.
Non-Speech Audio: VibeVoice is designed specifically for vocal synthesis. It does not generate background noise, music, or sound effects.
Overlapping Speech: The current iteration of the model does not support or generate overlapping dialogue between speakers.
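Some of these limits can be checked before a script is sent to the model. The following is a heuristic sketch, not part of VibeVoice: it assumes transcript lines begin with `Speaker N:`, uses the four-voice ceiling mentioned earlier, and treats ASCII letters and CJK Unified Ideographs as a rough proxy for English and Chinese. `check_script` and `is_supported_letter` are hypothetical names.

```python
import re

MAX_SPEAKERS = 4  # upper bound on distinct voices stated in the article
# Assumption: each turn starts with "Speaker N:" at the beginning of a line.
SPEAKER_LINE = re.compile(r"^Speaker\s+(\d+)\s*:", re.MULTILINE)


def is_supported_letter(ch: str) -> bool:
    """Heuristic: ASCII letters (English) or CJK Unified Ideographs (Chinese)."""
    return ch.isascii() or "\u4e00" <= ch <= "\u9fff"


def check_script(text: str) -> list[str]:
    """Return human-readable warnings for a transcript; an empty list means no issues found."""
    warnings = []
    speakers = set(SPEAKER_LINE.findall(text))
    if len(speakers) > MAX_SPEAKERS:
        warnings.append(
            f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} are supported"
        )
    unsupported = sorted(
        {ch for ch in text if ch.isalpha() and not is_supported_letter(ch)}
    )
    if unsupported:
        warnings.append("letters outside English/Chinese: " + "".join(unsupported))
    return warnings


print(check_script("Speaker 1: Hello there.\nSpeaker 2: 你好。"))
```

The letter check is deliberately coarse: accented Latin characters such as "é" are flagged even in otherwise English text, so the warnings are prompts for review rather than hard errors.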
VibeVoice is intended strictly for research purposes. Do not use the model for commercial applications or production environments without extensive further testing and development. Please use this technology responsibly.