VibeVoice: Long-Form Multi-Speaker TTS for Natural Dialogue Generation

Published August 26 in Voice & Speech Tools

VibeVoice is a text-to-speech (TTS) framework designed to generate expressive, long-form dialogue among multiple speakers, comparable in quality and structure to a podcast. It targets three challenges that limit traditional TTS systems: scalability, speaker consistency, and the natural rhythm of human conversation.

The system relies on two core technical innovations: an ultra-low-frame-rate continuous speech tokenizer that processes both acoustic and semantic information, paired with a next-token diffusion framework. This architecture leverages a large language model to interpret context and maintain dialogue flow. Consequently, the model can produce up to 90 minutes of continuous synthetic speech featuring as many as four distinct voices.
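
To see why a low frame rate matters for long-form generation, a rough back-of-the-envelope calculation helps. The sketch below assumes a tokenizer rate of 7.5 frames per second and a conventional baseline of 50 frames per second; neither figure is stated in this article, so treat both as illustrative assumptions rather than specifications.

```python
def frames_needed(minutes: float, frames_per_second: float) -> int:
    """Number of speech frames required to cover `minutes` of audio."""
    return round(minutes * 60 * frames_per_second)


# Assumed values for illustration only; not taken from the article.
ASSUMED_LOW_RATE_HZ = 7.5
ASSUMED_BASELINE_HZ = 50.0

low = frames_needed(90, ASSUMED_LOW_RATE_HZ)
baseline = frames_needed(90, ASSUMED_BASELINE_HZ)

print(f"90 minutes at {ASSUMED_LOW_RATE_HZ} Hz: {low} frames")
print(f"90 minutes at {ASSUMED_BASELINE_HZ} Hz: {baseline} frames")
```

Under these assumptions, the sequence the language model must attend over is several times shorter, which is what makes a 90-minute session plausible within a fixed context window.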

VibeVoice Demos

  • Cross-language capabilities: 1p_EN2CH.mp4 – A single speaker transitioning from English to Chinese.
  • Impromptu singing: 2p_see_u_again.mp4 – Two speakers performing a spontaneous song.
  • Extended group discussion: 4p_climate_45min.mp4 – A four-speaker conversation regarding climate change lasting 45 minutes.

Installation & Usage

Setup

For managing CUDA dependencies, we recommend using NVIDIA’s deep learning containers.

  1. Launch a Docker container. Select version 24.07, 24.10, or 24.12 of the NVIDIA PyTorch container (newer versions are likely compatible). Execute the following:

    sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3
    
  2. Install Flash Attention. If Flash Attention is not included in your environment, install it manually. Refer to the [flash-attention](https://github.com/Dao-AILab/flash-attention) repository for details, then run:

    pip install flash-attn --no-build-isolation
    
  3. Clone and install the repository:

    git clone https://github.com/microsoft/VibeVoice.git
    cd VibeVoice/
    pip install -e .
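
Step 2 only applies when Flash Attention is missing. One way to check from inside the container is to look for the package's import name, `flash_attn`, without importing it. The helper name `is_module_available` is hypothetical and used here only for illustration.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if `module_name` can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "flash_attn" is the import name of the flash-attn pip package.
    if is_module_available("flash_attn"):
        print("flash-attn found; step 2 can be skipped.")
    else:
        print("flash-attn not found; run: pip install flash-attn --no-build-isolation")
```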
    

Running VibeVoice

Launch the Gradio interface: First update the package list and install ffmpeg (apt update && apt install ffmpeg -y), then run:

    python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share

Execute inference via a text file: Reference scripts are located in demo/text_examples/.

For a single speaker:

    python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice

For multiple speakers:

    python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_zh.txt --speaker_names Alice Yunfan
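
When running many scripts, it can be convenient to assemble these commands programmatically. The sketch below is a minimal example that reuses only the flags shown above (`--model_path`, `--txt_path`, `--speaker_names`). The function name `build_inference_command` and the `MAX_SPEAKERS` check are hypothetical additions for illustration; the check mirrors the four-voice limit mentioned earlier and is not necessarily enforced by the script itself.

```python
import sys
from typing import List, Sequence

# Assumption: mirrors the "up to four distinct voices" limit described in the article.
MAX_SPEAKERS = 4


def build_inference_command(
    txt_path: str,
    speaker_names: Sequence[str],
    model_path: str = "microsoft/VibeVoice-1.5B",
) -> List[str]:
    """Build the argument list for demo/inference_from_file.py."""
    if not speaker_names:
        raise ValueError("at least one speaker name is required")
    if len(speaker_names) > MAX_SPEAKERS:
        raise ValueError(f"at most {MAX_SPEAKERS} speakers are supported")
    return [
        sys.executable,
        "demo/inference_from_file.py",
        "--model_path", model_path,
        "--txt_path", txt_path,
        "--speaker_names", *speaker_names,
    ]


command = build_inference_command("demo/text_examples/2p_zh.txt", ["Alice", "Yunfan"])
print(" ".join(command))
```

To actually run the command, pass the list to `subprocess.run(command, check=True)` from the repository root. Passing a list rather than a single string avoids shell-quoting problems with file paths that contain spaces.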

Caveats

  • Deepfake and Misinformation Risks: High-quality synthetic speech can be misused. Users are responsible for verifying the accuracy of transcripts and avoiding deceptive applications. Comply with all local regulations and clearly disclose the use of AI when sharing generated content.

  • Language Support: The model currently supports English and Chinese only. Attempting to use other languages may yield unpredictable or low-quality results.

  • Non-Speech Audio: VibeVoice is designed specifically for vocal synthesis. It does not generate background noise, music, or sound effects.

  • Overlapping Speech: The current version of the model does not generate overlapping speech; speakers always take turns one at a time.

VibeVoice is intended strictly for research purposes. Do not use the model for commercial applications or production environments without extensive further testing and development. Please use this technology responsibly.