HunyuanVideo-Avatar: Emotion-Controlled Multi-Person Video Generation

Published May 29 in Video Tools

HunyuanVideo-Avatar is a multimodal diffusion Transformer model developed by Tencent Hunyuan. It generates dynamic, emotion-controllable videos of speaking characters, including multi-person dialogue scenes. Given a character image and driving audio, it produces high-quality character animation synchronized with that audio.

Character Image Injection Module

The model uses a dedicated character image injection module in place of the conventional addition-based character conditioning. This removes the condition mismatch between training and inference, so character motion stays dynamic while identity remains consistent throughout the video.

Audio Emotion Module (AEM)

The Audio Emotion Module (AEM) extracts emotional cues from an emotion reference image and transfers them to the generated video. This gives fine-grained control over emotional style while keeping facial expressions synchronized with the audio track.

Face-Aware Audio Adapter (FAA)

The Face-Aware Audio Adapter (FAA) employs a latent-level face mask to isolate specific audio-driven subjects. In scenes featuring multiple people, the FAA injects audio signals independently through cross-attention mechanisms. This allows for realistic, back-and-forth interactions between different characters within the same frame.

Generate Characters in Any Style

• Supports multi-style character images at any aspect ratio or resolution, including realistic photos, cartoons, 3D renders, and anthropomorphic designs.
• Works with portrait, half-body, and full-body shots to accommodate diverse production requirements.

Generate High-Dynamic, Emotion-Controlled Videos

• Transforms a static character image into a high-dynamic video using only an audio file. The output features movement in both the foreground and the background, which adds to its realism.
• Controls facial expressions directly through the audio input, so the character's emotions match the tone and rhythm of the sound.

Animation Capability

The Face-Aware Audio Adapter enables the model to drive multiple characters independently using separate audio tracks. This capability is ideal for creating fluid, natural animations for complex multi-person dialogue scenes.

HunyuanVideo-Avatar Use Cases

E-commerce & Live Streaming: Create virtual hosts for product explainer videos to improve presentation quality and engagement.

Social Media Content Creation: Rapidly produce personalized animated videos tailored for short-form video platforms.

Film & Advertising Production: Streamline the animation of multi-character dialogue sequences, reducing the costs and logistical hurdles of live-action filming.

Education & Training: Develop interactive teaching animations to make knowledge transfer more engaging for learners.

System Requirements & Installation Guide

Hardware Requirements

GPU: An NVIDIA GPU with CUDA support is required. 96GB of VRAM is recommended for the best generation quality. The minimum is 24GB of VRAM, at which generation is considerably slower.

OS: Tested on Linux.
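
A quick preflight check can map an installed GPU onto the two memory figures above. The following is a minimal sketch, not part of the repository: the function name `classify_vram` and the three tier labels are made up for illustration, and rounding to the nearest GiB is an assumption made because cards sold as "24GB" often report slightly less than 24576 MiB. It relies on `nvidia-smi --query-gpu=memory.total`, which reports MiB.

```shell
# Hypothetical helper (not part of the repository): classify a GPU's total
# memory, given in MiB, against the 24GB / 96GB figures quoted above.
classify_vram() {
    local mib=$1
    # Round to the nearest GiB; this tolerance is an assumption of the sketch.
    local gib=$(( (mib + 512) / 1024 ))
    if [ "$gib" -ge 96 ]; then
        echo "recommended"
    elif [ "$gib" -ge 24 ]; then
        echo "minimum"
    else
        echo "insufficient"
    fi
}

# Query every visible GPU when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits |
    while read -r mib; do
        # Skip anything that is not a plain number (for example driver errors).
        case "$mib" in
            ''|*[!0-9]*) continue ;;
        esac
        echo "GPU with ${mib} MiB: $(classify_vram "$mib")"
    done
fi
```

A card in the "minimum" tier is a candidate for the single-GPU or CPU-offloading commands in the inference section rather than the 8-GPU command.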

Installation Steps
  1. Clone the repository

    git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar.git
    cd HunyuanVideo-Avatar
    
  2. Create Conda environment

    conda create -n HunyuanVideo-Avatar python==3.10.9
    conda activate HunyuanVideo-Avatar
    
  3. Install PyTorch and dependencies

    For CUDA 11.8

    conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
    

    For CUDA 12.4

    conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
    
  4. Install pip dependencies

    python -m pip install -r requirements.txt
    
  5. Install Flash Attention V2 (optional, for performance optimization)

    python -m pip install ninja
    python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
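
After the steps above, it can help to confirm that the active environment really holds the PyTorch build pinned in step 3. The sketch below is an illustrative check, not something the repository ships: `torch_matches_pin` is a hypothetical name, and the live check runs only when `torch` is importable. It uses `torch.__version__` and `torch.cuda.is_available()`, both standard PyTorch attributes.

```shell
# Hypothetical check (not from the repository): does a reported torch version
# match the 2.4.0 pin from step 3? Build suffixes such as "+cu124" are ignored.
torch_matches_pin() {
    [ "${1%%+*}" = "2.4.0" ]
}

# Live check, attempted only when torch is importable in the active environment.
if python -c "import torch" >/dev/null 2>&1; then
    version=$(python -c "import torch; print(torch.__version__)")
    if torch_matches_pin "$version"; then
        echo "torch $version matches the pinned 2.4.0"
    else
        echo "torch $version differs from the pinned 2.4.0" >&2
    fi
    # Confirms that this build can actually see a CUDA device.
    python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
fi
```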
    
Docker Deployment

For CUDA 12.4

    docker pull hunyuanvideo/hunyuanvideo:cuda_12
    docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
    # Run inside the container, for example after: docker exec -it hunyuanvideo bash
    pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

For CUDA 11.8

    docker pull hunyuanvideo/hunyuanvideo:cuda_11
    docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
    # Run inside the container, for example after: docker exec -it hunyuanvideo bash
    pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

Inference Instructions

Multi-GPU parallel inference (example using 8 GPUs)

    cd HunyuanVideo-Avatar
    export PYTHONPATH=./
    export MODEL_BASE="./weights"
    # Assumed output directory; the original command used OUTPUT_BASEPATH without defining it.
    OUTPUT_BASEPATH=./results
    checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt
    torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
        --input 'assets/test.csv' \
        --ckpt ${checkpoint_path} \
        --sample-n-frames 129 \
        --seed 128 \
        --image-size 704 \
        --cfg-scale 7.5 \
        --infer-steps 50 \
        --use-deepcache 1 \
        --flow-shift-eval-video 5.0 \
        --save-path ${OUTPUT_BASEPATH}

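
The command above reads an input CSV and a checkpoint file, and a missing path may only surface once the job has started. As a sketch under stated assumptions, a small preflight can check both paths first. `require_files` is a hypothetical helper name, not part of the repository; the two paths are the ones used in the command above.

```shell
# Hypothetical preflight: report each missing file and return non-zero if any is absent.
require_files() {
    local missing=0
    local path
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Fall back to the ./weights location used in this guide when MODEL_BASE is unset.
MODEL_BASE=${MODEL_BASE:-./weights}
if require_files assets/test.csv \
    "$MODEL_BASE/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt"; then
    echo "inputs present, safe to launch"
else
    echo "fix the paths above before calling torchrun" >&2
fi
```
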
Single-GPU inference

    cd HunyuanVideo-Avatar
    export PYTHONPATH=./
    export MODEL_BASE=./weights
    OUTPUT_BASEPATH=./results-single
    checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt
    export DISABLE_SP=1
    CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
        --input 'assets/test.csv' \
        --ckpt ${checkpoint_path} \
        --sample-n-frames 129 \
        --seed 128 \
        --image-size 704 \
        --cfg-scale 7.5 \
        --infer-steps 50 \
        --use-deepcache 1 \
        --flow-shift-eval-video 5.0 \
        --save-path ${OUTPUT_BASEPATH} \
        --use-fp8 \
        --infer-min

Low-VRAM inference (using CPU offloading)

    cd HunyuanVideo-Avatar
    export PYTHONPATH=./
    export MODEL_BASE=./weights
    OUTPUT_BASEPATH=./results-poor
    checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt
    export CPU_OFFLOAD=1
    CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
        --input 'assets/test.csv' \
        --ckpt ${checkpoint_path} \
        --sample-n-frames 129 \
        --seed 128 \
        --image-size 704 \
        --cfg-scale 7.5 \
        --infer-steps 50 \
        --use-deepcache 1 \
        --flow-shift-eval-video 5.0 \
        --save-path ${OUTPUT_BASEPATH} \
        --use-fp8 \
        --cpu-offload \
        --infer-min

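
The three inference commands above differ only in the launcher, the checkpoint file, one environment variable, and a few trailing flags. To show those differences side by side, the sketch below prints the command for a given mode instead of running it. The function name `avatar_cmd` and the mode names `multi`, `single`, and `offload` are illustrative assumptions; every flag and path is copied from the commands above. `PYTHONPATH` and `MODEL_BASE` still need to be exported as shown there.

```shell
# Hypothetical dry-run helper: print the inference command for one mode.
# Usage: avatar_cmd <multi|single|offload> <output-dir>
avatar_cmd() {
    local mode=$1
    local out=$2
    local ckpt_dir="./weights/ckpts/hunyuan-video-t2v-720p/transformers"
    # Flags shared by all three commands.
    local common="--input assets/test.csv --sample-n-frames 129 --seed 128 --image-size 704 --cfg-scale 7.5 --infer-steps 50 --use-deepcache 1 --flow-shift-eval-video 5.0 --save-path $out"
    case "$mode" in
        multi)
            echo "torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py --ckpt $ckpt_dir/mp_rank_00_model_states.pt $common"
            ;;
        single)
            echo "DISABLE_SP=1 CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py --ckpt $ckpt_dir/mp_rank_00_model_states_fp8.pt $common --use-fp8 --infer-min"
            ;;
        offload)
            echo "CPU_OFFLOAD=1 CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py --ckpt $ckpt_dir/mp_rank_00_model_states_fp8.pt $common --use-fp8 --cpu-offload --infer-min"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

# Print the low-VRAM variant for review before running it by hand.
avatar_cmd offload ./results-poor
```

Read this way, the two single-GPU variants both use the FP8 checkpoint with `--use-fp8` and `--infer-min`, and the low-VRAM variant adds `--cpu-offload` while swapping `DISABLE_SP=1` for `CPU_OFFLOAD=1`.
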
Launch Gradio service

    cd HunyuanVideo-Avatar
    bash ./scripts/run_gradio.sh