HunyuanVideo-Avatar is a multimodal diffusion Transformer model developed by Tencent Hunyuan. It generates dynamic, emotion-controllable videos of speaking characters, including multi-person dialogues. Given a character image and an audio track, it produces high-quality character animation synchronized with the audio.
The model utilizes a dedicated person image injection module to replace traditional overlay-based conditioning. This approach addresses the common mismatch between training and inference phases, ensuring that character movements remain natural while maintaining strict identity consistency throughout the video.
The Audio Emotion Module (AEM) extracts emotional cues from an emotion reference image and transfers them to the generated video. This gives fine-grained control over emotional style and keeps facial expressions consistent with the nuances of the audio track.
The Face-Aware Audio Adapter (FAA) employs a latent-level face mask to isolate specific audio-driven subjects. In scenes featuring multiple people, the FAA injects audio signals independently through cross-attention mechanisms. This allows for realistic, back-and-forth interactions between different characters within the same frame.
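To make the masking idea concrete, here is a minimal sketch in plain Python. It is not the model's actual code: it assumes single-head dot-product attention with no learned projections, treats each audio token as both key and value, and uses hypothetical names (`cross_attend`, `inject_audio`). The point it illustrates is that each character's audio only updates the latent tokens that the character's face mask marks as 1.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, audio_tokens):
    """Attend from one latent token (query) over one character's audio tokens.

    Simplification (assumption): audio tokens serve as both keys and values.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in audio_tokens]
    weights = softmax(scores)
    return [
        sum(w * tok[i] for w, tok in zip(weights, audio_tokens))
        for i in range(len(query))
    ]


def inject_audio(latents, face_masks, audio_tracks):
    """Add each character's audio signal only inside that character's face mask.

    latents:      list of latent token vectors
    face_masks:   face_masks[c][t] is 1 if token t belongs to character c's face
    audio_tracks: audio_tracks[c] is the list of audio token vectors for character c
    """
    out = [list(tok) for tok in latents]  # copy so the input stays untouched
    for mask, audio in zip(face_masks, audio_tracks):
        for t, tok in enumerate(latents):
            if mask[t]:
                update = cross_attend(tok, audio)
                out[t] = [o + u for o, u in zip(out[t], update)]
    return out


# Two characters, three latent tokens; token 2 is background and stays unchanged.
latents = [[0.5, 0.5], [0.0, 0.0], [1.0, 1.0]]
face_masks = [[1, 0, 0], [0, 1, 0]]
audio_tracks = [[[1.0, 0.0]], [[0.0, 2.0]]]
result = inject_audio(latents, face_masks, audio_tracks)
```

Because the masks do not overlap, the first audio track cannot influence the second character's tokens, which is the property that lets two speakers be driven independently in one frame.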
• Support for multi-style character images at any aspect ratio or resolution, including realistic photos, cartoons, 3D renders, and anthropomorphic designs.
• Compatible with portrait, half-body, and full-body shots to accommodate diverse production requirements.
• Transform a static character image into a highly dynamic video using only an audio file. Both the foreground and the background move in the output, which makes the result look more realistic.
• Control facial expressions directly through the audio input, so that the character's emotions match the tone and rhythm of the sound.
The Face-Aware Audio Adapter enables the model to drive multiple characters independently using separate audio tracks. This capability is ideal for creating fluid, natural animations for complex multi-person dialogue scenes.
E-commerce & Live Streaming: Create virtual hosts for product explainer videos to improve presentation quality and engagement.
Social Media Content Creation: Rapidly produce personalized animated videos tailored for short-form video platforms.
Film & Advertising Production: Streamline the animation of multi-character dialogue sequences, reducing the costs and logistical hurdles of live-action filming.
Education & Training: Develop interactive teaching animations to make knowledge transfer more engaging for learners.
GPU: NVIDIA GPU with CUDA support. 96GB of VRAM is recommended for optimal quality. A minimum of 24GB VRAM is required, though performance will be slower.
OS: Specifically tested and verified on Linux distributions.
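As a quick pre-flight check, the two VRAM thresholds above can be encoded in a small helper. This is only an illustrative sketch: `classify_vram` and its return labels are hypothetical and not part of the repository, and the function expects the card's nominal capacity in GB.

```python
MIN_VRAM_GB = 24          # minimum stated in the requirements above
RECOMMENDED_VRAM_GB = 96  # recommended for best quality, as stated above


def classify_vram(total_gb: float) -> str:
    """Map a GPU's nominal VRAM (in GB) to a rough support level."""
    if total_gb < MIN_VRAM_GB:
        return "unsupported"
    if total_gb < RECOMMENDED_VRAM_GB:
        return "supported-slow"
    return "recommended"
```

For example, `classify_vram(24)` falls in the "supported-slow" band, while a 96GB card is classified as "recommended".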
Clone the repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar.git
cd HunyuanVideo-Avatar
Create Conda environment
conda create -n HunyuanVideo-Avatar python==3.10.9
conda activate HunyuanVideo-Avatar
Install PyTorch and dependencies
For CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
Install pip dependencies
python -m pip install -r requirements.txt
Install Flash Attention V2 (optional, for performance optimization)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
Alternatively, use a prebuilt Docker image instead of the Conda setup
For CUDA 12.4
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2
For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2
Multi-GPU parallel inference (8 GPUs)
cd HunyuanVideo-Avatar
export PYTHONPATH=./
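export MODEL_BASE="./weights"
# Output directory; the name is an assumption, since the command below uses ${OUTPUT_BASEPATH} without setting it
OUTPUT_BASEPATH=./results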
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
--input 'assets/test.csv' \
--ckpt ${checkpoint_path} \
--sample-n-frames 129 \
--seed 128 \
--image-size 704 \
--cfg-scale 7.5 \
--infer-steps 50 \
--use-deepcache 1 \
--flow-shift-eval-video 5.0 \
--save-path ${OUTPUT_BASEPATH}
Single-GPU inference
cd HunyuanVideo-Avatar
export PYTHONPATH=./
export MODEL_BASE=./weights
OUTPUT_BASEPATH=./results-single
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt
export DISABLE_SP=1
CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
--input 'assets/test.csv' \
--ckpt ${checkpoint_path} \
--sample-n-frames 129 \
--seed 128 \
--image-size 704 \
--cfg-scale 7.5 \
--infer-steps 50 \
--use-deepcache 1 \
--flow-shift-eval-video 5.0 \
--save-path ${OUTPUT_BASEPATH} \
--use-fp8 \
--infer-min
Single-GPU inference with very low VRAM (CPU offload)
cd HunyuanVideo-Avatar
export PYTHONPATH=./
export MODEL_BASE=./weights
OUTPUT_BASEPATH=./results-poor
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt
export CPU_OFFLOAD=1
CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
--input 'assets/test.csv' \
--ckpt ${checkpoint_path} \
--sample-n-frames 129 \
--seed 128 \
--image-size 704 \
--cfg-scale 7.5 \
--infer-steps 50 \
--use-deepcache 1 \
--flow-shift-eval-video 5.0 \
--save-path ${OUTPUT_BASEPATH} \
--use-fp8 \
--cpu-offload \
--infer-min
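The two single-GPU commands above differ only in one environment variable (`DISABLE_SP` versus `CPU_OFFLOAD`) and in the `--cpu-offload` flag. The sketch below makes that difference explicit by building both variants from one function. `build_single_gpu_command` is a hypothetical helper, not part of the repository. Every flag and path is copied from the commands above, and nothing here has been validated against the scripts themselves.

```python
import shlex


def build_single_gpu_command(output_dir, cpu_offload=False, model_base="./weights"):
    """Return (env, argv) mirroring the single-GPU commands shown above."""
    ckpt = (
        f"{model_base}/ckpts/hunyuan-video-t2v-720p/transformers/"
        "mp_rank_00_model_states_fp8.pt"
    )
    env = {
        "PYTHONPATH": "./",
        "MODEL_BASE": model_base,
        "CUDA_VISIBLE_DEVICES": "0",
    }
    # The plain single-GPU run sets DISABLE_SP; the low-VRAM run sets CPU_OFFLOAD.
    env["CPU_OFFLOAD" if cpu_offload else "DISABLE_SP"] = "1"

    argv = [
        "python3", "hymm_sp/sample_gpu_poor.py",
        "--input", "assets/test.csv",
        "--ckpt", ckpt,
        "--sample-n-frames", "129",
        "--seed", "128",
        "--image-size", "704",
        "--cfg-scale", "7.5",
        "--infer-steps", "50",
        "--use-deepcache", "1",
        "--flow-shift-eval-video", "5.0",
        "--save-path", output_dir,
        "--use-fp8",
    ]
    if cpu_offload:
        argv.append("--cpu-offload")
    argv.append("--infer-min")
    return env, argv


env, argv = build_single_gpu_command("./results-poor", cpu_offload=True)
print(shlex.join(argv))  # shell-quoted command line for inspection
```

To actually launch the run from the repository root, the pair could be passed to `subprocess.run(argv, env={**os.environ, **env}, check=True)`.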
Run the Gradio demo
cd HunyuanVideo-Avatar
bash ./scripts/run_gradio.sh