Paper2Video automates the creation of presentation videos directly from academic papers. Given a LaTeX source folder, a portrait photograph, and a brief audio sample, the integrated PaperTalker agent generates a complete presentation: optimized slides, synchronized subtitles, and a lifelike talking head, together with a virtual cursor that follows the narration. The pipeline relies on Large Language Models (LLMs) and Vision-Language Models (VLMs) for content synthesis. Its output is evaluated against a dedicated benchmark that measures how effectively the video communicates a paper's core intellectual contributions, rather than visual fidelity alone.
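The process starts from three inputs: the paper's LaTeX source folder, a reference portrait image, and a short reference audio clip.
The final output is a complete academic presentation video with slides, subtitles, an animated avatar, and a synchronized pointer. It is produced in five stages: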
1. Slide Construction: PaperTalker parses the LaTeX source code. It uses a tree search algorithm to resolve common compilation issues (such as undefined control sequences or overfull boxes) while optimizing the layout to meet academic presentation standards.
2. Subtitle Generation: The system combines slide content with speech logic to produce subtitles timed to the generated narration. Users can optionally provide a reference text to steer the linguistic style of the subtitles.
3. Cursor Alignment: Using WhisperX, the system aligns the spoken text with specific timestamps. VICRe then maps these timestamps to relevant slide regions, ensuring the cursor moves precisely to the elements being discussed.
4. Voice and Avatar: The reference audio is used to clone the speaker's voice, while the reference image drives the talking-head model. Currently, the pipeline supports Hallo2 for this animation stage.
5. Video Assembly: All components (slides, subtitles, cursor, avatar, and audio) are merged into a final video. To reduce runtime, slides are processed in parallel.
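The cursor alignment in step 3 can be illustrated with a minimal Python sketch. It assumes word-level timestamps of the kind a forced aligner such as WhisperX produces (a word with start and end times in seconds) and one target coordinate per spoken sentence. `Word`, `cursor_keyframes`, `cursor_at`, and the pixel-coordinate format are hypothetical names chosen for illustration, not part of the Paper2Video code.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the slide's narration
    end: float    # seconds


def cursor_keyframes(sentences, targets):
    """Turn aligned sentences into (start, end, x, y) cursor keyframes.

    sentences: list of sentences, each a list of Word objects.
    targets:   one (x, y) slide coordinate per sentence (assumed format).
    """
    if len(sentences) != len(targets):
        raise ValueError("need exactly one target per sentence")
    frames = []
    for words, (x, y) in zip(sentences, targets):
        if not words:
            continue  # nothing was spoken, so no keyframe is needed
        frames.append((words[0].start, words[-1].end, x, y))
    return frames


def cursor_at(frames, t):
    """Cursor position at time t; it holds the last target between sentences."""
    position = None
    for start, _end, x, y in frames:
        if t < start:
            break
        position = (x, y)
    return position
```

Holding the previous position during pauses is a design choice of this sketch; it keeps the pointer from jumping while nothing is being said.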
The system requires two distinct Conda environments to prevent dependency conflicts: one for the primary pipeline and another specifically for the talking-head module.
cd src
conda create -n p2v python=3.10
conda activate p2v
pip install -r requirements.txt
git clone https://github.com/fudan-generative-vision/hallo2.git
git clone https://github.com/Paper2Poster/Paper2Poster.git
cd hallo2
conda create -n hallo python=3.10
conda activate hallo
pip install -r requirements.txt
which python # Note this path for configuration
Set your API keys as environment variables. The system supports hosted models such as GPT-4.1 and Gemini 2.5 Pro, as well as local models like Qwen.
export GEMINI_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
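A missing key typically surfaces only once a model is first called, so it can help to check the environment before launching a long run. The following is a minimal sketch; the mapping from model-name prefix to environment variable is an assumption made for illustration, and `missing_keys` is a hypothetical helper, not part of Paper2Video.

```python
import os

# Assumed mapping from model-name prefix to the variable it needs.
REQUIRED_KEYS = {"gpt": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def missing_keys(model_names, env=None):
    """Return the API-key variables that are unset for the given models."""
    env = os.environ if env is None else env
    missing = []
    for name in model_names:
        for prefix, variable in REQUIRED_KEYS.items():
            needs_key = name.lower().startswith(prefix)
            if needs_key and not env.get(variable) and variable not in missing:
                missing.append(variable)
    return missing
```

Local models that match no prefix require no key in this sketch, so `missing_keys(["qwen"])` returns an empty list.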
The primary execution script is pipeline.py. The recommended minimum hardware is an NVIDIA A6000 with 48 GB of VRAM.
python pipeline.py \
--model_name_t gpt-4.1 \
--model_name_v gpt-4.1 \
--model_name_talking hallo2 \
--result_dir /path/to/output \
--paper_latex_root /path/to/latex_project \
--ref_img /path/to/portrait.png \
--ref_audio /path/to/voice.wav \
--talking_head_env /path/to/hallo_env/bin/python \
--gpu_list [0,1,2,3,4,5,6,7]
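When the pipeline is driven from another script, the same invocation can be assembled as an argument list. This is a sketch under stated assumptions: `build_command` and `run_pipeline` are hypothetical helpers, the flags are exactly those shown in the command above, and `sys.executable` is assumed to be the interpreter of the p2v environment.

```python
import subprocess
import sys


def build_command(result_dir, latex_root, ref_img, ref_audio,
                  talking_head_python, gpus,
                  text_model="gpt-4.1", vision_model="gpt-4.1"):
    """Assemble the pipeline.py argument list from the command above."""
    # The GPU list is passed as one bracketed token, e.g. "[0,1,2]".
    gpu_arg = "[" + ",".join(str(g) for g in gpus) + "]"
    return [
        sys.executable, "pipeline.py",
        "--model_name_t", text_model,
        "--model_name_v", vision_model,
        "--model_name_talking", "hallo2",
        "--result_dir", result_dir,
        "--paper_latex_root", latex_root,
        "--ref_img", ref_img,
        "--ref_audio", ref_audio,
        "--talking_head_env", talking_head_python,
        "--gpu_list", gpu_arg,
    ]


def run_pipeline(command, workdir="src"):
    """Run from the directory that holds pipeline.py (assumed to be src)."""
    subprocess.run(command, cwd=workdir, check=True)
```

Passing the arguments as a list bypasses the shell, which matters because some shells treat square brackets as glob patterns and may reject an unquoted `[0,1,2]`.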
Key Arguments
| Argument | Description |
|---|---|
| --model_name_t | LLM used for text processing |
| --model_name_v | VLM used for visual tasks |
| --model_name_talking | Avatar model (currently supports hallo2) |
| --result_dir | Directory for slides, subtitles, and the final video |
| --paper_latex_root | Root directory of the LaTeX project |
| --ref_img | Path to the square portrait image |
| --ref_audio | Path to a ~10-second audio clip |
| --ref_text | Optional text to guide the subtitle style |
| --beamer_templete_prompt | Optional prompt to guide the slide aesthetics |
| --gpu_list | GPUs assigned for parallel cursor and avatar rendering |
| --if_tree_search | Toggles layout optimization (default: True) |
| --stage | Pipeline stages to run (e.g., [0] for a full run) |
| --talking_head_env | Path to the Python binary in the hallo environment |
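To make the role of --gpu_list concrete, the sketch below distributes slides across the listed GPUs round-robin and renders them with one worker per GPU. This is an illustration of the general idea only: how Paper2Video actually schedules work is not described here, and `assign_slides`, `render_all`, and `render_fn` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_slides(num_slides, gpus):
    """Round-robin plan: GPU id -> list of slide indices it will render."""
    if not gpus:
        raise ValueError("gpu_list must not be empty")
    plan = {gpu: [] for gpu in gpus}
    for index in range(num_slides):
        plan[gpus[index % len(gpus)]].append(index)
    return plan


def render_all(num_slides, gpus, render_fn):
    """Call render_fn(slide_index, gpu_id) for every slide, one worker per GPU.

    Results are returned in slide order regardless of which GPU finished first.
    """
    plan = assign_slides(num_slides, gpus)

    def worker(gpu):
        # Each worker handles its own slides sequentially on its GPU.
        return [(index, render_fn(index, gpu)) for index in plan[gpu]]

    with ThreadPoolExecutor(max_workers=len(gpus)) as pool:
        chunks = list(pool.map(worker, gpus))
    pairs = [pair for chunk in chunks for pair in chunk]
    return [result for _index, result in sorted(pairs, key=lambda p: p[0])]
```

Threads are enough for this sketch because a real `render_fn` would spend its time in a GPU subprocess; a process pool would be the alternative if the rendering ran in Python itself.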
Standard video metrics such as FVD (Fréchet Video Distance) or IS (Inception Score) say little about academic content. A successful research presentation is defined by audience comprehension and author attribution, so this benchmark evaluates performance from both the viewer's and the creator's perspective.
Meta Similarity: Measures how closely the generated video aligns with human-authored presentations across audio and content dimensions.
PresentArena: Compares AI-generated videos with human-made presentations to determine which better adheres to academic norms.
PresentQuiz: Generates multiple-choice questions based on the paper's content to measure the effectiveness of knowledge transfer to the viewer.
IP Memory: Uses pairs of questions focused on a paper's specific contributions to test whether the video effectively credits the authors and contextualizes the research.
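As an illustration of how a quiz-style metric such as PresentQuiz can be scored, the sketch below computes plain accuracy over multiple-choice answers. The exact scoring used by the benchmark is not described here, so plain accuracy is an assumption, and `quiz_accuracy` and the question-id format are hypothetical.

```python
def quiz_accuracy(answer_key, responses):
    """Fraction of questions answered correctly.

    answer_key: question id -> correct option, e.g. {"q1": "A"}.
    responses:  question id -> option chosen after watching the video.
    Unanswered questions count as wrong.
    """
    if not answer_key:
        raise ValueError("answer_key must contain at least one question")
    correct = sum(
        1 for question, answer in answer_key.items()
        if responses.get(question) == answer
    )
    return correct / len(answer_key)
```

Scoring the same questions for a generated video and for the human-made one gives two numbers that can be compared directly.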
Initialize a dedicated evaluation environment:
cd src/evaluation
conda create -n p2v_e python=3.10
conda activate p2v_e
pip install -r requirements.txt
Execute the evaluation scripts as needed:
# Meta Similarity
python MetaSim_audio.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save
python MetaSim_content.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save
# PresentArena
python PresentArena.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save
# PresentQuiz
cd PresentQuiz
python create_paper_questions.py --paper_folder /path/to/papers
python PresentQuiz.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save
# IP Memory (assumes IPMemory sits next to PresentQuiz under src/evaluation)
cd ../IPMemory
python construct.py
python ip_qa.py
The benchmark dataset is available for download on Hugging Face.