
Paper2Video: Transforming LaTeX Papers into AI-Generated Presentation Videos

Published October 12 in Academic Paper Tools

Paper2Video automates the creation of professional presentation videos directly from academic research. By processing a LaTeX source folder, a portrait photograph, and a brief audio sample, the integrated PaperTalker agent generates a complete presentation. The system produces optimized slides, synchronized subtitles, and a lifelike talking head, complete with a virtual cursor that tracks the narration in real time. Leveraging a suite of Large Language Models (LLMs) and Vision-Language Models (VLMs) for content synthesis, Paper2Video evaluates its output against a specialized benchmark designed to measure how effectively the video communicates a paper's core intellectual contributions rather than focusing solely on visual fidelity.

How PaperTalker Works

Inputs

The process is initiated with three core components:

LaTeX Source: The complete project directory of the paper, rather than a standalone PDF.
Reference Image: A square-format portrait used to generate the talking-head avatar.
Reference Audio: A ten-second speech sample used for voice cloning.

The final output is a comprehensive academic presentation video including slides, subtitles, an animated avatar, and a synchronized pointer.
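Before launching a run, it can help to check the three inputs against the constraints listed above. The following is a minimal sketch and not part of Paper2Video: the helper names (`png_size`, `wav_duration`, `check_inputs`) and the two-second tolerance on the audio length are assumptions made for illustration, and it only handles PNG portraits and WAV audio.

```python
import struct
import wave
from pathlib import Path


def png_size(path):
    """Return (width, height) read from a PNG file's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(24)
    is_png = header[:8] == b"\x89PNG\r\n\x1a\n" and header[12:16] == b"IHDR"
    if len(header) < 24 or not is_png:
        raise ValueError(f"{path} is not a PNG file")
    return struct.unpack(">II", header[16:24])


def wav_duration(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def check_inputs(latex_root, ref_img, ref_audio, target_s=10.0, tol_s=2.0):
    """Collect readable problems; an empty list means the inputs look fine."""
    problems = []
    if not any(Path(latex_root).rglob("*.tex")):
        problems.append("no .tex file found under the LaTeX root")
    width, height = png_size(ref_img)
    if width != height:
        problems.append(f"portrait is {width}x{height}, expected a square image")
    duration = wav_duration(ref_audio)
    if abs(duration - target_s) > tol_s:  # tolerance is an assumption
        problems.append(f"audio is {duration:.1f}s, expected about {target_s:.0f}s")
    return problems
```

Calling `check_inputs("/path/to/latex_project", "/path/to/portrait.png", "/path/to/voice.wav")` returns a list of messages that can be printed before any GPU time is spent.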

The Five-Step Pipeline

1. Slide Construction: PaperTalker parses the LaTeX source code. It utilizes a tree search algorithm to resolve common compilation issues—such as undefined control sequences or overfull boxes—while optimizing the layout to meet academic presentation standards.

2. Subtitle Generation: The system combines the slide content with spoken-delivery logic to produce subtitles timed to the generated narration. Users can supply an optional reference text to steer the linguistic style of the subtitles.

3. Cursor Alignment: Using WhisperX, the system aligns the spoken text with specific timestamps. VICRe then maps these timestamps to relevant slide regions, ensuring the cursor moves precisely to the elements being discussed.

4. Voice and Avatar: The reference audio is used to clone the speaker's voice, while the reference image drives the talking-head model. Currently, the pipeline supports Hallo2 for this animation stage.

5. Video Assembly: All components (slides, subtitles, cursor, avatar, and audio) are merged into a final video. To reduce runtime, slides are processed in parallel.
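The cursor alignment step (step 3) can be pictured as turning word-level timestamps into cursor keyframes. The sketch below is illustrative only: the input shapes (a list of `{"word", "start", "end"}` dicts, loosely modeled on what a forced aligner produces, plus one hypothetical `(x, y)` target per sentence) and the function name `cursor_keyframes` are assumptions, not PaperTalker's actual data structures.

```python
def cursor_keyframes(words, sentences):
    """Map each sentence to the time span of its words and a cursor target.

    words: list of {"word": str, "start": float, "end": float} in spoken order.
    sentences: list of {"n_words": int, "target": (x, y)} in the same order.
    Returns a list of {"start", "end", "x", "y"} keyframes.
    """
    keyframes = []
    i = 0
    for sent in sentences:
        n = sent["n_words"]
        if n < 1:
            raise ValueError("each sentence needs at least one word")
        span = words[i:i + n]
        if len(span) < n:
            raise ValueError("fewer aligned words than the sentences require")
        x, y = sent["target"]
        # The cursor rests on the target from the first word's start
        # to the last word's end.
        keyframes.append(
            {"start": span[0]["start"], "end": span[-1]["end"], "x": x, "y": y}
        )
        i += n
    return keyframes
```

Each keyframe says where the cursor should rest and for how long; a renderer would then interpolate between consecutive targets.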

Installation and Setup

The system requires two distinct Conda environments to prevent dependency conflicts: one for the primary pipeline and another specifically for the talking-head module.

Core Environment

cd src
conda create -n p2v python=3.10
conda activate p2v
pip install -r requirements.txt
git clone https://github.com/fudan-generative-vision/hallo2.git
git clone https://github.com/Paper2Poster/Paper2Poster.git

Talking-Head Environment

cd hallo2
conda create -n hallo python=3.10
conda activate hallo
pip install -r requirements.txt
which python   # Note this path for configuration

Configure LLM Access

Set up your API keys as environment variables. The system provides native support for models such as GPT-4.1 and Gemini 2.5-Pro, as well as local models like Qwen.

export GEMINI_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
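A quick way to confirm that the keys are visible to Python before a long run is sketched below. Only the two variable names come from the commands above; `missing_keys` is a hypothetical helper, and you can pass a shorter `required` tuple if you plan to use a single provider.

```python
import os


def missing_keys(required=("GEMINI_API_KEY", "OPENAI_API_KEY"), env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys():
        print(f"warning: {name} is not set")
```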

Run the Pipeline

The primary execution script is pipeline.py. The minimum recommended GPU is an NVIDIA A6000 with 48 GB of VRAM.

python pipeline.py \
  --model_name_t gpt-4.1 \
  --model_name_v gpt-4.1 \
  --model_name_talking hallo2 \
  --result_dir /path/to/output \
  --paper_latex_root /path/to/latex_project \
  --ref_img /path/to/portrait.png \
  --ref_audio /path/to/voice.wav \
  --talking_head_env /path/to/hallo_env/bin/python \
  --gpu_list "[0,1,2,3,4,5,6,7]"
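If you launch runs from a script, the same command can be assembled as an argument list. This is a sketch under assumptions: `build_pipeline_argv` is a hypothetical helper, the flag names are the ones shown in the command above, and the paths are placeholders.

```python
import shlex
import sys


def build_pipeline_argv(options, python=sys.executable, script="pipeline.py"):
    """Turn {"flag_name": value} into an argv list suitable for subprocess.run."""
    argv = [python, script]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


options = {
    "model_name_t": "gpt-4.1",
    "model_name_v": "gpt-4.1",
    "model_name_talking": "hallo2",
    "result_dir": "/path/to/output",
    "paper_latex_root": "/path/to/latex_project",
    "ref_img": "/path/to/portrait.png",
    "ref_audio": "/path/to/voice.wav",
    "talking_head_env": "/path/to/hallo_env/bin/python",
    "gpu_list": "[0,1]",
}
argv = build_pipeline_argv(options)
print(shlex.join(argv))  # inspect the command; pass argv to subprocess.run to execute it
```

Passing a list to `subprocess.run` bypasses the shell, so the brackets in the GPU list are never treated as a glob pattern.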

Key Arguments

Argument	Description
`--model_name_t`	LLM selected for text processing
`--model_name_v`	VLM selected for visual tasks
`--model_name_talking`	Avatar model (currently supporting `hallo2`)
`--result_dir`	Directory for slides, subtitles, and the final video
`--paper_latex_root`	Root directory of the LaTeX project
`--ref_img`	Path to the square portrait image
`--ref_audio`	Path to a ~10-second audio clip
`--ref_text`	Optional text to guide the subtitle style
`--beamer_templete_prompt`	Optional prompt to guide the slide aesthetics
`--gpu_list`	GPUs assigned for parallel cursor and avatar rendering
`--if_tree_search`	Toggles layout optimization (default: `True`)
`--stage`	Specifies pipeline stages to run (e.g., `[0]` for a full run)
`--talking_head_env`	Path to the Python binary in the hallo environment
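To illustrate what a GPU list can mean for parallel rendering, the sketch below parses a value such as `[0,1,2]` and spreads slides across the listed devices. The source does not describe how pipeline.py actually schedules work, so the round-robin policy and both function names are assumptions.

```python
import json


def parse_gpu_list(text):
    """Parse a string such as "[0,1,2]" into a list of GPU indices."""
    gpus = json.loads(text)
    valid = isinstance(gpus, list) and all(
        isinstance(g, int) and not isinstance(g, bool) for g in gpus
    )
    if not valid:
        raise ValueError(f"expected a list of integers, got {text!r}")
    return gpus


def assign_round_robin(n_slides, gpus):
    """Give slide i to gpus[i % len(gpus)]; returns {gpu: [slide indices]}."""
    if not gpus:
        raise ValueError("gpu list is empty")
    plan = {g: [] for g in gpus}
    for slide in range(n_slides):
        plan[gpus[slide % len(gpus)]].append(slide)
    return plan
```

With five slides and GPUs 0 and 1, slides 0, 2, and 4 land on GPU 0 and slides 1 and 3 on GPU 1.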

The Paper2Video Benchmark

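Standard video metrics such as Fréchet Video Distance (FVD) or Inception Score (IS) are poorly suited to academic content, because they score visual quality rather than communication. A successful research presentation is instead judged by how well the audience understands the paper and remembers the authors and their work. The benchmark therefore evaluates videos from both the viewer's and the creator's perspectives.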

Meta Similarity: Measures how closely the generated video aligns with human-authored presentations across audio and content dimensions.

PresentArena: Conducts a comparative analysis between AI-generated videos and human-crafted presentations to determine which best adheres to academic norms.

PresentQuiz: Generates multiple-choice questions based on the paper's content to measure the effectiveness of knowledge transfer to the viewer.

IP Memory: Uses pairs of questions focused on a paper's specific contributions to test if the video effectively credits the authors and contextualizes the research.
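Two of these metrics reduce to simple ratios once the model's answers or preferences have been collected. The sketch below shows the arithmetic only: `quiz_accuracy` and `win_rate` are hypothetical names, and the benchmark's own scripts may aggregate results differently.

```python
def quiz_accuracy(predicted, answer_key):
    """Fraction of multiple-choice questions answered correctly."""
    if not answer_key or len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, answer_key))
    return correct / len(answer_key)


def win_rate(preferences, label="generated"):
    """Share of pairwise comparisons in which `label` was preferred."""
    if not preferences:
        raise ValueError("no comparisons given")
    return sum(p == label for p in preferences) / len(preferences)
```

For example, answering three of four quiz questions correctly gives an accuracy of 0.75, and being preferred in two of four pairwise comparisons gives a win rate of 0.5.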

Running the Evaluation

Initialize a dedicated evaluation environment:

cd src/evaluation
conda create -n p2v_e python=3.10
conda activate p2v_e
pip install -r requirements.txt

Execute the evaluation scripts as needed:

# Meta Similarity
python MetaSim_audio.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save
python MetaSim_content.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save

# PresentArena
python PresentArena.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save

# PresentQuiz
cd PresentQuiz
python create_paper_questions.py --paper_folder /path/to/papers
python PresentQuiz.py --r /path/to/results --g /path/to/ground_truth --s /path/to/save

# IP Memory (assumes IPMemory sits next to PresentQuiz under src/evaluation)
cd ../IPMemory
python construct.py
python ip_qa.py

The benchmark dataset is available for download on Hugging Face.
