HiChunk addresses a persistent bottleneck in Retrieval-Augmented Generation (RAG): rigid, context-blind chunking. Standard approaches typically split documents at fixed intervals, which frequently cuts sentences off mid-thought. HiCBench, the evaluation framework included with this toolkit, quantifies the performance lost to such imprecise segmentation. It uses richly annotated data to generate evidence-based question-answer pairs, exposing chunking-related weaknesses that generic RAG benchmarks often overlook.
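To make the failure mode concrete, here is a minimal sketch of fixed-interval splitting. The function name `fixed_size_chunks` and the sample text are illustrative assumptions and are not part of the HiChunk toolkit.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "Chunking matters. Retrieval depends on it."
chunks = fixed_size_chunks(sample, 25)

# The second sentence is cut in the middle of the word "Retrieval".
for chunk in chunks:
    print(repr(chunk))
```

Because the split point depends only on character count, a retriever that returns the second chunk alone receives a fragment that begins mid-word.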
The core innovation lies in hierarchical chunking paired with an Auto-Merge retrieval algorithm. Rather than relying on static segment sizes, HiChunk dynamically adjusts semantic granularity. This allows for the precision of a small chunk without sacrificing the broader context of the surrounding section. It effectively prevents the common failure mode where a retriever identifies a keyword fragment but misses the essential explanatory text around it.
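The general idea of merging retrieved fragments back into their parent section can be sketched as follows. This is a generic merge-to-parent heuristic under stated assumptions, not HiChunk's actual Auto-Merge algorithm: the function name `auto_merge`, the two lookup dictionaries, and the `threshold` parameter are all hypothetical.

```python
from collections import defaultdict


def auto_merge(retrieved, parent_of, children_of, threshold=0.5):
    """Replace retrieved child chunks with their parent when enough siblings were hit.

    retrieved:   chunk ids returned by the retriever, in rank order
    parent_of:   dict mapping child chunk id -> parent chunk id
    children_of: dict mapping parent chunk id -> list of child chunk ids
    threshold:   hypothetical fraction of siblings that triggers a merge
    """
    # Group the distinct retrieved children under their parents.
    hits = defaultdict(set)
    for cid in retrieved:
        parent = parent_of.get(cid)
        if parent is not None:
            hits[parent].add(cid)

    # A parent is merged when a large enough share of its children was retrieved.
    merged_parents = {
        parent for parent, kids in hits.items()
        if len(kids) / len(children_of[parent]) >= threshold
    }

    # Rebuild the ranked list, swapping merged children for their parent once.
    result = []
    for cid in retrieved:
        parent = parent_of.get(cid)
        item = parent if parent in merged_parents else cid
        if item not in result:
            result.append(item)
    return result


children_of = {"sec1": ["c1", "c2", "c3"], "sec2": ["c4", "c5", "c6", "c7"]}
parent_of = {c: p for p, kids in children_of.items() for c in kids}
merged = auto_merge(["c1", "c4", "c2"], parent_of, children_of)
print(merged)
```

In this toy run, two of the three children of `sec1` are retrieved, so they are replaced by the whole section, while the single hit in `sec2` stays a small chunk.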
HiChunk provides a complete experimental pipeline. More than just a chunking tool, it includes scripts for data processing, model training via LLaMA-Factory, response generation, and final scoring. This creates a repeatable workflow for optimizing document understanding before moving to a production environment.
To begin, clone the repository and configure a clean environment.
git clone https://github.com/TencentYoutuResearch/HiChunk.git
cd HiChunk
conda create -n HiChunk python=3.10
conda activate HiChunk
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt_tab')"
First, download the raw datasets: qasper, gov-report, and wiki-727k. Unzip them to your local directory.
Then set the origin_data_path variable in the process_train_data.ipynb notebook to the matching local path for each dataset:
origin_data_path = 'path/to/qasper'
origin_data_path = 'path/to/gov-report'
origin_data_path = 'path/to/wiki_727'
Next, run build_train_data.py. This script outputs processed files to corpus/combined, which can be fed directly into LLaMA-Factory for training the HiChunk model:
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e . --no-build-isolation
pip install deepspeed==0.16.9
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
llamafactory-cli train ../HiChunk_train_config.yaml
Each entry within HiChunk/dataset/qas/{dataset}.txt adheres to the following schema:
{
"input": "string, the question text",
"answers": "list[str], all valid answer strings",
"facts": "list[str], factual statements from the answers",
"evidences": "list[str], source sentences tied to the question",
"all_classes": "list[str], used by eval.py for subset metrics",
"_id": "string, document identifier"
}
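A quick way to sanity-check a QA file against this schema is sketched below. It assumes each line of the file holds one JSON object, which the schema above does not confirm; the helper name `load_qas` and the `REQUIRED_FIELDS` constant are hypothetical.

```python
import json

# Field names taken from the schema above.
REQUIRED_FIELDS = {"input", "answers", "facts", "evidences", "all_classes", "_id"}


def load_qas(path):
    """Read QA entries, assuming one JSON object per line (an assumption)."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            missing = REQUIRED_FIELDS - entry.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            entries.append(entry)
    return entries
```

Calling `load_qas` on a dataset file either returns the parsed entries or fails early with the line number of the first malformed record.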
Run the provided shell scripts to compare segmentation methods. SemanticChunk and LumberChunk serve as baselines. LumberChunk needs MODEL_TYPE and DS_BASE_URL set for its model endpoint, and HiChunk needs MODEL_PATH to point to the trained HiChunk model.
# SemanticChunk
bash pipeline/chunking/SemanticChunk/semantic_chunking.sh
# LumberChunk
export MODEL_TYPE="Deepseek"
export DS_BASE_URL="http://{ip}:{port}"
bash pipeline/chunking/LumberChunk/lumber_chunking.sh
# HiChunk
export MODEL_PATH="path/to/HiChunk_model"
bash pipeline/chunking/HiChunk/hi_chunking.sh
# Inspect the output results
python pipeline/chunking/chunk_result_analysis.py
The mBGE.sh script builds the retrieval corpus. It takes two arguments, {chunk_type} and {chunk_size}. For SemanticChunk (SC), LumberChunk (LC), and HiChunk (HC), a very large size (e.g., 100,000) bypasses the secondary rule-based splitting and keeps the model's native boundaries, while a smaller size such as 200 applies fixed-size splitting on top of them.
bash pipeline/mBGE.sh C 200 # Fixed-size chunking, 200 characters
bash pipeline/mBGE.sh SC 100000 # Semantic chunking
bash pipeline/mBGE.sh LC 100000 # LumberChunk
bash pipeline/mBGE.sh HC 200 # HiChunk with 200-character max sub-chunks
Start a vLLM server, then run pred.py against each corpus. The last command enables Auto-Merge through the --auto_merge flag.
# Launch the model server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Generate model answers
python pred.py --model llama3.1-8b --data BgeM3/C200 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/SC100000 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/LC100000 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/HC200_L10 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/HC200_L10 --token_num 4096 --auto_merge 1 --port 8000
Run eval.py on each output directory. Comparing the resulting scores shows how HiChunk with Auto-Merge performs against the static chunking baselines.
python eval.py --model llama3.1-8b --data BgeM3/C200_tk4096
python eval.py --model llama3.1-8b --data BgeM3/SC100000_tk4096
python eval.py --model llama3.1-8b --data BgeM3/LC100000_tk4096
python eval.py --model llama3.1-8b --data BgeM3/HC200_L10_tk4096
python eval.py --model llama3.1-8b --data BgeM3/HC200_L10_tk4096_AM1
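The eval.py targets above appear to follow a naming pattern derived from the pred.py arguments: the corpus name, then `_tk` plus the token budget, then `_AM1` when Auto-Merge is on. The sketch below encodes that inferred pattern to print the evaluation commands. The helper `output_dir` is hypothetical, and the pattern is an assumption read off the commands rather than documented behavior.

```python
def output_dir(data: str, token_num: int, auto_merge: int = 0) -> str:
    """Build the eval.py --data value from pred.py arguments (inferred pattern)."""
    name = f"{data}_tk{token_num}"
    if auto_merge:
        name += f"_AM{auto_merge}"
    return name


# (corpus name, auto_merge flag) pairs matching the runs above.
runs = [
    ("BgeM3/C200", 0),
    ("BgeM3/SC100000", 0),
    ("BgeM3/LC100000", 0),
    ("BgeM3/HC200_L10", 0),
    ("BgeM3/HC200_L10", 1),
]

for data, am in runs:
    print(f"python eval.py --model llama3.1-8b --data {output_dir(data, 4096, am)}")
```

Generating the commands from one list keeps the prediction and evaluation runs in sync when you add another chunking configuration.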