HiChunk Review: Smarter Chunking for RAG Pipelines

Published November 17 in RAG Tools

HiChunk addresses a persistent bottleneck in Retrieval-Augmented Generation (RAG): rigid, context-blind chunking. Standard approaches typically split documents at fixed intervals, a method that frequently severs critical sentences mid-thought. HiCBench, the evaluation framework included with this toolkit, quantifies the performance loss caused by such imprecise segmentation. By using richly annotated data to generate evidence-based question-answer pairs, it identifies specific bottlenecks that generic RAG benchmarks often overlook.
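
To make the failure mode concrete, here is a minimal sketch of fixed-interval splitting. It is not taken from HiChunk; the function names and the sample text are illustrative assumptions, and the chunk size is measured in characters purely for simplicity.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def count_severed(chunks: list[str]) -> int:
    """Count chunks (except the last) that do not end on sentence punctuation."""
    terminators = (".", "!", "?")
    return sum(1 for c in chunks[:-1] if not c.rstrip().endswith(terminators))


# Hypothetical sample text, used only to illustrate the problem.
sample = (
    "Chunking splits a document into pieces. "
    "Each piece is embedded and indexed. "
    "Retrieval then returns the closest pieces."
)
chunks = fixed_size_chunks(sample, 30)
for chunk in chunks:
    print(repr(chunk))
print("chunks cut mid-sentence:", count_severed(chunks))
```

The splitter never looks at the text, so boundaries fall wherever the character count dictates. That is the behavior a structure-aware chunker is meant to avoid.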

The core innovation lies in hierarchical chunking paired with an Auto-Merge retrieval algorithm. Rather than relying on static segment sizes, HiChunk dynamically adjusts semantic granularity. This allows for the precision of a small chunk without sacrificing the broader context of the surrounding section. It effectively prevents the common failure mode where a retriever identifies a keyword fragment but misses the essential explanatory text around it.
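
The general idea behind merging retrieved fragments into their parent section can be sketched as follows. This is a generic thresholded merge written for illustration, not HiChunk's actual Auto-Merge algorithm: the `Chunk` type, the `ratio` parameter, and the merge condition are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: Optional[str]
    text: str


def auto_merge(
    retrieved: list[Chunk],
    parents: dict[str, Chunk],
    children_of: dict[str, list[str]],
    ratio: float = 0.5,
) -> list[Chunk]:
    """Replace sibling chunks with their parent when enough of them were retrieved.

    `ratio` is a hypothetical threshold: the share of a parent's children
    that must appear in `retrieved` before the parent is returned instead.
    """
    hits: dict[str, set[str]] = defaultdict(set)
    for chunk in retrieved:
        if chunk.parent_id is not None:
            hits[chunk.parent_id].add(chunk.chunk_id)

    merged = {
        pid
        for pid, ids in hits.items()
        if pid in parents
        and children_of.get(pid)
        and len(ids) / len(children_of[pid]) >= ratio
    }

    result: list[Chunk] = []
    emitted: set[str] = set()
    for chunk in retrieved:  # keep the original ranking order
        if chunk.parent_id in merged:
            if chunk.parent_id not in emitted:
                result.append(parents[chunk.parent_id])
                emitted.add(chunk.parent_id)
        else:
            result.append(chunk)
    return result


# Two hypothetical sections: s1 has three children, s2 has four.
parents = {"s1": Chunk("s1", None, "section 1"), "s2": Chunk("s2", None, "section 2")}
children_of = {"s1": ["a", "b", "c"], "s2": ["d", "e", "f", "g"]}
retrieved = [Chunk("a", "s1", "..."), Chunk("d", "s2", "..."), Chunk("b", "s1", "...")]
result = auto_merge(retrieved, parents, children_of)
print([c.chunk_id for c in result])
```

In this example two of the three children of `s1` were retrieved, so the whole section is returned in their place, while the single hit from `s2` stays a small chunk.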

HiChunk provides a complete experimental pipeline. More than just a chunking tool, it includes scripts for data processing, model training via LLaMA-Factory, response generation, and final scoring. This creates a repeatable workflow for optimizing document understanding before moving to a production environment.

HiChunk Installation

To begin, clone the repository and configure a clean environment.

git clone https://github.com/TencentYoutuResearch/HiChunk.git
cd HiChunk

conda create -n HiChunk python=3.10
conda activate HiChunk

pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt_tab')"

Training Data Preparation

First, download the raw datasets (qasper, gov-report, and wiki-727k) and unzip them to a local directory.

Update the origin_data_path variable for each dataset in the process_train_data.ipynb notebook:

origin_data_path = 'path/to/qasper'
origin_data_path = 'path/to/gov-report'
origin_data_path = 'path/to/wiki_727'

Next, run build_train_data.py. This script outputs processed files to corpus/combined, which can be fed directly into LLaMA-Factory for training the HiChunk model:

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e . --no-build-isolation
pip install deepspeed==0.16.9

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
llamafactory-cli train ../HiChunk_train_config.yaml

QA Data Structure

Each entry within HiChunk/dataset/qas/{dataset}.txt adheres to the following schema:

{
  "input": "string, the question text",
  "answers": "list[str], all valid answer strings",
  "facts": "list[str], factual statements from the answers",
  "evidences": "list[str], source sentences tied to the question",
  "all_classes": "list[str], used by eval.py for subset metrics",
  "_id": "string, document identifier"
}
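
A quick sanity check on these files can catch malformed entries before a long run. The sketch below assumes each line of the file holds one JSON object (JSON Lines); the article does not state the on-disk layout, so adjust the parsing if the files are structured differently. `parse_qa_lines` and `REQUIRED_KEYS` are illustrative names, not part of HiChunk.

```python
import json

# Field names taken from the schema above.
REQUIRED_KEYS = {"input", "answers", "facts", "evidences", "all_classes", "_id"}


def parse_qa_lines(lines):
    """Parse QA entries, assuming one JSON object per non-empty line."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        entries.append(entry)
    return entries


# Hypothetical entry used only to exercise the parser.
sample_lines = [
    json.dumps({
        "input": "What does the paper propose?",
        "answers": ["A hierarchical chunker."],
        "facts": ["The paper proposes a hierarchical chunker."],
        "evidences": ["We propose a hierarchical chunker."],
        "all_classes": ["example"],
        "_id": "doc-0",
    }),
    "",
]
entries = parse_qa_lines(sample_lines)
print(len(entries), entries[0]["_id"])
```

To check a real file, pass an open file handle, for example `parse_qa_lines(open(path, encoding="utf-8"))`.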

Running Document Chunking

Run the provided shell scripts to compare segmentation methods. SemanticChunk and LumberChunk serve as baselines. LumberChunk expects MODEL_TYPE and DS_BASE_URL to be set, and HiChunk expects MODEL_PATH to point to the trained model.

# SemanticChunk
bash pipeline/chunking/SemanticChunk/semantic_chunking.sh

# LumberChunk
export MODEL_TYPE="Deepseek"
export DS_BASE_URL="http://{ip}:{port}"
bash pipeline/chunking/LumberChunk/lumber_chunking.sh

# HiChunk
export MODEL_PATH="path/to/HiChunk_model"
bash pipeline/chunking/HiChunk/hi_chunking.sh

# Inspect the output results
python pipeline/chunking/chunk_result_analysis.py

Building the Test Dataset

The mBGE.sh script builds the retrieval corpus. It takes two arguments: {chunk_type} and {chunk_size}. For SemanticChunk (SC), LumberChunk (LC), and HiChunk (HC), setting the size to a very high value (e.g., 100,000) bypasses secondary rule-based splitting so that the model's native boundaries are used. A smaller value, as in the HC 200 example, caps the size of the resulting sub-chunks.

bash pipeline/mBGE.sh C 200           # Fixed-size chunking, 200 characters
bash pipeline/mBGE.sh SC 100000       # Semantic chunking
bash pipeline/mBGE.sh LC 100000       # LumberChunk
bash pipeline/mBGE.sh HC 200          # HiChunk with 200-character max sub-chunks
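
The effect of the size argument can be illustrated with a small sketch of secondary splitting. This is not the logic inside mBGE.sh; it is an assumed, simplified version in which oversized chunks are re-split at sentence boundaries and size is measured in characters.

```python
import re


def split_oversized(chunks: list[str], max_size: int) -> list[str]:
    """Re-split any chunk longer than `max_size` characters at sentence boundaries.

    Chunks within the limit pass through untouched. A single sentence longer
    than `max_size` is kept whole rather than cut mid-sentence.
    """
    out: list[str] = []
    for chunk in chunks:
        if len(chunk) <= max_size:
            out.append(chunk)
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_size:
                out.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            out.append(current)
    return out


# Hypothetical chunker output.
model_chunks = [
    "A short chunk.",
    "First sentence here. Second sentence here. Third sentence here.",
]
print(split_oversized(model_chunks, 100000))  # boundaries left as produced
print(split_oversized(model_chunks, 45))      # second chunk is re-split
```

With a limit of 100000 nothing exceeds the cap, so the chunker's own boundaries survive. With a small limit the rule-based pass takes over for long chunks.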

Generating Responses

Start a vLLM server, then run pred.py against it. The last command enables Auto-Merge via the --auto_merge flag.

# Launch the model server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Generate model answers
python pred.py --model llama3.1-8b --data BgeM3/C200 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/SC100000 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/LC100000 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/HC200_L10 --token_num 4096 --port 8000
python pred.py --model llama3.1-8b --data BgeM3/HC200_L10 --token_num 4096 --auto_merge 1 --port 8000

Evaluation

Run eval.py on each output directory. Comparing the resulting scores shows how HiChunk, with and without Auto-Merge, performs against fixed-size chunking and the two baseline chunkers.

python eval.py --model llama3.1-8b --data BgeM3/C200_tk4096
python eval.py --model llama3.1-8b --data BgeM3/SC100000_tk4096
python eval.py --model llama3.1-8b --data BgeM3/LC100000_tk4096
python eval.py --model llama3.1-8b --data BgeM3/HC200_L10_tk4096
python eval.py --model llama3.1-8b --data BgeM3/HC200_L10_tk4096_AM1
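
The --data values passed to eval.py appear to follow from the pred.py arguments: the token budget is appended as _tk{token_num}, and _AM{n} is added when --auto_merge is set. The helper below encodes that pattern as inferred from the commands in this article only. It is an assumption rather than documented behavior, and how pred.py names its output for other flag combinations is not covered here.

```python
from typing import Optional


def eval_data_name(pred_data: str, token_num: int, auto_merge: Optional[int] = None) -> str:
    """Derive the eval.py --data value from pred.py arguments (inferred naming)."""
    name = f"{pred_data}_tk{token_num}"
    if auto_merge is not None:
        name += f"_AM{auto_merge}"
    return name


def eval_command(model: str, pred_data: str, token_num: int,
                 auto_merge: Optional[int] = None) -> list[str]:
    """Build the eval.py invocation as an argument list (not executed here)."""
    return [
        "python", "eval.py",
        "--model", model,
        "--data", eval_data_name(pred_data, token_num, auto_merge),
    ]


# The five runs used in this article.
runs = [
    ("BgeM3/C200", None),
    ("BgeM3/SC100000", None),
    ("BgeM3/LC100000", None),
    ("BgeM3/HC200_L10", None),
    ("BgeM3/HC200_L10", 1),
]
for data, am in runs:
    print(" ".join(eval_command("llama3.1-8b", data, 4096, am)))
```

Generating the commands from one list keeps the prediction and evaluation steps in sync when more chunking configurations are added.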
