GraphGen automates the construction of knowledge graphs from unstructured text. It identifies specific areas where an LLM lacks proficiency and generates targeted question-answer (QA) pairs to bridge those gaps. The system prioritizes high-value, long-tail information over common knowledge, ensuring the resulting training data provides meaningful new insights for the model.
By sampling multi-hop neighborhoods, GraphGen captures relationships that span multiple sentences. Integrated style controls keep the generated QA pairs varied and natural. The result is a dataset designed to teach the model information it does not already possess.
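To make the idea of a multi-hop neighborhood concrete, here is a minimal sketch in Python. It is not GraphGen's own sampler: the function name k_hop_neighborhood, the adjacency-dictionary format, and the toy entities are all assumptions made for illustration. It collects every node reachable within k edges of a starting entity using breadth-first search.

```python
from collections import deque


def k_hop_neighborhood(adjacency, start, k):
    """Return the set of nodes within k edges of start (breadth-first search)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            # Do not expand past the hop limit.
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


# Toy undirected graph; the entity names are invented for this example.
toy_graph = {
    "rice": ["blast fungus", "paddy field"],
    "blast fungus": ["rice", "fungicide"],
    "paddy field": ["rice"],
    "fungicide": ["blast fungus", "resistance gene"],
    "resistance gene": ["fungicide"],
}

print(sorted(k_hop_neighborhood(toy_graph, "rice", 2)))
```

With k set to 2, the neighborhood of "rice" includes "fungicide" (two edges away) but not "resistance gene" (three edges away), which is the kind of cross-sentence link a single-hop view would miss.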
A web-based demo is available for those who wish to test the functionality before deployment.
To start the interface, run python webui/app.py. You can upload any text source (such as agricultural manuals, healthcare records, or scientific papers) and provide your LLM API key. The tool then generates training datasets formatted for Llama Factory and other popular fine-tuning frameworks. Once the process is complete, the tool automatically cleans and sanitizes your data.
Run from PyPI
Install GraphGen:
pip install graphg
Configure the environment variables for both the synthesizer model (which constructs the graph) and the trainee model (the one you intend to fine-tune):
export SYNTHESIZER_MODEL=your_synthesizer_model_name
export SYNTHESIZER_BASE_URL=your_base_url
export SYNTHESIZER_API_KEY=your_api_key
export TRAINEE_MODEL=your_trainee_model_name
export TRAINEE_BASE_URL=your_base_url
export TRAINEE_API_KEY=your_api_key
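Before launching a run, it can help to confirm that all six variables are actually set. The following is a small sketch in Python, not part of GraphGen itself; the helper name missing_vars is hypothetical, and the variable names are the ones listed above.

```python
import os

# The six variables named in the configuration step above.
REQUIRED_VARS = [
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
]


def missing_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; GraphGen does not ship it.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All six variables are set.")
```

Passing a plain dictionary instead of relying on os.environ makes the check easy to try without touching the real shell environment.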
Execute the following command:
graphg --output_dir cache
Run from Source
Install the required dependencies:
pip install -r requirements.txt
Configure the environment by copying the example file:
cp .env.example .env
Edit the .env file and set the six variables listed in the PyPI section above.
(Optional) Adjust the generation parameters in configs/graphgen_config.yaml.
Execute the generation script:
bash scripts/generate.sh
Review the generated data:
ls cache/data/graphgen
Run with Docker
Build the Docker image:
docker build -t graphgen .
Launch the container:
docker run -p 7860:7860 graphgen
1. Knowledge Construction: The system parses source documents to extract entities and the relationships between them, forming a fine-grained knowledge graph.
2. Understanding Assessment: GraphGen evaluates the LLM's current understanding of the extracted graph. It uses the Expected Calibration Error (ECE), which measures the gap between a model's confidence and its actual accuracy, to identify what the model knows and where it is uncertain. Nodes associated with high error are prioritized, since these represent the facts the model is most likely to lack.
3. Graph Organization: The extracted knowledge is organized into clusters. Sampling strategies then determine which subgraphs are most relevant to convert into training data.
4. QA Generation
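Returning to the understanding assessment in step 2, the sketch below shows the standard binned definition of Expected Calibration Error in Python. The source does not say how GraphGen obtains confidences or how it bins them, so the function name, the equal-width bins, and the default of 10 bins are assumptions for illustration only. Predictions are grouped by confidence, and each bin contributes the absolute gap between its accuracy and its mean confidence, weighted by the bin's share of all samples.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was actually right.
    The equal-width binning is an assumption, not GraphGen's documented method.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# A model that claims 80% confidence but is right only half the time.
print(expected_calibration_error([0.8, 0.8], [True, False]))
```

A value near zero means the model's confidence matches how often it is right; a larger value signals over- or under-confidence, which is the signal used to decide which facts deserve new training examples.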