GraphGen automates the construction of knowledge graphs from unstructured text. It identifies specific areas where an LLM lacks proficiency and generates targeted question-answer (QA) pairs to bridge those gaps. The system prioritizes high-value, long-tail information over common knowledge, so the resulting training data covers material the model has not yet learned.
By sampling multi-hop neighborhoods, GraphGen captures complex relationships that span multiple sentences. Integrated style controls keep the generated QA pairs varied and natural. The result is a dataset designed to teach the model information it does not already possess.
A web-based demo is available for those who wish to test the functionality before deployment.
To start the interface, run python webui/app.py. You can upload any text source—such as agricultural manuals, healthcare records, or scientific papers—and provide your LLM API key. The tool then generates training datasets formatted for Llama Factory and other popular fine-tuning frameworks. Once the process is complete, the tool automatically cleans and sanitizes your data.
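For orientation, the sketch below builds records in the Alpaca-style layout (instruction, input, output), which is one of the dataset layouts Llama Factory accepts. The helper name to_alpaca and the sample pair are hypothetical, and this text does not state which exact schema GraphGen writes, so treat the field layout as an assumption rather than the tool's actual output.

```python
import json
from typing import Dict, Iterable, List, Tuple


def to_alpaca(qa_pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Map (question, answer) pairs to Alpaca-style records.

    Assumption: the target framework reads the keys
    "instruction", "input", and "output".
    """
    return [
        {"instruction": question, "input": "", "output": answer}
        for question, answer in qa_pairs
    ]


if __name__ == "__main__":
    # Hypothetical sample pair, for illustration only.
    pairs = [
        (
            "What does GraphGen build from source text?",
            "A fine-grained knowledge graph of entities and relationships.",
        )
    ]
    print(json.dumps(to_alpaca(pairs), indent=2, ensure_ascii=False))
```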
Run from PyPI
Install GraphGen:
pip install graphg
Configure the environment variables for both the synthesizer model (which constructs the graph) and the trainee model (the one you intend to fine-tune):
export SYNTHESIZER_MODEL=your_synthesizer_model_name
export SYNTHESIZER_BASE_URL=your_base_url
export SYNTHESIZER_API_KEY=your_api_key
export TRAINEE_MODEL=your_trainee_model_name
export TRAINEE_BASE_URL=your_base_url
export TRAINEE_API_KEY=your_api_key
Execute the following command:
graphg --output_dir cache
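A missing or empty variable is an easy mistake at this step. The following pre-flight check is a minimal sketch, not part of GraphGen itself: the function name missing_variables is hypothetical, and the only thing taken from the instructions above is the list of six variable names.

```python
import os
from typing import List, Mapping

# The six variables named in the setup instructions.
REQUIRED_VARS = [
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
]


def missing_variables(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All six variables are set.")
```

Running this script in the same shell session as the export commands shows which names still need a value.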
Run from Source
Install the required dependencies:
pip install -r requirements.txt
Configure the environment by copying the example file:
cp .env.example .env
Edit the .env file and input the six variables listed in the PyPI section above.
(Optional) Adjust the generation parameters in configs/graphgen_config.yaml.
Execute the generation script:
bash scripts/generate.sh
Review the generated data:
ls cache/data/graphgen
Run with Docker
Build the Docker image:
docker build -t graphgen .
Launch the container:
docker run -p 7860:7860 graphgen
1. Knowledge Construction: The system parses source documents to extract entities and the relationships between them, forming a fine-grained knowledge graph.
2. Understanding Assessment: GraphGen evaluates how well the LLM already understands the extracted graph. It calculates the Expected Calibration Error (ECE), which measures the gap between the model's stated confidence and its actual accuracy, to identify what the model knows and where it is uncertain. Nodes associated with high error are prioritized, as these represent the facts the model lacks.
3. Graph Organization: The extracted knowledge is organized into clusters. Sampling strategies then determine which subgraphs are most relevant to convert into training data.
4. QA Generation
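Stages 1 through 3 above can be sketched in a few small functions. This is an illustration under stated assumptions, not GraphGen's implementation: all function names are hypothetical, the triples are assumed to be already extracted, and the per-node error scores are taken as given inputs because this text does not describe how GraphGen derives them. The ECE function uses the standard equal-width binning definition, and the neighborhood function is a plain breadth-first search.

```python
from collections import deque
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)
Graph = Dict[str, Set[str]]    # undirected adjacency between entities


def build_graph(triples: Iterable[Triple]) -> Graph:
    """Stage 1: turn extracted triples into an undirected adjacency map."""
    graph: Graph = {}
    for head, _relation, tail in triples:
        graph.setdefault(head, set()).add(tail)
        graph.setdefault(tail, set()).add(head)
    return graph


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Stage 2: ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def k_hop_neighborhood(graph: Graph, seed: str, max_hops: int) -> Set[str]:
    """Stage 3: collect every node within max_hops edges of seed (breadth-first)."""
    visited = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, set()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited


def pick_subgraphs(
    graph: Graph, node_error: Dict[str, float], max_hops: int = 1, top_n: int = 1
) -> List[Set[str]]:
    """Seed neighborhoods at the nodes with the highest (assumed) error scores."""
    seeds = sorted(node_error, key=lambda name: node_error[name], reverse=True)[:top_n]
    return [k_hop_neighborhood(graph, seed, max_hops) for seed in seeds]
```

Each returned node set is a candidate subgraph that a later step could turn into QA pairs. Raising max_hops widens the context each candidate carries, which is what allows questions that connect facts from different sentences.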