GraphGen: Build Knowledge Graphs to Generate Smarter Training Data

May 14 · Published in LLM Tooling

GraphGen automates the construction of knowledge graphs from unstructured text. It identifies specific areas where an LLM lacks proficiency and generates targeted question-answer (QA) pairs to bridge those gaps. The system prioritizes high-value, long-tail information over common knowledge, ensuring the resulting training data provides meaningful new insights for the model.

By sampling multi-hop neighborhoods, GraphGen captures relationships that span multiple sentences. Integrated style controls keep the generated QA pairs varied and natural. The result is a dataset designed to teach the model information it does not already possess.

A web-based demo is available for those who wish to test the functionality before deployment.

To start the interface, run python webui/app.py. You can upload any text source—such as agricultural manuals, healthcare records, or scientific papers—and provide your LLM API key. The tool then generates training datasets formatted for Llama Factory and other popular fine-tuning frameworks. Once the process is complete, the tool automatically cleans and sanitizes your data.

Run from PyPI

  1. Install GraphGen:

    pip install graphg
    
  2. Configure the environment variables for both the synthesizer model (which constructs the graph) and the trainee model (the one you intend to fine-tune):

    export SYNTHESIZER_MODEL=your_synthesizer_model_name
    export SYNTHESIZER_BASE_URL=your_base_url
    export SYNTHESIZER_API_KEY=your_api_key
    export TRAINEE_MODEL=your_trainee_model_name
    export TRAINEE_BASE_URL=your_base_url
    export TRAINEE_API_KEY=your_api_key
    

  3. Execute the following command:

    graphg --output_dir cache
    
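Before launching a run, it can help to confirm that all six variables from step 2 are actually set. The following is a minimal Python sketch; `missing_vars` is a hypothetical helper written for this article, not part of GraphGen.

```python
import os

# The six variables named in step 2 above.
REQUIRED_VARS = [
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
]


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All six variables are set.")
```

Passing a plain dictionary instead of reading `os.environ` makes the check easy to try without touching your shell.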

Run from Source

  1. Install the required dependencies:

    pip install -r requirements.txt
    
  2. Configure the environment by copying the example file:

    cp .env.example .env
    

    Edit the .env file and set the six variables listed in the PyPI section above.

  3. (Optional) Adjust the generation parameters in configs/graphgen_config.yaml.

  4. Execute the generation script:

    bash scripts/generate.sh
    
  5. Review the generated data:

    ls cache/data/graphgen
    

Run with Docker

  1. Build the Docker image:

    docker build -t graphgen .
    
  2. Launch the container:

    docker run -p 7860:7860 graphgen
    

How GraphGen Works

1. Knowledge Construction: The system parses source documents to extract entities and their corresponding relationships, forming a fine-grained knowledge graph.
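
To make the data structure concrete, here is a minimal sketch of a knowledge graph built from (head, relation, tail) triples. The triples are hand-written for illustration, whereas GraphGen relies on the synthesizer model to extract them. The `build_graph` helper and the dictionary layout are assumptions, not GraphGen's internal representation.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected adjacency map: entity -> {neighbor: relation}."""
    graph = defaultdict(dict)
    for head, relation, tail in triples:
        graph[head][tail] = relation
        graph[tail][head] = relation
    return dict(graph)


# Hand-written example triples (illustrative only).
triples = [
    ("rice blast", "caused_by", "Magnaporthe oryzae"),
    ("rice blast", "affects", "rice"),
    ("tricyclazole", "treats", "rice blast"),
]

graph = build_graph(triples)
print(sorted(graph["rice blast"]))
```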

2. Understanding Assessment: GraphGen evaluates the LLM's current understanding of the extracted graph. It uses Expected Calibration Error (ECE) to gauge where the model's confidence matches its correctness and where it does not. Nodes associated with high error are prioritized, since they point to facts the model has likely not learned.
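
For reference, ECE is commonly computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below implements that standard binned form. How GraphGen obtains per-statement confidences and aggregates them per node is not described in this article, so the inputs here are made-up values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin_size / n) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Made-up confidences and correctness flags for four statements.
confidences = [0.95, 0.92, 0.55, 0.65]
correct = [True, True, False, True]
print(round(expected_calibration_error(confidences, correct), 4))
```

A value near zero means the confidences track correctness closely; larger values indicate over- or under-confidence.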

3. Graph Organization: The extracted knowledge is organized into clusters. Various sampling strategies then determine which subgraphs are the most relevant to convert into training data.
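
One simple sampling strategy is a k-hop neighborhood: start from a seed node and collect everything reachable within k edges. The breadth-first sketch below illustrates that idea only. GraphGen's actual strategies and parameters are not specified in this article, and `k_hop_neighborhood` is a hypothetical helper.

```python
from collections import deque


def k_hop_neighborhood(adjacency, seed, k):
    """Return the set of nodes within k edges of seed (breadth-first search)."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Toy chain graph: A - B - C - D
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(sorted(k_hop_neighborhood(adjacency, "A", 2)))
```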

4. QA Generation

  • Atomic QA: Focuses on single-fact inquiries, such as "What is the boiling point of water at sea level?"
  • Aggregative QA: Requires the model to summarize, compare, or analyze multiple related facts.
  • Multi-hop QA: Generates complex questions that force the model to connect several disparate pieces of information to reach the correct answer.
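
Llama Factory accepts, among other formats, Alpaca-style records with `instruction`, `input`, and `output` fields. As a rough sketch, generated QA pairs could be mapped to that layout as follows. The `to_alpaca` helper and the shape of the input pairs are hypothetical, since GraphGen's own output schema is not shown in this article.

```python
import json


def to_alpaca(qa_pairs):
    """Map (question, answer) pairs to Alpaca-style dictionaries."""
    return [
        {"instruction": question, "input": "", "output": answer}
        for question, answer in qa_pairs
    ]


# Illustrative atomic QA pair.
qa_pairs = [
    ("What is the boiling point of water at sea level?", "100 °C (212 °F)."),
]

records = to_alpaca(qa_pairs)
print(json.dumps(records, ensure_ascii=False, indent=2))
```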