Graph-Code: Query Your Codebase via Natural Language with LLM-Powered RAG

June 27 · Published in RAG Tools

Graph-Code is a multilingual, graph-based Retrieval-Augmented Generation (RAG) system that makes complex codebases searchable through natural language. It uses Tree-sitter to parse source files into Abstract Syntax Trees (ASTs) for repositories written in Python, JavaScript, TypeScript, Rust, Go, Scala, and Java. Because the parser is language-agnostic, it extracts structural data, functional relationships, and external dependencies consistently and stores them in a Memgraph database as a unified knowledge graph.

By integrating Google Gemini or local models via Ollama, Graph-Code translates natural language questions into precise Cypher queries. This allows developers to explore internal code relationships intuitively, retrieve specific source snippets, and navigate nested logic without manual grep-ing or constant context switching. The system maintains a consistent schema across all supported languages, ensuring a uniform experience regardless of the stack.

Core Features

Extensive Language Support: Compatible with Python, JavaScript, TypeScript, Rust, Go, Scala, and Java repositories.

Tree-sitter Parsing: Utilizes reliable, language-agnostic AST extraction to map code structure.

Knowledge Graph Storage: Stores code elements and their interconnections as nodes and edges within Memgraph.

Natural Language Interface: Allows users to interrogate a codebase using plain English queries.

Automated Cypher Generation: Converts English prompts into graph queries using cloud models (Google Gemini) or local alternatives (Ollama).

Source Snippet Retrieval: Directly fetches the implementation code for identified functions, methods, or classes.

Dependency Mapping: Analyzes pyproject.toml and similar files to map external package dependencies.

Nested Logic Handling: Accurately represents complex class hierarchies and nested function definitions.

Unified Schema: All supported languages are mapped to a standardized graph model.

Architecture

The system is divided into two primary modules:

  • Multilingual Parser: A Tree-sitter–based engine that scans repositories and populates the Memgraph database.
  • RAG System (codebase_rag/): An interactive command-line interface for querying the knowledge graph.

Core Components

Tree-sitter Integration: Performs language-agnostic parsing through dedicated grammars.

Graph Database: Memgraph serves as the storage layer for code nodes and their relationships.

LLM Integration: Supports Google Gemini for cloud-based processing and Ollama for local execution.

Code Analysis Engine: Traverses each file's AST to identify code elements (functions, classes, methods) and link them to one another, using the same approach for every supported language.

Query Utilities: Specialized tools for executing graph searches and retrieving source code.

Configurable Mappings: Language-specific parsing rules are fully configurable.
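
To make the traversal idea concrete, here is a minimal sketch that walks a syntax tree and records classes, functions, and methods, including nested ones. It deliberately swaps in Python's built-in `ast` module instead of Tree-sitter, so it only handles Python source. The `CodeNode` type and the dotted qualified-name scheme are assumptions for illustration, not the project's real data model.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeNode:
    label: str           # "Class", "Function", or "Method" (labels from the schema)
    qualified_name: str  # dotted path; this naming scheme is an assumption


def extract_nodes(source: str, module_name: str) -> list[CodeNode]:
    """Walk a Python AST and record classes, functions, and methods."""
    found: list[CodeNode] = []

    def visit(node: ast.AST, prefix: str, inside_class: bool) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                name = f"{prefix}.{child.name}"
                found.append(CodeNode("Class", name))
                visit(child, name, inside_class=True)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                name = f"{prefix}.{child.name}"
                label = "Method" if inside_class else "Function"
                found.append(CodeNode(label, name))
                # A function nested inside a method is a plain function.
                visit(child, name, inside_class=False)
            else:
                # Descend through if/try/with blocks without changing scope.
                visit(child, prefix, inside_class)

    visit(ast.parse(source), module_name, inside_class=False)
    return found


SAMPLE = '''
class User:
    def save(self):
        def _validate():
            pass

def connect():
    pass
'''

for code_node in extract_nodes(SAMPLE, "app.models"):
    print(code_node.label, code_node.qualified_name)
```

A Tree-sitter version would follow the same recursive shape, but would match on grammar node types such as `function_definition` and `class_definition` rather than on `ast` classes.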

Requirements

  • Python 3.12 or higher.
  • Docker and Docker Compose (to run Memgraph).
  • A Google Gemini API key (for cloud-based models).
  • Ollama installed and running (for local models).
  • The uv package manager for dependency handling.

Installation

  1. Clone the repository:

    git clone https://github.com/vitali87/code-graph-rag.git
    cd code-graph-rag
    
  2. Install dependencies:

    • For standard Python support:

      uv sync
      
    • For full multilingual support:

      uv sync --extra treesitter-full
      

    This installs Tree-sitter grammars for: Python (.py), JavaScript (.js, .jsx), TypeScript (.ts, .tsx), Rust (.rs), Go (.go), Scala (.scala, .sc), and Java (.java).

  3. Configure environment variables:

    cp .env.example .env
    # Edit .env with your specific configuration
    
  4. Model Configuration Options:

    • Option 1: Cloud Model (Gemini)

      # .env file
      LLM_PROVIDER=gemini
      GEMINI_API_KEY=your_gemini_api_key_here
      

      You can obtain a free API key from Google AI Studio.

    • Option 2: Local Model (Ollama)

      # .env file
      LLM_PROVIDER=local
      LOCAL_MODEL_ENDPOINT=http://localhost:11434/v1
      LOCAL_ORCHESTRATOR_MODEL_ID=llama3
      LOCAL_CYPHER_MODEL_ID=llama3
      LOCAL_MODEL_API_KEY=ollama
      

      To set up Ollama:

      # Install Ollama (macOS/Linux)
      curl -fsSL https://ollama.ai/install.sh | sh
      
      # Download a model
      ollama pull llama3
      # Other options include: llama3.1, mistral, or codellama
      

      Local models ensure data privacy and eliminate API costs, though accuracy may vary compared to Gemini.

  5. Start Memgraph:

    docker-compose up -d
    

Usage

Step 1: Parse a Repository

Ingest a codebase into the knowledge graph.

  • To initialize a new graph:

    python -m codebase_rag.main --repo-path ../../ai-engineering-hub/video-rag-gemini --update-graph --clean
    
  • To add more repositories to the existing graph:

    python -m codebase_rag.main --repo-path /path/to/repo2 --update-graph
    python -m codebase_rag.main --repo-path /path/to/repo3 --update-graph
    

The system automatically identifies languages based on file extensions.

Step 2: Query the Codebase

Launch the interactive RAG command-line tool:

    python -m codebase_rag.main --repo-path /path/to/your/repo

Runtime Model Switching

You can toggle between cloud and local providers via CLI arguments:

  • Run with a local model:

    python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider local
    
  • Run with a cloud model:

    python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider gemini
    
  • Specify custom model IDs:

    # Using specific local models
    python -m codebase_rag.main --repo-path /path/to/your/repo \
      --llm-provider local \
      --orchestrator-model llama3.1 \
      --cypher-model codellama
    
    # Using specific Gemini models
    python -m codebase_rag.main --repo-path /path/to/your/repo \
      --llm-provider gemini \
      --orchestrator-model gemini-2.0-flash-thinking-exp-01-21 \
      --cypher-model gemini-2.5-flash-lite-preview-06-17
    
Available CLI Arguments

  • --llm-provider: Select either gemini or local.
  • --orchestrator-model: The model responsible for RAG orchestration.
  • --cypher-model: The model dedicated to generating Cypher queries.

Example Queries

  • "Show all classes that contain 'user' in their name."
  • "Find functions related to database operations."
  • "What methods does the User class have?"
  • "Show functions that handle authentication."
  • "List all TypeScript components."
  • "Find Rust structs and their methods."
  • "Show Go interfaces and their implementations."

Graph Schema

The knowledge graph is organized using specific node types and relationships.

Node Types

  • Project: The root node representing the repository.
  • Package: Represents language-specific packages (e.g., Python folders containing __init__.py).
  • Module: A single source file.
  • Class: Represents classes, structs, and enums.
  • Function: Used for standalone or module-level functions.
  • Method: Functions associated with a class or object.
  • Folder: A directory in the file system.
  • File: Any file within the repository (source code or otherwise).
  • ExternalPackage: A third-party dependency.

Language-Specific Mappings

  • Python: function_definition, class_definition
  • JavaScript/TypeScript: function_declaration, arrow_function, class_declaration
  • Rust: function_item, struct_item, enum_item, impl_item
  • Go: function_declaration, method_declaration, type_declaration
  • Scala: function_definition, class_definition, object_definition, trait_definition
  • Java: method_declaration, class_declaration, interface_declaration, enum_declaration

Relationships

  • CONTAINS_PACKAGE / CONTAINS_MODULE / CONTAINS_FILE / CONTAINS_FOLDER: Defines the file system hierarchy.
  • DEFINES: Indicates that a module contains a specific class or function.
  • DEFINES_METHOD: Indicates that a class contains a specific method.
  • DEPENDS_ON_EXTERNAL: Links the project to its external dependencies.
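
As an illustration of how a question such as "What methods does the User class have?" could map onto this schema, the sketch below builds a parameterized Cypher query from the Class and Method labels and the DEFINES_METHOD relationship. The `name` property on nodes is an assumption, as are the helper names. The query an LLM actually generates may look different. `run_query` uses pymgclient (listed among the project's dependencies) and needs a running Memgraph instance.

```python
def methods_of_class_query(class_name: str) -> tuple[str, dict[str, str]]:
    """Build a parameterized Cypher query listing the methods of a class.

    Assumption: Class and Method nodes expose a `name` property.
    """
    query = (
        "MATCH (c:Class)-[:DEFINES_METHOD]->(m:Method) "
        "WHERE c.name = $class_name "
        "RETURN m.name AS method "
        "ORDER BY method"
    )
    return query, {"class_name": class_name}


def run_query(query: str, params: dict[str, str],
              host: str = "localhost", port: int = 7687) -> list:
    """Execute a query against Memgraph via pymgclient and return all rows."""
    import mgclient  # imported lazily so the query builder works without the driver

    connection = mgclient.connect(host=host, port=port)
    cursor = connection.cursor()
    cursor.execute(query, params)
    return cursor.fetchall()


query, params = methods_of_class_query("User")
print(query)
print(params)
```

Passing the class name as a parameter rather than splicing it into the query string avoids quoting problems when names contain unusual characters.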

Configuration

System settings are managed through environment variables in the .env file.

Required Settings

  • LLM_PROVIDER: Set to "gemini" for cloud or "local" for local models.

Gemini (Cloud) Configuration

  • GEMINI_API_KEY: Required for Gemini access.
  • GEMINI_MODEL_ID: The primary orchestrator (Default: gemini-2.5-pro-preview-06-05).
  • MODEL_CYPHER_ID: The Cypher generation model (Default: gemini-2.5-flash-lite-preview-06-17).

Local Model Configuration

  • LOCAL_MODEL_ENDPOINT: Ollama API URL (Default: http://localhost:11434/v1).
  • LOCAL_ORCHESTRATOR_MODEL_ID: Orchestration model (Default: llama3).
  • LOCAL_CYPHER_MODEL_ID: Cypher generation model (Default: llama3).
  • LOCAL_MODEL_API_KEY: API key for local usage (Default: ollama).

Additional Settings

  • MEMGRAPH_HOST: Database hostname (Default: localhost).
  • MEMGRAPH_PORT: Database port (Default: 7687).
  • TARGET_REPO_PATH: The default repository path (Default: .).
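
The settings above can be summarized as a small loader that applies the documented defaults. This is a sketch only: the `Settings` dataclass and `load_settings` function are hypothetical names, and the validation rules are assumptions based on which variables the article marks as required.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    orchestrator_model: str
    cypher_model: str
    memgraph_host: str
    memgraph_port: int
    target_repo_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, using the documented defaults."""
    provider = env.get("LLM_PROVIDER", "")
    if provider not in ("gemini", "local"):
        raise ValueError('LLM_PROVIDER must be "gemini" or "local"')

    if provider == "gemini":
        if not env.get("GEMINI_API_KEY"):
            raise ValueError("GEMINI_API_KEY is required when LLM_PROVIDER=gemini")
        orchestrator = env.get("GEMINI_MODEL_ID", "gemini-2.5-pro-preview-06-05")
        cypher = env.get("MODEL_CYPHER_ID", "gemini-2.5-flash-lite-preview-06-17")
    else:
        orchestrator = env.get("LOCAL_ORCHESTRATOR_MODEL_ID", "llama3")
        cypher = env.get("LOCAL_CYPHER_MODEL_ID", "llama3")

    return Settings(
        llm_provider=provider,
        orchestrator_model=orchestrator,
        cypher_model=cypher,
        memgraph_host=env.get("MEMGRAPH_HOST", "localhost"),
        memgraph_port=int(env.get("MEMGRAPH_PORT", "7687")),
        target_repo_path=env.get("TARGET_REPO_PATH", "."),
    )


print(load_settings({"LLM_PROVIDER": "local"}))
```

Accepting a mapping instead of reading `os.environ` directly makes the loader easy to exercise with test values.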

Primary Dependencies

  • tree-sitter: The core parsing engine.
  • tree-sitter-{language}: Individual grammars for supported languages.
  • pydantic-ai: The framework used for AI agent orchestration.
  • pymgclient: Python client for Memgraph.
  • loguru: For structured logging.
  • python-dotenv: For managing environment variables.

Supported Languages and Capabilities

Language    Extensions   Functions  Classes/Structs    Modules  Package Detection
Python      .py          ✅         ✅                 ✅       __init__.py
JavaScript  .js, .jsx    ✅         ✅                 ✅       -
TypeScript  .ts, .tsx    ✅         ✅                 ✅       -
Rust        .rs          ✅         ✅ (struct/enum)   ✅       -
Go          .go          ✅         ✅ (struct)        ✅       -
Scala       .scala, .sc  ✅         ✅ (class/obj)     ✅       Package decl.
Java        .java        ✅         ✅ (class/intf)    ✅       Package decl.

Language-Specific Features

  • Python: Deep support for nested functions, methods, and complex package structures.
  • JavaScript/TypeScript: Supports standard function declarations, arrow functions, and ES6 classes.
  • Rust: Captures functions, structs, enums, implementation blocks, and associated functions.
  • Go: Maps functions, methods, type declarations, and struct definitions.
  • Scala: Compatible with Scala 3 syntax, including traits, objects, and case classes.
  • Java: Extracts methods, constructors, interfaces, enums, and annotations.

Installation Options

# Base Python support only
uv sync

# Full multilingual support (recommended)
uv sync --extra treesitter-full

# Add specific languages manually
uv add tree-sitter-python tree-sitter-rust tree-sitter-go

Language Configuration

The system is configuration-driven. Each language is defined in codebase_rag/language_config.py, which specifies file extensions, node types, and naming conventions. Adding support for a new language typically only requires updates to this configuration file.

Debugging

  • Memgraph Connection:
    • Ensure the container is active via docker-compose ps.
    • Confirm Memgraph is listening on port 7687.
  • Memgraph Lab:
    • Access the visual interface at http://localhost:3000 to inspect the graph directly.
  • Local Models:
    • Verify Ollama is running with ollama list.
    • Ensure your target model is downloaded via ollama pull llama3.
    • Test the API response with curl http://localhost:11434/v1/models.
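
The connectivity checks above can also be scripted. The following sketch only tests whether the default ports mentioned in this article accept TCP connections. It does not verify that the services behind them are healthy, and the helper names are hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default ports mentioned in this article.
SERVICES = {
    "Memgraph": ("localhost", 7687),
    "Memgraph Lab": ("localhost", 3000),
    "Ollama API": ("localhost", 11434),
}


def check_services() -> dict[str, bool]:
    """Map each service name to whether its port is currently reachable."""
    return {name: port_is_open(host, port) for name, (host, port) in SERVICES.items()}


for name, reachable in check_services().items():
    print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

If a port is reported as not reachable, the docker-compose and ollama commands listed above are the next place to look.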