Graph-Code: Query Your Codebase via Natural Language with LLM-Powered RAG

Published June 27 in RAG Tools

Graph-Code is a multilingual, graph-based Retrieval-Augmented Generation (RAG) system designed to make complex codebases searchable through natural language. It uses Tree-sitter to parse repositories written in Python, JavaScript, TypeScript, Rust, Go, Scala, and Java into abstract syntax trees (ASTs). Because the parsing approach is language-agnostic, it can consistently extract structural data, functional relationships, and external dependencies, storing them in a Memgraph database as a unified knowledge graph.

By integrating Google Gemini or local models via Ollama, Graph-Code translates natural language questions into precise Cypher queries. This allows developers to explore internal code relationships intuitively, retrieve specific source snippets, and navigate nested logic without manual grep-ing or constant context switching. The system maintains a consistent schema across all supported languages, ensuring a uniform experience regardless of the stack.
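As a rough illustration of the translation step, a question such as "What methods does the User class have?" could plausibly map to a Cypher query like the one built below. The `Class` and `Method` labels and the `DEFINES_METHOD` relationship come from the Graph Schema section of this article. The `name` property, the helper function, and the exact query shape are assumptions, and the query an LLM actually generates may differ.

```python
def methods_of_class_query(class_name: str) -> tuple[str, dict[str, str]]:
    """Return a Cypher query and its parameters for listing a class's methods.

    Assumption: Class and Method nodes expose a `name` property.
    """
    query = (
        "MATCH (c:Class)-[:DEFINES_METHOD]->(m:Method) "
        "WHERE c.name = $class_name "
        "RETURN m.name AS method "
        "ORDER BY method"
    )
    # The class name travels as a query parameter instead of being
    # interpolated into the query text.
    return query, {"class_name": class_name}


if __name__ == "__main__":
    cypher, params = methods_of_class_query("User")
    print(cypher)
    print(params)
```

Passing the class name as a parameter keeps user input out of the query string, which matters when the question text comes from a free-form prompt.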

Core Features

Extensive Language Support: Compatible with Python, JavaScript, TypeScript, Rust, Go, Scala, and Java repositories.

Tree-sitter Parsing: Uses Tree-sitter grammars to extract ASTs the same way for every supported language.

Knowledge Graph Storage: Stores code elements and their interconnections as nodes and edges within Memgraph.

Natural Language Interface: Allows users to interrogate a codebase using plain English queries.

Automated Cypher Generation: Converts English prompts into graph queries using cloud models (Google Gemini) or local alternatives (Ollama).

Source Snippet Retrieval: Directly fetches the implementation code for identified functions, methods, or classes.

Dependency Mapping: Analyzes pyproject.toml and similar files to map external package dependencies.

Nested Logic Handling: Accurately represents complex class hierarchies and nested function definitions.

Unified Schema: All supported languages are mapped to a standardized graph model.

Architecture

The system is divided into two primary modules:

Multilingual Parser: A Tree-sitter–based engine that scans repositories and populates the Memgraph database.
RAG System (codebase_rag/): An interactive command-line interface for querying the knowledge graph.

Core Components

Tree-sitter Integration: Performs language-agnostic parsing through dedicated grammars.

Graph Database: Memgraph serves as the storage layer for code nodes and their relationships.

LLM Integration: Supports Google Gemini for cloud-based processing and Ollama for local execution.

Code Analysis Engine: Traverses each AST to identify code elements and link them, using the same approach for every supported language.

Query Utilities: Specialized tools for executing graph searches and retrieving source code.

Configurable Mappings: Language-specific parsing rules are fully configurable.
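To give a feel for what the analysis engine and the query utilities do, here is a minimal sketch that uses Python's built-in `ast` module as a stand-in for Tree-sitter. It walks a module, labels each definition as `Class`, `Function`, or `Method`, builds dotted names for nested definitions, and records line spans so a snippet can be fetched later. The `Definition` dataclass and both function names are hypothetical.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    label: str           # "Class", "Function", or "Method"
    qualified_name: str  # dotted path, e.g. "module.Outer.method"
    start_line: int      # 1-based, inclusive
    end_line: int        # 1-based, inclusive


def extract_definitions(source: str, module_name: str) -> list[Definition]:
    """Collect classes, functions, and methods, including nested ones."""
    found: list[Definition] = []

    def visit(node: ast.AST, prefix: str, inside_class: bool) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                name = f"{prefix}.{child.name}"
                found.append(Definition("Class", name, child.lineno, child.end_lineno))
                visit(child, name, inside_class=True)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                name = f"{prefix}.{child.name}"
                label = "Method" if inside_class else "Function"
                found.append(Definition(label, name, child.lineno, child.end_lineno))
                # Definitions nested in a function body count as functions.
                visit(child, name, inside_class=False)
            else:
                # Keep descending through statements such as `if` or `try`.
                visit(child, prefix, inside_class)

    visit(ast.parse(source), module_name, inside_class=False)
    return found


def get_snippet(source: str, definition: Definition) -> str:
    """Return the source lines covered by a definition."""
    lines = source.splitlines()
    return "\n".join(lines[definition.start_line - 1 : definition.end_line])


SAMPLE = '''\
class User:
    def greet(self):
        def helper():
            return "hi"
        return helper()

def main():
    return User().greet()
'''

if __name__ == "__main__":
    for d in extract_definitions(SAMPLE, "app"):
        print(d.label, d.qualified_name, d.start_line, d.end_line)
```

Storing line spans with each node is one simple way to support snippet retrieval without keeping full source text in the graph.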

Requirements

Python 3.12 or higher.
Docker and Docker Compose (to run Memgraph).
A Google Gemini API key (for cloud-based models).
Ollama installed and running (for local models).
The uv package manager for dependency handling.

Installation

Clone the repository:

```
git clone https://github.com/vitali87/code-graph-rag.git
cd code-graph-rag
```

Install dependencies:
- For standard Python support:
```
uv sync
```
- For full multilingual support:
```
uv sync --extra treesitter-full
```
This installs Tree-sitter grammars for: Python (.py), JavaScript (.js, .jsx), TypeScript (.ts, .tsx), Rust (.rs), Go (.go), Scala (.scala, .sc), and Java (.java).

Configure environment variables:

```
cp .env.example .env
# Edit .env with your specific configuration
```

Model Configuration Options:

Option 1: Cloud Model (Gemini)
```
# .env file
LLM_PROVIDER=gemini
GEMINI_API_KEY=your_gemini_api_key_here
```
You can obtain a free API key from Google AI Studio.

Option 2: Local Model (Ollama)

```
# .env file
LLM_PROVIDER=local
LOCAL_MODEL_ENDPOINT=http://localhost:11434/v1
LOCAL_ORCHESTRATOR_MODEL_ID=llama3
LOCAL_CYPHER_MODEL_ID=llama3
LOCAL_MODEL_API_KEY=ollama
```

To set up Ollama:

```
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Download a model
ollama pull llama3
# Other options include: llama3.1, mistral, or codellama
```

Local models keep your code on your own machine and avoid API costs, though query accuracy may not match Gemini's.

Start Memgraph:
```
docker-compose up -d
```

Usage

Step 1: Parse a Repository

Ingest a codebase into the knowledge graph.

To initialize a new graph:

```
python -m codebase_rag.main --repo-path ../../ai-engineering-hub/video-rag-gemini --update-graph --clean
```

To add more repositories to the existing graph:

```
python -m codebase_rag.main --repo-path /path/to/repo2 --update-graph
python -m codebase_rag.main --repo-path /path/to/repo3 --update-graph
```

The system automatically identifies languages based on file extensions.

Step 2: Query the Codebase

Launch the interactive RAG command-line tool:

```
python -m codebase_rag.main --repo-path /path/to/your/repo
```

Runtime Model Switching

You can toggle between cloud and local providers via CLI arguments:

Run with a local model:

```
python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider local
```

Run with a cloud model:

```
python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider gemini
```

Specify custom model IDs:

```
# Using specific local models
python -m codebase_rag.main --repo-path /path/to/your/repo \
  --llm-provider local \
  --orchestrator-model llama3.1 \
  --cypher-model codellama

# Using specific Gemini models
python -m codebase_rag.main --repo-path /path/to/your/repo \
  --llm-provider gemini \
  --orchestrator-model gemini-2.0-flash-thinking-exp-01-21 \
  --cypher-model gemini-2.5-flash-lite-preview-06-17
```

Available CLI Arguments

--llm-provider: Select either gemini or local.
--orchestrator-model: The model responsible for RAG orchestration.
--cypher-model: The model dedicated to generating Cypher queries.
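For readers who want to see how these flags fit together, the sketch below declares them with the standard-library `argparse`. The flag names match the commands shown in this article, but the real CLI may be built with a different library, and the defaults and help texts here are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags used in this article (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="codebase_rag.main")
    parser.add_argument("--repo-path", default=".",
                        help="Repository to parse or query.")
    parser.add_argument("--update-graph", action="store_true",
                        help="Parse the repository into the knowledge graph.")
    parser.add_argument("--clean", action="store_true",
                        help="Start from an empty graph before parsing.")
    parser.add_argument("--llm-provider", choices=["gemini", "local"],
                        default=None,
                        help="Override the LLM_PROVIDER setting.")
    parser.add_argument("--orchestrator-model", default=None,
                        help="Model ID for RAG orchestration.")
    parser.add_argument("--cypher-model", default=None,
                        help="Model ID for Cypher generation.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Leaving the three model flags at `None` by default lets a caller tell "not given on the command line" apart from an explicit value, so environment settings can act as the fallback.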

Example Queries

"Show all classes that contain 'user' in their name."
"Find functions related to database operations."
"What methods does the User class have?"
"Show functions that handle authentication."
"List all TypeScript components."
"Find Rust structs and their methods."
"Show Go interfaces and their implementations."

Graph Schema

The knowledge graph is organized using specific node types and relationships.

Node Types

Project: The root node representing the repository.
Package: Represents language-specific packages (e.g., Python folders containing __init__.py).
Module: A single source file.
Class: Represents classes, structs, and enums.
Function: Used for standalone or module-level functions.
Method: Functions associated with a class or object.
Folder: A directory in the file system.
File: Any file within the repository (source code or otherwise).
ExternalPackage: A third-party dependency.

Language-Specific Mappings

Python: function_definition, class_definition
JavaScript/TypeScript: function_declaration, arrow_function, class_declaration
Rust: function_item, struct_item, enum_item, impl_item
Go: function_declaration, method_declaration, type_declaration
Scala: function_definition, class_definition, object_definition, trait_definition
Java: method_declaration, class_declaration, interface_declaration, enum_declaration

Relationships

CONTAINS_PACKAGE / CONTAINS_MODULE / CONTAINS_FILE / CONTAINS_FOLDER: Defines the file system hierarchy.
DEFINES: Indicates that a module contains a specific class or function.
DEFINES_METHOD: Indicates that a class contains a specific method.
DEPENDS_ON_EXTERNAL: Links the project to its external dependencies.
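The containment relationships can be illustrated by walking a directory tree and emitting one triple per edge. This is only a sketch: the triple layout, the decision to treat `.py` files alone as modules, and the skipping of hidden entries are assumptions rather than Graph-Code's documented behavior.

```python
from pathlib import Path

SOURCE_SUFFIXES = {".py"}  # assumption: only Python modules in this sketch


def hierarchy_edges(repo_root: Path) -> list[tuple[str, str, str]]:
    """Return (parent, relationship, child) triples for a repository tree."""
    edges: list[tuple[str, str, str]] = []

    def walk(directory: Path, parent: str) -> None:
        for entry in sorted(directory.iterdir()):
            if entry.name.startswith("."):
                continue  # skip hidden entries such as .git
            rel = entry.relative_to(repo_root).as_posix()
            if entry.is_dir():
                # A folder with __init__.py is treated as a Python package.
                is_package = (entry / "__init__.py").is_file()
                kind = "CONTAINS_PACKAGE" if is_package else "CONTAINS_FOLDER"
                edges.append((parent, kind, rel))
                walk(entry, rel)
            else:
                edges.append((parent, "CONTAINS_FILE", rel))
                if entry.suffix in SOURCE_SUFFIXES:
                    edges.append((parent, "CONTAINS_MODULE", rel))

    # The project node is named after the repository's root folder.
    walk(repo_root, repo_root.name)
    return edges


if __name__ == "__main__":
    for edge in hierarchy_edges(Path(".").resolve()):
        print(edge)
```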

Configuration

System settings are managed through environment variables in the .env file.

Required Settings

LLM_PROVIDER: Set to "gemini" for cloud or "local" for local models.

Gemini (Cloud) Configuration

GEMINI_API_KEY: Required for Gemini access.
GEMINI_MODEL_ID: The primary orchestrator (Default: gemini-2.5-pro-preview-06-05).
MODEL_CYPHER_ID: The Cypher generation model (Default: gemini-2.5-flash-lite-preview-06-17).

Local Model Configuration

LOCAL_MODEL_ENDPOINT: Ollama API URL (Default: http://localhost:11434/v1).
LOCAL_ORCHESTRATOR_MODEL_ID: Orchestration model (Default: llama3).
LOCAL_CYPHER_MODEL_ID: Cypher generation model (Default: llama3).
LOCAL_MODEL_API_KEY: API key for local usage (Default: ollama).

Additional Settings

MEMGRAPH_HOST: Database hostname (Default: localhost).
MEMGRAPH_PORT: Database port (Default: 7687).
TARGET_REPO_PATH: The default repository path (Default: .).

Primary Dependencies

tree-sitter: The core parsing engine.
tree-sitter-{language}: Individual grammars for supported languages.
pydantic-ai: The framework used for AI agent orchestration.
pymgclient: Python client for Memgraph.
loguru: For structured logging.
python-dotenv: For managing environment variables.

Supported Languages and Capabilities

| Language | Extensions | Functions | Classes/Structs | Modules | Package Detection |
|---|---|---|---|---|---|
| Python | `.py` | ✅ | ✅ | ✅ | `__init__.py` |
| JavaScript | `.js`, `.jsx` | ✅ | ✅ | ✅ | - |
| TypeScript | `.ts`, `.tsx` | ✅ | ✅ | ✅ | - |
| Rust | `.rs` | ✅ | ✅ (struct/enum) | ✅ | - |
| Go | `.go` | ✅ | ✅ (struct) | ✅ | - |
| Scala | `.scala`, `.sc` | ✅ | ✅ (class/object) | ✅ | Package declaration |
| Java | `.java` | ✅ | ✅ (class/interface) | ✅ | Package declaration |

Language-Specific Features

Python: Deep support for nested functions, methods, and complex package structures.
JavaScript/TypeScript: Supports standard function declarations, arrow functions, and ES6 classes.
Rust: Captures functions, structs, enums, implementation blocks, and associated functions.
Go: Maps functions, methods, type declarations, and struct definitions.
Scala: Compatible with Scala 3 syntax, including traits, objects, and case classes.
Java: Extracts methods, constructors, interfaces, enums, and annotations.

Installation Options

```
# Base Python support only
uv sync

# Full multilingual support (recommended)
uv sync --extra treesitter-full

# Add specific languages manually
uv add tree-sitter-python tree-sitter-rust tree-sitter-go
```

Language Configuration

The system is configuration-driven. Each language is defined in codebase_rag/language_config.py, which specifies file extensions, node types, and naming conventions. Adding support for a new language typically only requires updates to this configuration file.

Debugging

Memgraph Connection:
- Check that the container is running with `docker-compose ps`.
- Confirm that Memgraph is listening on port 7687.
Memgraph Lab:
- Open the visual interface at http://localhost:3000 to inspect the graph directly.
Local Models:
- Verify that Ollama is running with `ollama list`.
- Make sure your target model is downloaded, for example with `ollama pull llama3`.
- Test the API response with `curl http://localhost:11434/v1/models`.
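The port checks above can also be scripted. The sketch below only tests whether a TCP connection can be opened, so it confirms that something is listening, not that the service behind the port is healthy. The function name and the service labels are hypothetical, and the ports are the defaults mentioned in this article.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    checks = {
        "Memgraph (Bolt)": 7687,
        "Memgraph Lab": 3000,
        "Ollama": 11434,
    }
    for label, port in checks.items():
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{label} on port {port}: {state}")
```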