CodeIndexer: Semantic Code Search for IDEs (AI-Powered)

Published July 12 in Developer Tools

CodeIndexer equips AI-powered IDEs with semantic indexing and codebase-wide context. By combining a vector database such as Milvus with embedding models from providers such as OpenAI, VoyageAI, and Ollama, it indexes your entire codebase so that you can search it in natural language. This addresses two problems on large projects: keyword search only matches literal text, and an LLM's context window cannot hold the whole codebase. Rather than hunting for specific terms, you describe the functionality you need and get back the code that implements it. The toolkit features context-aware discovery, AST-based smart chunking to preserve code structure, incremental file synchronization, and compatibility with several embedding providers. It is delivered as a core engine, a dedicated VS Code extension, and an MCP server.

What existing tools get wrong

• LLMs have limited context windows and often cannot process large codebases effectively.
• Regex and keyword searches overlook the structural relationships between code components.
• Many IDEs lack genuine context awareness and do not recognize how disparate parts of a project connect.
• Developers frequently lose time navigating complex repositories manually.
• Traditional search tools cannot bridge the gap between human intent and the underlying source code.

How CodeIndexer fixes it

• Context awareness – It recognizes the relationships between different modules and functions across the codebase.
• Semantic search – You can use plain English queries, such as “find the function that handles user authentication,” to locate relevant code.
• AI-driven logic – The system understands code intent and structural connectivity rather than just matching text strings.
• Cross-platform utility – Support for the Model Context Protocol (MCP) and a VS Code extension ensures it fits into various development environments.
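
To make the semantic search idea concrete, here is a minimal sketch of ranking code chunks by cosine similarity between a query embedding and chunk embeddings. The `Chunk` type, the `rankChunks` helper, and the tiny hand-made vectors are hypothetical illustrations, not CodeIndexer's API; in a real setup the embeddings would come from the configured embedding provider and the nearest-neighbor search would run in the vector database.

```typescript
// Hypothetical illustration: tiny hand-made vectors stand in for real embeddings.
interface Chunk {
  path: string;
  vector: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query and keep the topK best matches.
function rankChunks(
  query: number[],
  candidates: Chunk[],
  topK: number
): { path: string; score: number }[] {
  return candidates
    .map((chunk) => ({ path: chunk.path, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const chunks: Chunk[] = [
  { path: "src/auth/login.ts", vector: [0.9, 0.1, 0.0] },
  { path: "src/db/milvus.ts", vector: [0.1, 0.9, 0.2] },
  { path: "src/utils/format.ts", vector: [0.0, 0.2, 0.9] },
];

// Pretend this is the embedding of "find the function that handles user authentication".
const queryVector = [1.0, 0.2, 0.0];
const ranked = rankChunks(queryVector, chunks, 2);
console.log(ranked);
```

Because the ranking depends only on vector direction, a query and a code chunk can match even when they share no identifiers, which is what separates this approach from keyword search.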

Core features

Semantic code search – Locate code based on its purpose. For example, search for “functions that interact with vector databases” without needing to know specific variable or function names.

Smart indexing – Automatically processes the entire codebase to build a semantic vector database enriched with structural context.

Context-aware discovery – Identifies related code snippets based on functional meaning rather than literal text matches.

Incremental file sync – Utilizes Merkle trees for efficient change detection, ensuring that only modified files are re-indexed.

Smart chunking – Employs AST-based splitting to maintain code logic and context, with an automatic fallback mechanism for unsupported formats.

Accelerated development – Reduces the time spent searching for existing logic, allowing more focus on active feature development.

Multiple embedding services – Compatible with OpenAI, VoyageAI, Ollama, and other leading providers.

Vector storage – Optimized for use with Milvus or Zilliz Cloud (fully managed).

VS Code integration – Includes a native extension designed to integrate into your existing coding workflow.

MCP support – Features a Model Context Protocol server to facilitate interactions with AI agents.

Progress tracking – Provides real-time status updates throughout the indexing process.

Customizable – Offers granular control over file extensions, ignore patterns, and choice of embedding models.
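
The incremental file sync feature above relies on Merkle trees for change detection. The sketch below shows the general idea with a simplified two-level tree: one SHA-256 leaf hash per file and a single root hash over all leaves. If the root is unchanged, nothing needs re-indexing; otherwise only the differing leaves are reported. The `Snapshot` type and the `buildSnapshot` and `diffSnapshots` helpers are hypothetical names, and file contents are passed in memory to keep the example self-contained; CodeIndexer's actual tree layout may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical types for this sketch; not part of the CodeIndexer API.
interface Snapshot {
  root: string;                 // hash over all (path, leaf hash) pairs
  leaves: Map<string, string>;  // file path -> content hash
}

interface Changes {
  added: string[];
  modified: string[];
  removed: string[];
}

function sha256(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hash each file, then hash the sorted list of leaves to get a stable root.
function buildSnapshot(files: Map<string, string>): Snapshot {
  const leaves = new Map<string, string>();
  files.forEach((content, path) => leaves.set(path, sha256(content)));
  const sortedPaths = Array.from(leaves.keys()).sort();
  const root = sha256(sortedPaths.map((p) => `${p}:${leaves.get(p)}`).join("\n"));
  return { root, leaves };
}

// Compare roots first (cheap), and walk the leaves only when the roots differ.
function diffSnapshots(previous: Snapshot, current: Snapshot): Changes {
  const changes: Changes = { added: [], modified: [], removed: [] };
  if (previous.root === current.root) return changes;
  current.leaves.forEach((hash, path) => {
    const oldHash = previous.leaves.get(path);
    if (oldHash === undefined) changes.added.push(path);
    else if (oldHash !== hash) changes.modified.push(path);
  });
  previous.leaves.forEach((_hash, path) => {
    if (!current.leaves.has(path)) changes.removed.push(path);
  });
  return changes;
}

const before = buildSnapshot(new Map([
  ["src/a.ts", "export const a = 1;"],
  ["src/b.ts", "export const b = 2;"],
]));
const after = buildSnapshot(new Map([
  ["src/a.ts", "export const a = 1;"],
  ["src/b.ts", "export const b = 3;"],
  ["src/c.ts", "export const c = 4;"],
]));

const changes = diffSnapshots(before, after);
console.log(changes); // only src/b.ts and src/c.ts would need re-indexing
```

A deeper tree (one node per directory) would let the comparison skip whole unchanged subtrees; the flat version here trades that optimization for brevity.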

Architecture

User interfaces

Available in three formats: a Chrome extension (for specific workflows), a VS Code extension, and an MCP server.

Core components

CodeIndexer uses a monorepo architecture consisting of three primary packages:
• @code-indexer/core – The central engine that manages embeddings and vector database integration.
• VS Code extension – Provides semantic search capabilities directly within Visual Studio Code.
• @code-indexer/mcp – A Model Context Protocol server designed for AI agent communication.

Supported technologies

Embedding services: OpenAI, VoyageAI, Ollama.

Vector databases: Milvus or Zilliz Cloud (fully managed).

Code splitters: AST-based (with automatic fallback) and LangChain character splitters.

Languages: TypeScript, JavaScript, Python, Java, C++, C#, Go, Rust, PHP, Ruby, Swift, Kotlin, Scala, and Markdown.

Dev tools: VS Code, Model Context Protocol (MCP).
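
The "AST-based with automatic fallback" splitting strategy can be sketched as a small dispatcher: try a structure-aware splitter registered for the file extension, and fall back to a fixed-size character splitter when none exists or parsing fails. Everything here is an assumption made for illustration: `chunkSource`, `splitByCharacters`, and the blank-line `naiveStructuralSplitter` are hypothetical, and the latter is only a stand-in for a real AST parser. The fallback is a plain character window with overlap, not LangChain's implementation.

```typescript
type Splitter = (source: string) => string[];

// Fallback: fixed-size character windows that share `overlap` characters.
function splitByCharacters(source: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const pieces: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < source.length; start += step) {
    const end = Math.min(start + chunkSize, source.length);
    pieces.push(source.slice(start, end));
    if (end === source.length) break; // last window reached the end of the text
  }
  return pieces;
}

// Stand-in for an AST splitter (NOT a real parser): one chunk per blank-line-separated block.
const naiveStructuralSplitter: Splitter = (source) =>
  source
    .split(/\n\s*\n/)
    .map((block) => block.trim())
    .filter((block) => block.length > 0);

// Use the structural splitter when one is registered and succeeds; otherwise fall back.
function chunkSource(
  source: string,
  extension: string,
  structural: Map<string, Splitter>,
  fallback: Splitter
): { pieces: string[]; strategy: "structural" | "fallback" } {
  const splitter = structural.get(extension);
  if (splitter) {
    try {
      const pieces = splitter(source);
      if (pieces.length > 0) return { pieces, strategy: "structural" };
    } catch {
      // Parsing failed: fall through to the character splitter.
    }
  }
  return { pieces: fallback(source), strategy: "fallback" };
}

const structuralSplitters = new Map<string, Splitter>([[".ts", naiveStructuralSplitter]]);
const fallbackSplitter: Splitter = (source) => splitByCharacters(source, 4, 1);

const tsSource = "function a() {\n  return 1;\n}\n\nfunction b() {\n  return 2;\n}";
const tsChunks = chunkSource(tsSource, ".ts", structuralSplitters, fallbackSplitter);
const txtChunks = chunkSource("abcdefghij", ".txt", structuralSplitters, fallbackSplitter);

console.log(tsChunks.strategy, tsChunks.pieces.length);
console.log(txtChunks.strategy, txtChunks.pieces);
```

Keeping each function in its own chunk means a search hit returns a complete unit of logic, whereas the character fallback may cut through the middle of a declaration; the overlap softens that by repeating a little context at each boundary.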

Quick start

Prerequisites

• Node.js ≥ 20.0.0
• pnpm ≥ 10.0.0
• An active Milvus database
• An API key from OpenAI or VoyageAI

Installation

# Using npm
npm install @code-indexer/core

# Using pnpm
pnpm add @code-indexer/core

# Using yarn
yarn add @code-indexer/core

Environment variables

OpenAI API key – Obtain your key from the OpenAI dashboard and set:

OPENAI_API_KEY=your-openai-api-key

Milvus configuration – For Zilliz Cloud (the fully managed version of Milvus, which offers a free tier), set MILVUS_ADDRESS to your instance’s public endpoint and MILVUS_TOKEN to your Zilliz Cloud token:

MILVUS_ADDRESS=https://xxx-xxxxxxxxxxxx.serverless.gcp-us-west1.cloud.zilliz.com
MILVUS_TOKEN=xxxxxxx

If you are hosting your own Milvus instance, configure the address and token to match your local setup.
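
One way to fail fast on a missing variable is to read the environment once at startup. The sketch below is a hypothetical helper, not part of @code-indexer/core: `loadIndexerEnv` and `IndexerEnv` are illustrative names, and the defaults (`localhost:19530` and an empty token) simply mirror the fallbacks used in the basic usage example.

```typescript
// Hypothetical helper: validates the variables described above in one place.
interface IndexerEnv {
  openaiApiKey: string;
  milvusAddress: string;
  milvusToken: string;
}

function loadIndexerEnv(env: Record<string, string | undefined>): IndexerEnv {
  const openaiApiKey = env.OPENAI_API_KEY;
  if (!openaiApiKey) {
    throw new Error("OPENAI_API_KEY is not set");
  }
  return {
    openaiApiKey,
    // Assumed defaults for a local, unauthenticated Milvus instance.
    milvusAddress: env.MILVUS_ADDRESS ?? "localhost:19530",
    milvusToken: env.MILVUS_TOKEN ?? "",
  };
}

// A plain object stands in for process.env so the example runs anywhere.
const exampleEnv = loadIndexerEnv({ OPENAI_API_KEY: "your-openai-api-key" });
console.log(exampleEnv.milvusAddress);
```

In an application you would call `loadIndexerEnv(process.env)` and pass the resulting values to the embedding and vector database constructors.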

Basic usage example

The @code-indexer/core package serves as the primary engine for embeddings, vector storage, and search operations.

import { CodeIndexer, MilvusVectorDatabase, OpenAIEmbedding } from '@code-indexer/core';

// Embedding provider: converts code chunks and queries into vectors.
const embedding = new OpenAIEmbedding({
    apiKey: process.env.OPENAI_API_KEY || 'your-openai-api-key',
    model: 'text-embedding-3-small'
});

// Vector store: falls back to a local Milvus instance when no address is configured.
const vectorDatabase = new MilvusVectorDatabase({
    address: process.env.MILVUS_ADDRESS || 'localhost:19530',
    token: process.env.MILVUS_TOKEN || ''
});

const indexer = new CodeIndexer({ embedding, vectorDatabase });

// Index the project; the callback reports progress for each phase.
// (Top-level await requires an ES module context.)
const stats = await indexer.indexCodebase('./your-project', (progress) => {
    console.log(`${progress.phase} - ${progress.percentage}%`);
});
console.log(`Indexed ${stats.indexedFiles} files, ${stats.totalChunks} chunks`);

// Query the index in natural language and print the top 5 matches.
const results = await indexer.semanticSearch('./your-project', 'vector database operations', 5);
results.forEach(result => {
    console.log(`File: ${result.relativePath}:${result.startLine}-${result.endLine}`);
    console.log(`Score: ${(result.score * 100).toFixed(2)}%`);
    console.log(`Content: ${result.content.substring(0, 100)}...`);
});
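
Search results often contain several chunks from the same file. As a possible post-processing step, the sketch below drops low-scoring hits and groups the rest by file. The `SearchResult` interface only mirrors the fields printed in the example above (it is an assumption about the shape, not the package's published type), and `groupByFile`, the 0.5 threshold, and the sample data are made up for illustration.

```typescript
// Assumed shape, based on the fields used in the usage example above.
interface SearchResult {
  relativePath: string;
  startLine: number;
  endLine: number;
  score: number;
  content: string;
}

// Keep hits at or above minScore and bucket them by file path.
function groupByFile(hits: SearchResult[], minScore: number): Map<string, SearchResult[]> {
  const grouped = new Map<string, SearchResult[]>();
  for (const hit of hits) {
    if (hit.score < minScore) continue;
    const bucket = grouped.get(hit.relativePath) ?? [];
    bucket.push(hit);
    grouped.set(hit.relativePath, bucket);
  }
  return grouped;
}

// Made-up sample data standing in for the output of a semantic search.
const sampleResults: SearchResult[] = [
  { relativePath: "src/db/milvus.ts", startLine: 10, endLine: 42, score: 0.91, content: "class MilvusClient {}" },
  { relativePath: "src/db/milvus.ts", startLine: 50, endLine: 80, score: 0.74, content: "async function insert() {}" },
  { relativePath: "src/utils/format.ts", startLine: 1, endLine: 12, score: 0.35, content: "function pad() {}" },
];

const groupedResults = groupByFile(sampleResults, 0.5);
groupedResults.forEach((hits, path) => {
  const ranges = hits.map((h) => `${h.startLine}-${h.endLine}`).join(", ");
  console.log(`${path}: lines ${ranges}`);
});
```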

Extension packages

The following packages extend the functionality of @code-indexer/core. Each package includes comprehensive documentation and usage examples.

@code-indexer/mcp

This is a Model Context Protocol server that allows AI assistants and agents to communicate with CodeIndexer. It exposes indexing and search capabilities as standard MCP tools.

Configuration for different tools

Cursor – Add the following to ~/.cursor/mcp.json (global) or your project-specific .cursor/mcp.json:

{
  "mcpServers": {
    "code-indexer": {
      "command": "npx",
      "args": ["-y", "@code-indexer/mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "your-openai-api-key",
        "MILVUS_ADDRESS": "localhost:19530"
      }
    }
  }
}

Claude Desktop – Add this to your Claude Desktop configuration file (claude_desktop_config.json):

{
  "mcpServers": {
    "code-indexer": {
      "command": "npx",
      "args": ["@code-indexer/mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "your-openai-api-key",
        "MILVUS_ADDRESS": "localhost:19530"
      }
    }
  }
}

Other tools (Claude Code, Windsurf, VS Code, Cherry Studio, etc.) – Follow a similar configuration pattern. The universal execution command is: npx @code-indexer/mcp@latest

VS Code extension

You can install the extension directly from the VS Code Marketplace. Search for “Semantic Code Search” in the Extensions view (Ctrl+Shift+X or Cmd+Shift+X) and select Install.

Development setup

git clone https://github.com/zilliztech/CodeIndexer.git
cd CodeIndexer
pnpm install
pnpm build
pnpm dev

Build commands

pnpm build          # Build all packages
pnpm build:core     # Build core engine only
pnpm build:vscode   # Build VS Code extension only
pnpm build:mcp      # Build MCP server only

Run examples

cd examples/basic-usage
pnpm dev

Supported file extensions

Languages: .ts, .tsx, .js, .jsx, .py, .java, .cpp, .c, .h, .hpp, .cs, .go, .rs, .php, .rb, .swift, .kt, .scala, .m, .mm

Documentation: .md, .markdown

Ignored patterns

CodeIndexer automatically excludes the following directories and files: node_modules/**, dist/**, build/**, .git/**, .vscode/**, .idea/**, *.log, *.min.js, *.map
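
The extension list and ignore patterns above combine into a single yes/no decision per file. The following sketch approximates that decision without a glob library: directory patterns such as node_modules/** are treated as "any path segment with this name", and patterns such as *.min.js as filename suffixes. `shouldIndex` is a hypothetical helper, and this simplified matching is an assumption that may differ from how CodeIndexer evaluates its patterns.

```typescript
// Values copied from the lists above; the matching logic is a simplification.
const SUPPORTED_EXTENSIONS = new Set([
  ".ts", ".tsx", ".js", ".jsx", ".py", ".java", ".cpp", ".c", ".h", ".hpp",
  ".cs", ".go", ".rs", ".php", ".rb", ".swift", ".kt", ".scala", ".m", ".mm",
  ".md", ".markdown",
]);
const IGNORED_DIRECTORIES = new Set(["node_modules", "dist", "build", ".git", ".vscode", ".idea"]);
const IGNORED_SUFFIXES = [".log", ".min.js", ".map"];

// Returns true when a forward-slash relative path should be sent to the indexer.
function shouldIndex(relativePath: string): boolean {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1];
  const directories = segments.slice(0, -1);

  // Skip anything inside an ignored directory, at any depth.
  if (directories.some((dir) => IGNORED_DIRECTORIES.has(dir))) return false;
  // Skip logs, minified bundles, and source maps.
  if (IGNORED_SUFFIXES.some((suffix) => fileName.endsWith(suffix))) return false;

  // Files without an extension (or dotfiles such as ".gitignore") are not indexed.
  const dotIndex = fileName.lastIndexOf(".");
  if (dotIndex <= 0) return false;
  return SUPPORTED_EXTENSIONS.has(fileName.slice(dotIndex));
}

const candidatePaths = [
  "src/index.ts",
  "node_modules/pkg/index.js",
  "public/app.min.js",
  "README.md",
  "logs/server.log",
];
console.log(candidatePaths.filter(shouldIndex));
```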