Crawl4AI RAG MCP Server: Web Crawling and Vector Search for AI Agents

Published May 22 in Web Scraping Tools

The Crawl4AI RAG MCP Server bridges the Model Context Protocol (MCP) with Crawl4AI and Supabase. This integration provides AI agents and coding assistants with robust web crawling and retrieval-augmented generation (RAG) capabilities. With this server, an agent can navigate a website, ingest its content into a vector database, and perform semantic queries against that stored knowledge.

Core Features

  1. Smart URL Detection: The system automatically identifies and processes standard web pages, sitemaps, text files, and other common web formats.
  2. Recursive Crawling: It follows internal links to discover and scrape content deep within a site’s architecture.
  3. Parallel Processing: Multiple pages are crawled concurrently, which shortens total crawl time on multi-page sites.
  4. Content Chunking: Text is split at headers and capped by size so that each chunk is a coherent unit for embedding.
  5. Vector Search: You can run RAG queries against any crawled content, with the option to filter by source domain for more precise results.
  6. Source Retrieval: The server can provide a list of all indexed domains in the database to help refine RAG prompts.
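
To make the chunking feature more concrete, here is a minimal sketch of header-and-size splitting. It is an illustration under stated assumptions, not the server's actual implementation: the function name chunk_markdown, the max_chars parameter, and the choice to break oversized sections at paragraph boundaries are all hypothetical.

```python
import re


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headers, then cap each piece at max_chars (hypothetical helper)."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut repeatedly until the remainder fits.
        while len(section) > max_chars:
            # Prefer the last paragraph break that fits inside the size limit.
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars  # no paragraph break available: hard cut
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks
```

Splitting at headers first keeps each chunk on a single topic, and the size cap keeps every chunk within whatever input limit the embedding model imposes.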

Main Tools

  • crawl_single_page: Extracts data from a single URL and pushes it directly to the vector store.
  • smart_crawl_url: Detects the input type (sitemap, llms-full.txt, or standard web page) and runs the corresponding site-wide crawl.
  • get_available_sources: Retrieves a list of all domains currently indexed within the Supabase database.
  • perform_rag_query: Executes a semantic search across your crawled data. Results can be narrowed down to a specific domain if needed.
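
The input-type detection that smart_crawl_url performs could look roughly like the sketch below. The function name classify_url, the returned labels, and the matching rules are assumptions made for illustration; the server's own checks may differ.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Guess how a URL should be crawled (hypothetical rules, for illustration only)."""
    # Only the path is inspected, so query strings do not affect the result.
    path = urlparse(url).path.lower()
    if path.endswith(".xml") and "sitemap" in path:
        return "sitemap"    # crawl every URL listed in the sitemap
    if path.endswith(".txt"):
        return "text_file"  # e.g. llms-full.txt: fetch and ingest directly
    return "webpage"        # standard page: follow internal links


for candidate in (
    "https://example.com/sitemap.xml",
    "https://example.com/llms-full.txt",
    "https://example.com/docs/intro",
):
    print(candidate, "->", classify_url(candidate))
```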

Installation

Prerequisites

  • Docker or Docker Desktop (recommended for containerized deployment).
  • Python 3.12+ (if running the server locally via uv).
  • An active Supabase project to serve as your vector store.
  • An OpenAI API key for generating text embeddings.

Setup Steps

1. Docker Deployment (Recommended)

  • Clone the repository: git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
  • Navigate to the directory: cd mcp-crawl4ai-rag
  • Build the Docker image: docker build -t mcp/crawl4ai-rag --build-arg PORT=8051 .
  • Create a .env file following the configuration guidelines below.

2. Local Setup with uv

  • Clone the repository: git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
  • Navigate to the directory: cd mcp-crawl4ai-rag
  • Install uv if it is not already on your system: pip install uv
  • Create and activate a virtual environment:
    • Windows: uv venv, then .venv\Scripts\activate
    • macOS/Linux: uv venv, then source .venv/bin/activate
  • Install dependencies with uv pip install -e . and then run crawl4ai-setup
  • Create a .env file following the configuration guidelines below.

Database Configuration

  • Open the SQL Editor in your Supabase dashboard.
  • Copy the contents of the crawled_pages.sql file from the repository.
  • Paste the script into a new query and run it to initialize the necessary tables and functions.

Environment Variables

Create a .env file in the project root with the following variables:

# MCP Server Configuration
HOST=0.0.0.0
PORT=8051
TRANSPORT=sse

# OpenAI Configuration
OPENAI_API_KEY=your-openai-key

# Supabase Configuration
SUPABASE_URL=your-supabase-project-url
SUPABASE_SERVICE_KEY=your-supabase-service-key
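
A missing key is easier to diagnose at startup than in the middle of a crawl. The sketch below shows one way to validate these variables in Python. The helper name load_settings is hypothetical, and treating the HOST, PORT, and TRANSPORT values shown above as fallback defaults is an assumption, not documented server behavior.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_KEY")
# Assumption: the sample values from the .env file double as fallbacks.
DEFAULTS = {"HOST": "0.0.0.0", "PORT": "8051", "TRANSPORT": "sse"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Collect and validate the server settings (hypothetical helper)."""
    env = os.environ if env is None else env
    # Empty strings count as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_VARS}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    if settings["TRANSPORT"] not in ("sse", "stdio"):
        raise ValueError("TRANSPORT must be 'sse' or 'stdio'")
    int(settings["PORT"])  # raises ValueError if PORT is not numeric
    return settings
```

Calling load_settings() with no argument reads the process environment; passing a dictionary makes the check easy to exercise in tests.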

Launching the Server

  • Via Docker: docker run --env-file .env -p 8051:8051 mcp/crawl4ai-rag
  • Via Python: uv run src/crawl4ai_mcp.py

The server will now listen on the host and port defined in your environment variables.

MCP Client Integration

SSE Configuration

To connect via the Server-Sent Events (SSE) transport, use the following configuration in your client:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}

Note for Windsurf users: Use serverUrl instead of url.

Note for Docker users: If your client is running in a separate container, replace localhost with host.docker.internal.
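
The two notes above change only one key and one hostname, so the variants can be generated from a single function. This is a small sketch; build_sse_config and its parameters are hypothetical names introduced for illustration.

```python
import json


def build_sse_config(host: str = "localhost", port: int = 8051,
                     windsurf: bool = False, name: str = "crawl4ai-rag") -> dict:
    """Build the SSE client configuration shown above (hypothetical helper)."""
    # Windsurf expects "serverUrl"; other clients use "url".
    url_key = "serverUrl" if windsurf else "url"
    return {
        "mcpServers": {
            name: {
                "transport": "sse",
                url_key: f"http://{host}:{port}/sse",
            }
        }
    }


# Client in a separate container: reach the host through host.docker.internal.
print(json.dumps(build_sse_config(host="host.docker.internal"), indent=2))
```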

Stdio Configuration

For integration with Claude Desktop, Windsurf, or other standard MCP clients:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "python",
      "args": ["path/to/crawl4ai-mcp/src/crawl4ai_mcp.py"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your-openai-key",
        "SUPABASE_URL": "your-supabase-url",
        "SUPABASE_SERVICE_KEY": "your-supabase-service-key"
      }
    }
  }
}

Docker with Stdio

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": ["run", "--rm", "-i",
               "-e", "TRANSPORT",
               "-e", "OPENAI_API_KEY",
               "-e", "SUPABASE_URL",
               "-e", "SUPABASE_SERVICE_KEY",
               "mcp/crawl4ai-rag"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your-openai-key",
        "SUPABASE_URL": "your-supabase-url",
        "SUPABASE_SERVICE_KEY": "your-supabase-service-key"
      }
    }
  }
}

Customizing Your Server

This implementation provides a modular foundation for an MCP server with integrated web crawling. You can adapt it to your specific needs by:

  • Adding new functionality using the @mcp.tool() decorator.
  • Extending lifecycle methods to inject custom dependencies.
  • Adding specialized helper functions within utils.py.
  • Developing custom crawlers tailored to specific website structures or proprietary content formats.
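
As a rough illustration of the first option, the sketch below registers one extra tool. It assumes the FastMCP class from the MCP Python SDK (package name mcp); the server name, the tool extract_domain, and the ImportError fallback are hypothetical and exist only so the example runs on its own. In the real project you would decorate the function with the server's existing mcp instance instead of creating a new one.

```python
from urllib.parse import urlparse

try:
    # FastMCP ships with the MCP Python SDK.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("crawl4ai-rag-custom")  # hypothetical server name
    register_tool = mcp.tool()  # same effect as decorating with @mcp.tool()
except ImportError:
    # Fallback so the sketch runs without the SDK: leave the function undecorated.
    def register_tool(fn):
        return fn


@register_tool
def extract_domain(url: str) -> str:
    """Return the lowercased domain of a URL, e.g. to build a source filter."""
    return urlparse(url).netloc.lower()
```

A tool like this could help an agent turn a page URL into the domain value it passes to perform_rag_query as a source filter.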