Fast RAG: Deploy a Private Hybrid Search RAG Stack Locally

September 23 · Published in RAG Tools

Fast RAG is a local, privacy-focused Retrieval-Augmented Generation system. It is built on FastAPI and uses PostgreSQL with pgvector for dense vector search, complemented by pg_trgm for trigram-based lexical matching. Together they form a hybrid search engine: vector search catches passages that are semantically related to the query, trigram matching catches exact terms and near-spellings, and the combination can surface relevant text that either method alone would miss.

Docling manages the heavy lifting of document parsing, supporting PDFs, DOCX files, images, and PPTX decks. LangGraph provides the structure for the ingestion and query pipelines, ensuring a reliable flow of data. For the LLM layer, you can run local models via Ollama or plug in an OpenAI API key if preferred. The system supports real-time response streaming through Server-Sent Events (SSE). While a React-based frontend is included in the repository, the core is designed to be flexible and open to custom modifications.

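Runs Locally. With Ollama as the LLM backend, your data and models never leave your hardware, so there are no cloud dependencies and no privacy trade-offs to manage. If you opt for an OpenAI API key instead, prompts and retrieved context are sent to OpenAI.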

Hybrid Retrieval. By merging vector similarity with keyword search, the system identifies both conceptual matches and exact terminology within a single query.

Broad File Support. You can process PDFs, Word documents, PowerPoint decks, or images. Docling converts these various formats into plain text before the chunking process begins.

Fast Setup. A single Docker command initializes the entire stack, including the pgvector database. You also have the option to run it directly on your host machine.

Clean API. The system offers standard REST endpoints and SSE streaming, making it simple to build a custom UI or integrate the backend into your existing toolset.

Optional Frontend. The bundled React application provides a functional, ready-to-use interface right out of the box.

Local Installation

1. Install Python packages and download models

pip install -r requirements.txt
docling-tools models download

2. Configure environment variables

cp env.example .env
# Define your database connection strings in the .env file

3. Initialize the database (Skip this step if you are using Docker)

python scripts/init_db.py

4. Launch the server

python main.py

Once the server is active, you can access these addresses:

• Application: http://localhost:8000
• API Documentation: http://localhost:8000/docs
• Redoc: http://localhost:8000/redoc

Deploy with Docker

The repository ships a Docker configuration with pgvector already set up. A single command stands up both the database and the API server, so you can begin uploading documents immediately.

Optional Frontend

To run the React interface separately:

cd frontend-app
npm install
npm run dev

The development server will start at http://localhost:5173, communicating with the FastAPI backend running on port 8000.

Under the Hood

Fast RAG operates using a modular pipeline:

Document Flow: Upload → Convert → Split → Embed → Store.
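As a rough illustration of the Split step, here is a minimal sketch of fixed-size chunking with overlap. The source does not say which chunking strategy Fast RAG uses, so the function name `split_text` and the `chunk_size` and `overlap` defaults are assumptions made for this example only.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper: Fast RAG's real splitter may work on tokens,
    sentences, or document structure instead of raw characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.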

Search Layer: The system combines vector similarity scores and lexical keyword hits to return the most contextually relevant document chunks.
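To make the idea of merging the two signals concrete, the sketch below scores chunks in memory with a weighted sum of cosine similarity and a trigram overlap modeled on how pg_trgm compares strings. It is an illustration, not the project's query: in Fast RAG the scores would presumably come from PostgreSQL itself, and the weight `alpha`, the function names, and the toy two-dimensional embeddings are all assumptions.

```python
import math
import re


def trigrams(text: str) -> set[str]:
    """Trigram set per word, padded in the style of pg_trgm ('  word ')."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_score(vector_score: float, lexical_score: float, alpha: float = 0.7) -> float:
    """Weighted sum; alpha is an assumed weight, not a documented default."""
    return alpha * vector_score + (1 - alpha) * lexical_score


def rank(query: str, query_vec: list[float],
         chunks: list[tuple[str, list[float]]], alpha: float = 0.7):
    """Return (score, text) pairs sorted from best to worst match."""
    scored = [
        (hybrid_score(cosine_similarity(query_vec, vec),
                      trigram_similarity(query, text), alpha), text)
        for text, vec in chunks
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy data: real embeddings would have hundreds of dimensions.
chunks = [
    ("pgvector stores embeddings", [1.0, 0.0]),
    ("React frontend", [0.0, 1.0]),
]
results = rank("pgvector", [1.0, 0.0], chunks)
```

A weighted sum is only one way to fuse the two rankings. Whatever fusion Fast RAG actually applies, the point is the same: a chunk can rank highly because it is semantically close, because it contains the exact term, or both.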

Answer Generation: The retrieved context is passed to your selected LLM to generate a final answer. Streaming is enabled by default to reduce perceived latency.
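For readers building their own client, it helps to see what an SSE stream looks like on the wire. The sketch below frames tokens as Server-Sent Events using only the standard library. The helper names and the closing `done` event are hypothetical; the source does not document Fast RAG's actual event names or payload format.

```python
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Event frame.

    Each line of the payload gets its own 'data:' field, and a blank
    line terminates the event, as the SSE wire format requires.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per generated token, then a final marker.

    The 'done' event name is an assumption for illustration only.
    """
    for token in tokens:
        yield sse_event(token)
    yield sse_event("", event="done")


frames = list(stream_tokens(["Hel", "lo"]))
```

In a FastAPI application, a generator like this could be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, which lets the browser render tokens as they arrive instead of waiting for the full answer.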

LangGraph Integration: The orchestration layer remains transparent. You can inspect every step of the process or extend the graph with custom nodes to suit your specific workflow.