Fast RAG is a local, privacy-focused Retrieval-Augmented Generation system built on FastAPI. It uses PostgreSQL with pgvector for dense vector search and pg_trgm for trigram-based lexical matching. Together they form a hybrid search engine that can retrieve relevant text more precisely than either method alone, especially when a query mixes general concepts with exact terms.
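The hybrid idea can be illustrated without a database. The sketch below is an approximation, not Fast RAG's code: it scores a chunk by blending cosine similarity (pgvector's `<=>` operator returns cosine distance, i.e. 1 minus this value) with a simplified version of pg_trgm's trigram `similarity()`. The weighted-sum fusion and the `alpha=0.7` weight are hypothetical; the project's actual fusion method isn't described here.

```python
import math
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm: lowercase, keep alphanumeric words,
    pad each word with two leading spaces and one trailing space."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i : i + 3])
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dense score: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_score(query: str, query_vec: list[float],
                 chunk: str, chunk_vec: list[float],
                 alpha: float = 0.7) -> float:
    """Hypothetical fusion: alpha weights the dense score, (1 - alpha) the lexical one."""
    dense = cosine_similarity(query_vec, chunk_vec)
    lexical = trigram_similarity(query, chunk)
    return alpha * dense + (1 - alpha) * lexical
```

When two chunks are equally close in embedding space, the lexical term breaks the tie in favour of the one that contains the query's exact wording, which is the behaviour the hybrid design is after.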
Docling handles document parsing for PDFs, DOCX files, images, and PPTX decks. LangGraph structures the ingestion and query pipelines as graphs of explicit steps. For the LLM layer, you can run local models through Ollama or supply an OpenAI API key. Answers stream in real time over Server-Sent Events (SSE). A React frontend is included in the repository, but the backend is designed to be used on its own and modified to fit your needs.
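SSE is a simple text protocol: each event is one or more `data:` lines followed by a blank line. The sketch below shows that framing on both ends. The `{"token": ...}` payload shape and the `[DONE]` sentinel are assumptions for illustration, not Fast RAG's documented wire format. In FastAPI, a generator like `sse_events` could be returned through `StreamingResponse(..., media_type="text/event-stream")`.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: wrap each generated token in an SSE 'data:' frame."""
    for token in tokens:
        # JSON-encode so a newline inside a token cannot break the framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Hypothetical end-of-stream sentinel; the real API may signal completion differently.
    yield "data: [DONE]\n\n"


def parse_sse(stream: str) -> list[str]:
    """Client side: recover the tokens from a complete SSE payload."""
    tokens: list[str] = []
    for event in stream.split("\n\n"):
        for line in event.splitlines():
            if line.startswith("data: "):
                payload = line[len("data: "):]
                if payload == "[DONE]":
                    return tokens
                tokens.append(json.loads(payload)["token"])
    return tokens
```

A real browser client would read frames incrementally (for example with `EventSource` or a streaming `fetch`), appending each token to the answer as it arrives.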
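Runs Locally. With local models through Ollama, your documents and models never leave your hardware: there are no cloud dependencies and no privacy trade-offs to manage. If you opt for the OpenAI API instead, prompts and retrieved context are sent to OpenAI.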
Hybrid Retrieval. By merging vector similarity with keyword search, the system identifies both conceptual matches and exact terminology within a single query.
Broad File Support. You can process PDFs, Word documents, PowerPoint decks, or images. Docling converts these various formats into plain text before the chunking process begins.
Fast Setup. A single Docker command initializes the entire stack, including the pgvector database. You also have the option to run it directly on your host machine.
Clean API. The system offers standard REST endpoints and SSE streaming, making it simple to build a custom UI or integrate the backend into your existing toolset.
Optional Frontend. The bundled React application provides a functional, ready-to-use interface right out of the box.
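After Docling produces text, it still has to be split into chunks before embedding. The chunking strategy Fast RAG uses isn't specified here, so the following is a generic sliding-window sketch; `chunk_text`, the word-based windows, and the 200/40 defaults are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of chunk_size words; neighbouring windows share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of storing some text twice.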
1. Install Python packages and download models
pip install -r requirements.txt
docling-tools models download
2. Configure environment variables
cp env.example .env
# Define your database connection strings in the .env file
3. Initialize the database (Skip this step if you are using Docker)
python scripts/init_db.py
4. Launch the server
python main.py
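The steps above don't show what `scripts/init_db.py` creates. Below is a sketch of the DDL a pgvector + pg_trgm hybrid store generally needs, written as Python strings so it could be passed to a driver such as psycopg. The table name `chunks`, its columns, the 768-dimension size, and the 0.7/0.3 weights are all assumptions, not taken from the project.

```python
# Hypothetical schema sketch: names, columns, and dimension are assumptions.
EMBEDDING_DIM = 768  # must match the embedding model you configure


def schema_statements(dim: int = EMBEDDING_DIM) -> list[str]:
    """DDL for a table searchable by both vector and trigram similarity."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        "CREATE EXTENSION IF NOT EXISTS pg_trgm;",
        (
            "CREATE TABLE IF NOT EXISTS chunks ("
            "id bigserial PRIMARY KEY, "
            "document text NOT NULL, "
            "content text NOT NULL, "
            f"embedding vector({dim}) NOT NULL);"
        ),
        # Approximate nearest-neighbour index for cosine distance (pgvector 0.5.0+).
        "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
        "ON chunks USING hnsw (embedding vector_cosine_ops);",
        # Trigram GIN index so similarity lookups on content avoid a full scan.
        "CREATE INDEX IF NOT EXISTS chunks_content_trgm_idx "
        "ON chunks USING gin (content gin_trgm_ops);",
    ]


# Hybrid query using psycopg-style named placeholders (assumed driver).
HYBRID_QUERY = """
SELECT id, content,
       1 - (embedding <=> %(qvec)s::vector) AS dense_score,
       similarity(content, %(qtext)s)       AS lexical_score
FROM chunks
ORDER BY 0.7 * (1 - (embedding <=> %(qvec)s::vector))
       + 0.3 * similarity(content, %(qtext)s) DESC
LIMIT %(k)s;
"""
```

Ordering by a blended expression is easy to read but cannot use the HNSW index, so larger deployments often fetch the top candidates from each signal separately and fuse the two lists afterwards.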
Once the server is active, you can access these addresses:
• Application: http://localhost:8000
• API Documentation: http://localhost:8000/docs
• ReDoc: http://localhost:8000/redoc
The repository includes a Docker configuration with pgvector preconfigured. A single command starts both the database and the API server, so you can begin uploading documents right away.
To run the React interface separately:
cd frontend-app
npm install
npm run dev
The development server will start at http://localhost:5173, communicating with the FastAPI backend running on port 8000.
Fast RAG operates using a modular pipeline:
Document Flow: Upload → Convert → Split → Embed → Store.
Search Layer: The system combines vector similarity scores and lexical keyword hits to return the most contextually relevant document chunks.
Answer Generation: The retrieved context is passed to your selected LLM to generate a final answer. Streaming is enabled by default to reduce perceived latency.
LangGraph Integration: The orchestration layer remains transparent. You can inspect every step of the process or extend the graph with custom nodes to suit your specific workflow.
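The flow above can be sketched end to end in plain Python. This is not LangGraph's API and not Fast RAG's implementation: each stage is a hypothetical function that takes and returns a shared state dict, `run_pipeline` executes them in order like a linear graph, the hashed bag-of-words embedding stands in for a real embedding model, and retrieval here uses only the dense score.

```python
import hashlib
import math
import re
from typing import Callable

# Shared state passed from node to node (a plain dict, not LangGraph's state type).
State = dict
Node = Callable[[State], State]


def convert(state: State) -> State:
    # Stand-in for Docling: assume the upload is already plain text.
    return {**state, "text": state["upload"].strip()}


def split(state: State) -> State:
    # Naive sentence split; a real pipeline would use a proper chunker.
    chunks = [s.strip() for s in state["text"].split(".") if s.strip()]
    return {**state, "chunks": chunks}


def embed_text(text: str, dim: int = 256) -> list[float]:
    # Toy hashed bag-of-words embedding; md5 keeps it deterministic across runs.
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def embed(state: State) -> State:
    return {**state, "vectors": [embed_text(c) for c in state["chunks"]]}


def store(state: State) -> State:
    # In-memory rows standing in for a pgvector table.
    return {**state, "rows": list(zip(state["chunks"], state["vectors"]))}


def run_pipeline(nodes: list[Node], state: State) -> State:
    # Execute nodes in order, like a linear graph with no branches.
    for node in nodes:
        state = node(state)
    return state


def build_prompt(question: str, rows: list[tuple[str, list[float]]], k: int = 2) -> str:
    # Rank chunks by dot product (equal to cosine, since vectors are unit length).
    q = embed_text(question)
    ranked = sorted(rows, key=lambda row: sum(a * b for a, b in zip(q, row[1])), reverse=True)
    context = "\n".join(chunk for chunk, _ in ranked[:k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Calling `run_pipeline([convert, split, embed, store], {"upload": text})` ingests a document, and `build_prompt` produces the context-grounded prompt that would be sent to the selected LLM. Keeping each stage a separate function is what makes it easy to inspect intermediate state or insert a custom node, which is the same property the LangGraph layer provides.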