DeepDoc Turns Local Files Into AI Research Reports (No Cloud Needed)

September 8. Published in Document Analysis Tools

DeepDoc is a local research tool that analyzes your private files and extracts insights from them. It processes a variety of formats, including PDFs, Word documents, scans, and text files, by splitting the content into chunks and indexing them in a vector database. When you submit a query in plain English, DeepDoc generates a content outline and deploys a multi-agent AI system to search your local data, refine the findings, and synthesize the information. The result is a structured Markdown report. Because the system runs entirely on your hardware, no data is uploaded to the cloud, and you maintain full control over the LLM configurations.

How It Works Under the Hood

  1. Ingest: You provide the files. PDF, DOCX, JPG, and TXT formats are all supported.
  2. Extract: The system parses the raw text and segments it into page-based chunks.
  3. Store: These chunks are indexed in a vector database to power semantic search.
  4. Outline: DeepDoc proposes a report structure based on your specific query.
  5. Refine: You provide feedback, and the system adjusts the outline accordingly.
  6. Assign: The tool generates specific chapters and topics for each section.
  7. Investigate: For every chapter, a dedicated research agent performs the following:
    • Identifies core facts.
    • Drafts targeted search queries.
    • Executes searches across the chunked local data.
    • Uses a reflection agent to audit and sharpen the results.
    • Writes the final text for that section.
  8. Compile: Individual chapters are sent to the report writer for assembly.
  9. Output: The system generates a complete Markdown report.
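
Steps 2 and 3 (page-based chunking, then semantic search over the chunks) can be illustrated with a small sketch. This is not DeepDoc's code: the names Chunk, chunk_by_page, and ToyIndex are hypothetical, and a toy bag-of-words similarity stands in for the real embedding model and vector database.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file the text came from
    page: int    # 1-based page number
    text: str


def chunk_by_page(source: str, pages: list[str]) -> list[Chunk]:
    """Turn a list of page texts into one chunk per non-empty page."""
    return [
        Chunk(source=source, page=number, text=text.strip())
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real setup calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """In-memory stand-in for the vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, Chunk]] = []

    def add(self, chunks: list[Chunk]) -> None:
        self._items.extend((embed(chunk.text), chunk) for chunk in chunks)

    def search(self, query: str, top_k: int = 3) -> list[Chunk]:
        query_vector = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[0]),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:top_k]]


index = ToyIndex()
index.add(chunk_by_page("report.pdf", [
    "Revenue grew 12 percent in the third quarter.",
    "",  # blank pages are skipped
    "Headcount stayed flat across all regions.",
]))
best = index.search("How did revenue change?", top_k=1)[0]
print(best.source, best.page)
```

Keeping the source file and page number on every chunk is what lets a later stage point back to where a statement came from.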

The Workflow Pipeline

The Local DeepResearcher module operates on two inputs: your files and your instructions.

  • Input: Files (PDF, DOCX, JPG, TXT) and a natural language query.
  • Processing:
    • Text Extractor: Extracts raw text from the source files.
    • Chapter Generator: Develops an outline based on your query and iterative feedback.
    • Research Agent: A multi-stage workflow consisting of a knowledge builder, query writer, search executor, reflection critic, and section writer.
  • Output: The Final Report Generator delivers a polished, formatted document.
  • Backbone: A vector database stores the text chunks, serving as the engine for search and analysis.
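
The five research-agent stages listed above can be pictured as one function that passes a small state object from stage to stage. The sketch below is only an illustration under stated assumptions: research_section, SectionState, and fake_search are hypothetical names, and every stage that would call an LLM in a real system is replaced by a deterministic stub.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SectionState:
    topic: str
    facts: list[str] = field(default_factory=list)
    queries: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)


def research_section(
    topic: str,
    search: Callable[[str], list[str]],
    max_queries: int = 3,
    num_reflections: int = 2,
) -> str:
    state = SectionState(topic=topic)

    # Knowledge builder: a real agent would ask an LLM what the section must establish.
    state.facts = ["overview", "figures", "risks"]

    # Query writer: turn those facts into at most max_queries search strings.
    state.queries = [f"{topic} {fact}" for fact in state.facts][:max_queries]

    # Search executor: run every query against the local chunks.
    for query in state.queries:
        state.evidence.extend(search(query))

    # Reflection critic: each pass here only drops blank and duplicate hits;
    # a real critic could also judge relevance or request follow-up searches.
    for _ in range(num_reflections):
        state.evidence = list(dict.fromkeys(e for e in state.evidence if e.strip()))

    # Section writer: assemble a Markdown section from the surviving evidence.
    bullets = "\n".join(f"- {item}" for item in state.evidence)
    return f"## {topic}\n\n{bullets}\n"


def fake_search(query: str) -> list[str]:
    """Stand-in for a vector-database lookup."""
    if "figures" in query:
        return ["Revenue grew 12 percent.", ""]
    return ["Revenue grew 12 percent."]


section = research_section("Revenue", fake_search)
print(section)
```

Passing the search function in as an argument keeps the stages testable without a running database.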

Deploying DeepDoc Locally

Prerequisite: Install uv

DeepDoc uses uv to manage virtual environments and dependencies. Download it from the official uv GitHub repository and follow the installation instructions for your operating system.

1. Clone the Repository

git clone https://github.com/Datalore-ai/deepdoc.git
cd deepdoc

2. Create a Virtual Environment

uv venv

3. Activate the Environment

  • Windows:
    .venv\Scripts\activate
    
  • macOS / Linux:
    source .venv/bin/activate
    

4. Set Environment Variables

Copy the template file and configure your keys:

cp .env.example .env

You must add your API keys for the tool to function:

MISTRAL_API_KEY=
TAVILY_API_KEY=
OPENAI_API_KEY=

# Default settings
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=knowledge_base
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
QDRANT_DISABLE_THREADING=true
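
A quick pre-flight check can catch a blank key before the first run. The sketch below is an assumption-laden helper, not part of DeepDoc: load_settings is a hypothetical name, and it inspects the process environment, so it presumes the values from .env have already been exported or loaded.

```python
import os

REQUIRED_KEYS = ("MISTRAL_API_KEY", "TAVILY_API_KEY", "OPENAI_API_KEY")
DEFAULTS = {
    "QDRANT_URL": "http://localhost:6333",
    "COLLECTION_NAME": "knowledge_base",
    "EMBEDDING_MODEL": "BAAI/bge-small-en-v1.5",
    "QDRANT_DISABLE_THREADING": "true",
}


def load_settings(env: dict[str, str]) -> dict[str, str]:
    """Return the settings, raising if a required key is missing or blank."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise RuntimeError("Missing values in .env: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_KEYS}
    for key, default in DEFAULTS.items():
        # Fall back to the documented default when a setting is not overridden.
        settings[key] = env.get(key, default)
    return settings


if __name__ == "__main__":
    try:
        load_settings(dict(os.environ))
        print("Environment looks complete.")
    except RuntimeError as err:
        print(err)
```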

5. Install Dependencies

uv pip install -r requirements.txt

6. Start Qdrant with Docker

Ensure Docker is running and Docker Compose is installed, then start the services:

docker-compose up --build

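This starts the Qdrant vector database. As written, the command stays attached to the terminal, so run the next step in a second terminal, or add the -d flag to run the services in the background.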
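
Before launching the application, it can help to confirm that Qdrant is answering at the configured URL. The sketch below is a hypothetical helper (qdrant_is_up is not a DeepDoc function) that uses only the standard library to request Qdrant's REST endpoint for listing collections. It assumes a default local instance without an API key.

```python
import os
import urllib.request


def qdrant_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/collections answers with HTTP 200."""
    url = base_url.rstrip("/") + "/collections"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts all derive from OSError.
        return False


if __name__ == "__main__":
    base_url = os.environ.get("QDRANT_URL", "http://localhost:6333")
    if qdrant_is_up(base_url):
        print(f"Qdrant is reachable at {base_url}")
    else:
        print(f"Qdrant is not reachable at {base_url}")
```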

7. Launch the Application

python main.py

The CLI will guide you through the report creation process. Completed reports are saved in the output_files folder.

Customizing Behavior

You can modify DeepDoc’s operational logic in configuration.py. Two main blocks control the system's behavior:

import uuid

# LLM settings
LLM_CONFIG = {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "temperature": 0.5,  # Adjust for creativity vs. precision
}

# Research loop parameters
THREAD_CONFIG = {
    "configurable": {
        "thread_id": str(uuid.uuid4()),
        "max_queries": 3,
        "search_depth": 2,
        "num_reflections": 2,
        "n_points": 1,
    }
}

You can switch models, adjust the "temperature" to control randomness, or increase num_reflections to force the agents to perform more rigorous quality checks on the data they retrieve.
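
One detail of the block above: str(uuid.uuid4()) is evaluated once, when configuration.py is imported, so every run inside the same Python process shares one thread_id. If separate runs should get separate IDs, a small factory is one option. The sketch below is an assumption about how you might structure this, not DeepDoc's code; make_thread_config and BASE_THREAD_SETTINGS are hypothetical names, and whether a fresh thread_id matters depends on how the application uses it.

```python
import uuid

# Same values as the THREAD_CONFIG block, minus the thread_id.
BASE_THREAD_SETTINGS = {
    "max_queries": 3,
    "search_depth": 2,
    "num_reflections": 2,
    "n_points": 1,
}


def make_thread_config(**overrides: int) -> dict:
    """Build a config with a fresh thread_id and optional per-run overrides."""
    unknown = set(overrides) - set(BASE_THREAD_SETTINGS)
    if unknown:
        # Reject typos such as num_reflection instead of silently ignoring them.
        raise KeyError(f"Unknown setting(s): {sorted(unknown)}")
    configurable = {**BASE_THREAD_SETTINGS, **overrides}
    configurable["thread_id"] = str(uuid.uuid4())
    return {"configurable": configurable}


thorough = make_thread_config(num_reflections=4)
print(thorough["configurable"]["num_reflections"])
```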