Paperless GPT: Smarter OCR and Auto-Tagging for Paperless-NGX

Published June 9 in Text Tools

Paperless GPT serves as the intelligence layer for your Paperless-NGX installation. It takes over the tedious parts of manual filing by using AI to generate document titles and tags. Rather than relying on pixel-level character recognition alone, it combines traditional OCR with large language models (LLMs), so even a messy scan or a degraded fax usually comes back as clean, usable text.

The system integrates easily with OpenAI, Ollama, and other backends. Traditional OCR engines frequently struggle with wrinkled receipts or skewed invoices, but Paperless GPT uses context to fix jumbled characters and broken line endings. The resulting output is precise enough for reliable searching and sorting.

Choice of OCR Engines

You aren't restricted to a single provider. You can select the engine that best fits your privacy requirements or budget:

  • LLM OCR: Routes images directly to OpenAI or Ollama for text extraction.
  • Google Document AI: Utilizes Google’s enterprise-grade document parsing.
  • Azure Document Intelligence: Leverages Microsoft’s advanced document processing.
  • Docling Server: A self-hosted conversion service for maximum privacy.
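
As a rough sketch of how an engine might be selected, the fragment below sets the provider in the service's environment block. The variable names OCR_PROVIDER, VISION_LLM_PROVIDER, and VISION_LLM_MODEL, and the value "llm", are assumptions that this article does not state; check the project's README for the exact names before relying on them.

```yaml
# Sketch only: the variable names and values below are assumptions,
# not confirmed by this article.
services:
  paperless-gpt:
    environment:
      OCR_PROVIDER: "llm"             # assumed name; selects the LLM OCR engine
      VISION_LLM_PROVIDER: "openai"   # assumed name; backend that receives page images
      VISION_LLM_MODEL: "gpt-4o"      # assumed name; must be a vision-capable model
```

Switching to Google Document AI, Azure Document Intelligence, or a Docling server would presumably mean a different provider value plus that service's own credentials or URL.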

Automated Metadata That Works

  • Titles and Tags: The AI suggests names and categories for each file. You can manually override these suggestions if the model misses a specific detail.
  • Correspondent Detection: Automatically identifies the sender of a letter, bill, or notice.
  • Date Generation: Determines the document's creation date to keep your digital archive chronologically organized.

The tool generates a PDF containing an invisible text layer. While the document's appearance remains identical to the original scan, you gain the ability to highlight and search for specific words. These files can be stored locally or pushed directly back into Paperless-NGX.

Processing Modes

  • Image Mode (Default): Converts every page into an image format. This mode is compatible with all providers.
  • PDF Mode: Processes raw page data directly. This is often faster and produces sharper results for text-based PDFs.
  • Whole Page PDF Mode: Submits the entire document as a single unit. This is highly effective for reducing API costs when handling large files.
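
A minimal sketch of choosing a processing mode through the environment follows. The variable name OCR_PROCESS_MODE and the value spellings are assumptions that do not appear in this article, so treat them as placeholders until verified against the project's documentation.

```yaml
# Sketch only: OCR_PROCESS_MODE and its values are assumed, not taken from this article.
services:
  paperless-gpt:
    environment:
      # "image" would correspond to the default Image Mode described above.
      # The other two modes are assumed to be spelled "pdf" and "whole_pdf".
      OCR_PROCESS_MODE: "image"
```

Since Image Mode is described as compatible with all providers, it is the safer starting point; the PDF modes may be worth trying once the chosen provider is known to accept PDF input.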

Environment Setup

You can customize the system’s behavior using environment variables:

  • LLM Selection: Connect to OpenAI, Mistral, or Ollama and choose specific models such as gpt-4o or qwen3:8b.
  • OCR Rules: Configure your preferred provider, processing mode, and whether you require hOCR output.
  • Safety Guards: Limit the number of processed pages via OCR_LIMIT_PAGES or skip files that already contain text using PDF_SKIP_EXISTING_OCR.
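
The settings above could be combined along these lines. LLM_PROVIDER, LLM_MODEL, OCR_LIMIT_PAGES, and PDF_SKIP_EXISTING_OCR are named in this article; OLLAMA_HOST, the host URL, and the specific values are illustrative assumptions.

```yaml
# Sketch under stated assumptions; the values are examples, not recommendations.
services:
  paperless-gpt:
    environment:
      LLM_PROVIDER: "ollama"
      LLM_MODEL: "qwen3:8b"
      # Assumed variable name and address for reaching a local Ollama instance.
      OLLAMA_HOST: "http://host.docker.internal:11434"
      # Safety guards named in the article; values are illustrative.
      OCR_LIMIT_PAGES: "5"             # process at most this many pages per document
      PDF_SKIP_EXISTING_OCR: "true"    # skip files that already contain text
```

Boolean and numeric values are quoted so that the YAML parser passes them to the container as plain strings.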

Deploy with Docker

You can run Paperless GPT alongside Paperless-NGX using Docker Compose. Simply configure your API details and launch the service:

services:
  paperless-gpt:
    image: icereed/paperless-gpt:latest
    environment:
      PAPERLESS_BASE_URL: "http://paperless-ngx:8000"
      PAPERLESS_API_TOKEN: "your_api_token"
      LLM_PROVIDER: "openai"
      LLM_MODEL: "gpt-4o"
      OPENAI_API_KEY: "your_openai_key"
    volumes:
      - ./prompts:/app/prompts
    ports:
      - "8080:8080"
    depends_on:
      - paperless-ngx

Advanced Features

  • hOCR and Text Layers: Google Document AI is the recommended choice for generating hOCR files and ensuring precise text positioning for searchability.
  • Local Storage: Use the CREATE_LOCAL_HOCR and CREATE_LOCAL_PDF settings to maintain copies of all processed files on your local disk.
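
One way the local-storage settings might look in a Compose file is sketched below. CREATE_LOCAL_HOCR and CREATE_LOCAL_PDF come from this article; the LOCAL_HOCR_PATH and LOCAL_PDF_PATH variables and the directory paths are assumptions added for illustration.

```yaml
# Sketch only: the path variables and directories are assumed, not confirmed here.
services:
  paperless-gpt:
    environment:
      CREATE_LOCAL_HOCR: "true"
      CREATE_LOCAL_PDF: "true"
      LOCAL_HOCR_PATH: "/app/hocr"   # assumed variable name and container path
      LOCAL_PDF_PATH: "/app/pdf"     # assumed variable name and container path
    volumes:
      # Map the container directories to the host so the copies survive restarts.
      - ./hocr:/app/hocr
      - ./pdf:/app/pdf
```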

Practical Advice

  • Avoid Blind Replacements: The PDF_REPLACE function overwrites the original file. It is wise to back up your data before enabling this.
  • Use Completion Tags: Configure PDF_OCR_COMPLETE_TAG so the system identifies processed files and avoids scanning the same document twice.
  • Monitor Token Usage: If you are running a local model through Ollama, adjust TOKEN_LIMIT so that long contracts aren't truncated during processing.
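
The three tips above might translate into a cautious starting configuration like this one. The variable names come from this article; the tag name and the token value are illustrative assumptions and should be adapted to your own setup and model.

```yaml
# Sketch only: values are illustrative placeholders.
services:
  paperless-gpt:
    environment:
      # Keep originals untouched until backups are in place.
      PDF_REPLACE: "false"
      # Tag applied to processed documents so they are not scanned twice
      # (the tag name here is an example).
      PDF_OCR_COMPLETE_TAG: "paperless-gpt-ocr-complete"
      # Example value; align it with the context window of the model in use.
      TOKEN_LIMIT: "8000"
```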

The Real-World Difference

Example 1: A Crinkled Receipt

  • Standard OCR: Distorts the address. "LOUVAIN LA NEUVE" might appear as gibberish like "LOLNAIN LA NEWWE."
  • LLM OCR (GPT-4o): Recognizes the context, corrects the spelling to "Louvain-la-Neuve," and accurately aligns the financial totals.

Example 2: A Messy Invoice

  • Standard OCR: Scrambles table layouts and misinterprets date formats.
  • LLM OCR (GPT-4o): Preserves the table structure, cleans up dates and currency symbols, and makes the document readable without manual correction.