DeepSeek OCR: Extract Text and Visual Data With This React FastAPI App

October 27, published in OCR Tools

DeepSeek OCR: Smart Web Image Recognition With React and FastAPI

DeepSeek OCR integrates a React frontend with a FastAPI backend to provide streamlined text extraction and image analysis. The application offers four primary ways to interact with images: raw text extraction, intelligent scene description, visual term localization, and custom prompting for specialized tasks. It manages large files via drag-and-drop, outputs clean HTML or Markdown, and utilizes dynamic image cropping to prevent performance bottlenecks on high-resolution photos. Coordinates remain accurate through a standardized scaling system, and most configurations are handled via a .env file for easy deployment.

Getting Started

  1. Clone the repository and configure the environment
git clone <repository-url>
cd deepseek_ocr_app

# Copy the example and edit the settings
cp .env.example .env
# Edit the .env file to configure ports, upload limits, and other flags
  2. Launch the application
docker compose up --build

The initial run will download the model (approximately 5–10 GB). Download time depends on your network connection.

  3. Access the services

Frontend: http://localhost:3000 (or your defined FRONTEND_PORT)

Backend API: http://localhost:8000 (or your defined API_PORT)

API Documentation: http://localhost:8000/docs

Four OCR Modes

Plain OCR: Extracts raw text from any image.

Describe Mode: Generates a descriptive caption for the image content.

Find Mode: Locates a specific term and highlights it with a bounding box.

Freeform Mode: Processes custom prompts for unique user-defined tasks.

Frontend Features

• Glassmorphism UI with dynamic gradients.

• Drag-and-drop uploads (100 MB default limit).

• Quick-action buttons to remove or re-upload images.

• Precise bounding box overlays using scaled coordinates.

• Fluid animations powered by Framer Motion.

• Options to copy results to the clipboard or download them.

• Advanced settings accessible via a dropdown menu.

• Native rendering for both HTML and Markdown outputs.

• Support for multiple bounding boxes when terms appear repeatedly.

Configuration

The application is customized through the .env file. Key parameters include:

# API settings
API_HOST=0.0.0.0
API_PORT=8000

# Frontend port
FRONTEND_PORT=3000

# Model settings
MODEL_NAME=deepseek-ai/DeepSeek-OCR
HF_HOME=/models

# Upload limits
MAX_UPLOAD_SIZE_MB=100

# Processing parameters
BASE_SIZE=1024
IMAGE_SIZE=640
CROP_MODE=true

Parameter Details:

API_HOST: Backend listener address (default: 0.0.0.0).

API_PORT: Backend port (default: 8000).

FRONTEND_PORT: Frontend port (default: 3000).

MODEL_NAME: HuggingFace model identifier.

HF_HOME: Directory for the model cache.

MAX_UPLOAD_SIZE_MB: Maximum allowed file size in megabytes.

BASE_SIZE: Base processing dimension (affects VRAM usage).

IMAGE_SIZE: Tile size used for dynamic cropping.

CROP_MODE: Toggles tiling for large images.
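
To make the role of each variable concrete, here is a minimal sketch of how a backend might read these values from the environment. The `Settings` class and `load_settings` function are hypothetical names introduced for illustration, and the fallback values simply mirror the example .env above; the app's actual loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:  # hypothetical container, not the app's actual class
    api_host: str
    api_port: int
    frontend_port: int
    model_name: str
    hf_home: str
    max_upload_size_mb: int
    base_size: int
    image_size: int
    crop_mode: bool

    @property
    def max_upload_bytes(self) -> int:
        """Upload limit converted from megabytes to bytes."""
        return self.max_upload_size_mb * 1024 * 1024


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),
        frontend_port=int(env.get("FRONTEND_PORT", "3000")),
        model_name=env.get("MODEL_NAME", "deepseek-ai/DeepSeek-OCR"),
        hf_home=env.get("HF_HOME", "/models"),
        max_upload_size_mb=int(env.get("MAX_UPLOAD_SIZE_MB", "100")),
        base_size=int(env.get("BASE_SIZE", "1024")),
        image_size=int(env.get("IMAGE_SIZE", "640")),
        crop_mode=_as_bool(env.get("CROP_MODE", "true")),
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to try different settings without touching the real environment.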

Hardware Requirements

GPU: NVIDIA GPU with CUDA support.

Recommended: RTX 3090, RTX 4090, RTX 5090, or higher.

Minimum: 8–12 GB VRAM to load the model.

Note: Higher VRAM results in faster processing and better stability.

Software Requirements

• Docker and Docker Compose (latest versions).

• NVIDIA Drivers.

Installing NVIDIA Drivers on Ubuntu (for RTX 5090)
  1. Install the open-source driver (v580 or newer)
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt remove --purge 'nvidia*'  # quoted so the shell does not expand the pattern
sudo nvidia-installer --uninstall  # if removing an existing manual install
sudo apt autoremove
sudo apt install nvidia-driver-580-open
  2. Upgrade the kernel to 6.11+ (Ubuntu 24.04 LTS)
sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
sudo update-initramfs -u
sudo apt autoremove
sudo reboot
  3. Enable Resizable BAR in UEFI/BIOS

• Enter UEFI settings during boot (F2, Del, or F12).

• Enable "Resizable BAR" or "Smart Access Memory."

• This usually requires "Above 4G Decoding" to be enabled and "CSM" to be disabled.

• Save and exit.

  4. Verify Installation
nvidia-smi

The RTX 5090 should appear in the device list.

NVIDIA Container Toolkit

Essential for providing Docker with GPU access. Follow the official installation guide at: [docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

Additional Requirements

Disk Space: ~20 GB for model weights and Docker images.

System RAM: 16 GB or higher recommended.

Internet: Stable connection for the initial 5–10 GB download.

Version History and Bug Fixes

  1. Image State Management (v2.1.1)

    • Issue: Users could not clear an uploaded image to start a new session.
    • Fix: Added a "Remove" button that fully resets the image and result state.
  2. Multi-Box Rendering (v2.1.1)

    • Issue: Only one box would render even when multiple coordinates were provided.
    • Fix: Updated parsing logic with ast.literal_eval to handle both single boxes and arrays.
  3. Coordinate Scaling (v2.1)

    • Issue: Bounding boxes were misaligned with the image.
    • Fix: The model outputs normalized coordinates (0–999). The backend now scales these to actual pixels using: (normalized / 999) * image_dimension.
  4. Rendering Logic (v2.1)

    • Issue: HTML data (like tables) was incorrectly rendered as Markdown text.
    • Fix: Implemented format detection and dangerouslySetInnerHTML for proper HTML rendering.
  5. Nginx Payload Limits (v2.1)

    • Issue: Large image uploads failed due to server-side restrictions.
    • Fix: Set client_max_body_size to 100 MB, adjustable via .env.
  6. Mode Optimization (v2.1.1)

    • Change: Simplified to 4 core modes.
    • Reason: Specialized modes (Layout, PII, etc.) are undergoing further validation and will be reintroduced later.
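
The format detection mentioned in fix 4 can be illustrated with a simple heuristic. This is only a sketch in Python of one possible approach (the app performs its detection before handing HTML to the React renderer, and its actual rules are not shown here); the tag list and the `detect_format` name are assumptions.

```python
import re

# Assumed set of tags that signal HTML output (tables, block elements, lists).
HTML_TAG = re.compile(
    r"<(table|thead|tbody|tr|td|th|div|p|ul|ol|li|h[1-6]|br)\b[^>]*>",
    re.IGNORECASE,
)


def detect_format(text: str) -> str:
    """Return "html" if the text contains a known HTML tag, else "markdown"."""
    return "html" if HTML_TAG.search(text) else "markdown"
```

A plain comparison such as `a < b` is not mistaken for a tag, because the pattern requires a known tag name immediately after the opening bracket.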

Technical Implementation

1. Coordinate System

DeepSeek OCR uses a normalized coordinate system (0–999) to ensure consistency.

• Model outputs always fall within the [0, 999] range.

• The backend scales these to the image's native resolution: pixel = (model_output / 999) * dimension.
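
The scaling formula above translates directly into code. The sketch below assumes boxes arrive as [x1, y1, x2, y2] and rounds to whole pixels; the `scale_box` name and the rounding step are illustrative choices, not necessarily what the backend does.

```python
from typing import Sequence, Tuple

NORMALIZED_MAX = 999  # upper bound of the model's coordinate range


def scale_box(
    box: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized [x1, y1, x2, y2] box to pixel coordinates.

    x values are scaled by the image width, y values by the image height.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 / NORMALIZED_MAX * width),
        round(y1 / NORMALIZED_MAX * height),
        round(x2 / NORMALIZED_MAX * width),
        round(y2 / NORMALIZED_MAX * height),
    )
```

For example, a box covering the full normalized range on a 1920x1080 image maps to the full image: `scale_box([0, 0, 999, 999], 1920, 1080)` gives `(0, 0, 1920, 1080)`.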

2. Dynamic Tiling

To maintain accuracy on high-resolution images:

• Small images (≤ 640x640) are processed as a single unit.

• Larger images are split into tiles based on their aspect ratio.

• The system combines a global overview (BASE_SIZE) with high-resolution local tiles (IMAGE_SIZE).
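
One way to picture the aspect-ratio-based split is as a search for the tile grid whose shape best matches the image. The sketch below is an assumption about how such a choice could be made, not the model's actual preprocessing: the `choose_tile_grid` name, the `max_tiles` cap of 9, and the preference for fewer tiles on ties are all hypothetical.

```python
from typing import Tuple


def choose_tile_grid(
    width: int, height: int, image_size: int = 640, max_tiles: int = 9
) -> Tuple[int, int]:
    """Pick a (columns, rows) grid for tiling an image.

    Images that fit inside one tile are not split. Otherwise, choose the
    grid with 2..max_tiles tiles whose aspect ratio is closest to the
    image's, preferring fewer tiles when two grids match equally well.
    """
    if width <= image_size and height <= image_size:
        return (1, 1)

    aspect = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if 2 <= cols * rows <= max_tiles
    ]
    # Sort key: closeness of the grid's aspect ratio first, tile count second.
    return min(
        candidates,
        key=lambda grid: (abs(aspect - grid[0] / grid[1]), grid[0] * grid[1]),
    )
```

Under these assumptions a 1920x1080 photo would be split into a 2x1 grid, while a tall 1000x3000 scan would get a 1x3 grid; each tile would then be resized to IMAGE_SIZE alongside the BASE_SIZE overview.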

3. Output Formats

• Plain Text: Raw strings.

• Table Mode: HTML or CSV representations.

• JSON Mode: Structured data objects.

• Localization: Tagged text using <|ref|> and <|det|> markers for object detection.
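
As an illustration of the localization format, the following sketch extracts labels and boxes from tagged text, using ast.literal_eval so that both a single box and a list of boxes are accepted. It assumes each marker has a matching closing form (`<|/ref|>` and `<|/det|>`) and that boxes are four-number lists; the `parse_boxes` name is hypothetical.

```python
import ast
import re
from typing import Dict, List

# Assumed layout: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
TAG_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_boxes(text: str) -> List[Dict[str, object]]:
    """Return one {"label", "box"} entry per box found in the tagged text."""
    results: List[Dict[str, object]] = []
    for label, raw in TAG_PATTERN.findall(text):
        try:
            value = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            continue  # skip coordinates that are not a valid literal
        if not isinstance(value, (list, tuple)) or not value:
            continue
        # A flat list of numbers is a single box; wrap it so both
        # shapes are handled by the same loop.
        if all(isinstance(v, (int, float)) for v in value):
            value = [value]
        for box in value:
            if isinstance(box, (list, tuple)) and len(box) == 4:
                results.append({"label": label.strip(), "box": list(box)})
    return results
```

The coordinates returned here are still in the normalized 0–999 range and would need scaling to pixels before being drawn.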

API Reference

Endpoint

POST /api/ocr

Parameters

| Parameter | Type | Description |
|---|---|---|
| image | file | Required; image file up to 100 MB |
| mode | string | plain_ocr, describe, find_ref, or freeform |
| prompt | string | Custom prompt used for freeform mode |
| grounding | boolean | Enables bounding box data (forced for find_ref) |
| find_term | string | The specific term to locate in find_ref mode |
| base_size | integer | Base processing scale (default 1024) |
| image_size | integer | Tile size for cropping (default 640) |
| crop_mode | boolean | Toggles tiling logic (default true) |
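
A minimal client sketch for this endpoint follows. It assumes the parameters are sent as multipart form fields with booleans serialized as "true"/"false", which the interactive docs at /docs can confirm; `build_ocr_fields` and `post_ocr` are hypothetical helper names, and `post_ocr` relies on the third-party requests package.

```python
from typing import Any, Dict, Optional

VALID_MODES = {"plain_ocr", "describe", "find_ref", "freeform"}


def build_ocr_fields(
    mode: str,
    prompt: Optional[str] = None,
    find_term: Optional[str] = None,
    grounding: bool = False,
    base_size: int = 1024,
    image_size: int = 640,
    crop_mode: bool = True,
) -> Dict[str, str]:
    """Validate the arguments and return them as string form fields."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "find_ref" and not find_term:
        raise ValueError("find_ref mode needs find_term")
    if mode == "freeform" and not prompt:
        raise ValueError("freeform mode needs prompt")

    fields = {
        "mode": mode,
        # Grounding is forced on for find_ref, as the table notes.
        "grounding": str(grounding or mode == "find_ref").lower(),
        "base_size": str(base_size),
        "image_size": str(image_size),
        "crop_mode": str(crop_mode).lower(),
    }
    if prompt:
        fields["prompt"] = prompt
    if find_term:
        fields["find_term"] = find_term
    return fields


def post_ocr(
    image_path: str,
    fields: Dict[str, str],
    url: str = "http://localhost:8000/api/ocr",
) -> Dict[str, Any]:
    """Upload the image with the given fields and return the parsed JSON."""
    import requests  # third-party; imported here so the helper above has no dependency

    with open(image_path, "rb") as handle:
        response = requests.post(
            url, data=fields, files={"image": handle}, timeout=300
        )
    response.raise_for_status()
    return response.json()
```

A call such as `post_ocr("receipt.png", build_ocr_fields("find_ref", find_term="Total"))` would then return a dictionary shaped like the sample response below, provided the service is running locally.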

Sample Response

{
  "success": true,
  "text": "Extracted text content...",
  "boxes": [{"label": "term", "box": [x1, y1, x2, y2]}],
  "image_dims": {"w": 1920, "h": 1080},
  "metadata": {
    "mode": "find_ref",
    "grounding": true
  }
}
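
To draw an overlay, a client has to map each box from the image's native resolution (image_dims) to the size at which the image is displayed. The arithmetic is sketched below in Python for consistency with the other examples, although the app does this in its React frontend. It assumes the boxes in the response are already in native pixels, as described in the coordinate scaling fix; `to_display_rect` is a hypothetical name.

```python
from typing import Dict, Sequence


def to_display_rect(
    box: Sequence[float],
    image_dims: Dict[str, int],
    display_w: float,
    display_h: float,
) -> Dict[str, float]:
    """Convert a native-pixel [x1, y1, x2, y2] box to an overlay rectangle."""
    scale_x = display_w / image_dims["w"]
    scale_y = display_h / image_dims["h"]
    x1, y1, x2, y2 = box
    return {
        "left": x1 * scale_x,
        "top": y1 * scale_y,
        "width": (x2 - x1) * scale_x,
        "height": (y2 - y1) * scale_y,
    }
```

With a 1920x1080 image shown at 960x540, every coordinate is halved, so a box at [192, 108, 960, 540] becomes a rectangle at left 96, top 54, with width 384 and height 216.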

Use Case Examples

1. Complex Scene Analysis

Upload a photo of a crowded street. The app can identify specific text (like a "No Parking" sign) while simultaneously describing the scene: "A busy urban intersection at dusk with several vehicles and pedestrians."

2. Data Extraction

Upload a financial chart or a printed table. The system extracts the values and formats them as a clean HTML table, ready for use in spreadsheets or reports.

Troubleshooting

1. GPU Detection Issues

# Verify local drivers
nvidia-smi

# Test Docker GPU passthrough
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

2. Port Conflicts

If the application fails to start, check if ports 3000 or 8000 are in use:

sudo lsof -i :3000
sudo lsof -i :8000
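
Where lsof is unavailable, the same check can be approximated from Python by attempting a TCP connection to each port. This is a minimal sketch using only the standard library; `port_in_use` is a hypothetical helper, and it only detects services that accept connections on the given host.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (3000, 8000):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"port {candidate}: {state}")
```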

3. Frontend Build Errors

If the UI doesn't update or load correctly:

cd frontend
rm -rf node_modules package-lock.json
docker compose build frontend