Vision-Language-Action Models

LLM Training

Video Foundation Models

Image Tools

Dictionaries & Lexicons

Cryptocurrency Tools

Watermark Removal Tools

OCR Tools

Voice Interaction Models

AI Service Tools

ToolBoost >> OCR Tools >> DeepSeek OCR: Extract Text and Visual Data With This React FastAPI App

DeepSeek OCR: Extract Text and Visual Data With This React FastAPI App

10月27日 Published inOCR Tools

DeepSeek OCR: Smart Web Image Recognition With React and FastAPI

DeepSeek OCR integrates a React frontend with a FastAPI backend to provide streamlined text extraction and image analysis. The application offers four primary ways to interact with images: raw text extraction, intelligent scene description, visual term localization, and custom prompting for specialized tasks. It manages large files via drag-and-drop, outputs clean HTML or Markdown, and utilizes dynamic image cropping to prevent performance bottlenecks on high-resolution photos. Coordinates remain accurate through a standardized scaling system, and most configurations are handled via a .env file for easy deployment.

Getting Started

Clone the repository and configure the environment

git clone <repository-url>
cd deepseek_ocr_app

# Copy the example and edit the settings
cp .env.example .env
# Edit the .env file to configure ports, upload limits, and other flags

Launch the application

docker compose up --build

The initial run will download the model (approximately 5–10 GB). Download time depends on your network connection.

Access the services

• Frontend: http://localhost:3000 (or your defined FRONTEND_PORT)

• Backend API: http://localhost:8000 (or your defined API_PORT)

• API Documentation: http://localhost:8000/docs

Four OCR Modes

• Plain OCR: Extracts raw text from any image.

• Describe Mode: Generates a descriptive caption for the image content.

• Find Mode: Locates a specific term and highlights it with a bounding box.

• Freeform Mode: Processes custom prompts for unique user-defined tasks.

Frontend Features

• Glassmorphism UI with dynamic gradients.

• Drag-and-drop uploads (100 MB default limit).

• Quick-action buttons to remove or re-upload images.

• Precise bounding box overlays using scaled coordinates.

• Fluid animations powered by Framer Motion.

• Options to copy results to the clipboard or download them.

• Advanced settings accessible via a dropdown menu.

• Native rendering for both HTML and Markdown outputs.

• Support for multiple bounding boxes when terms appear repeatedly.

Configuration

The application is customized through the .env file. Key parameters include:

# API settings
API_HOST=0.0.0.0
API_PORT=8000

# Frontend port
FRONTEND_PORT=3000

# Model settings
MODEL_NAME=deepseek-ai/DeepSeek-OCR
HF_HOME=/models

# Upload limits
MAX_UPLOAD_SIZE_MB=100

# Processing parameters
BASE_SIZE=1024
IMAGE_SIZE=640
CROP_MODE=true

Parameter Details:

• API_HOST: Backend listener address (default: 0.0.0.0).

• API_PORT: Backend port (default: 8000).

• FRONTEND_PORT: Frontend port (default: 3000).

• MODEL_NAME: HuggingFace model identifier.

• HF_HOME: Directory for the model cache.

• MAX_UPLOAD_SIZE_MB: Maximum allowed file size in megabytes.

• BASE_SIZE: Base processing dimension (affects VRAM usage).

• IMAGE_SIZE: Tile size used for dynamic cropping.

• CROP_MODE: Toggles tiling for large images.

Hardware Requirements

• GPU: NVIDIA GPU with CUDA support.

• Recommended: RTX 3090, RTX 4090, RTX 5090, or higher.

• Minimum: 8–12 GB VRAM to load the model.

• Note: Higher VRAM results in faster processing and better stability.

Software Requirements

• Docker and Docker Compose (latest versions).

• NVIDIA Drivers.

Installing NVIDIA Drivers on Ubuntu (for RTX 5090)

Install the open-source driver (v580 or newer)

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt remove --purge nvidia*
sudo nvidia-installer --uninstall  # if removing an existing manual install
sudo apt autoremove
sudo apt install nvidia-driver-580-open

Upgrade the kernel to 6.11+ (Ubuntu 24.04 LTS)

sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
sudo update-initramfs -u
sudo apt autoremove
sudo reboot

Enable Resizable BAR in UEFI/BIOS

• Enter UEFI settings during boot (F2, Del, or F12).

• Enable "Resizable BAR" or "Smart Access Memory."

• This usually requires "Above 4G Decoding" to be enabled and "CSM" to be disabled.

• Save and exit.

Verify Installation

nvidia-smi

The RTX 5090 should appear in the device list.

NVIDIA Container Toolkit

Essential for providing Docker with GPU access. Follow the official installation guide at: [docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

Additional Requirements

• Disk Space: ~20 GB for model weights and Docker images.

• System RAM: 16 GB or higher recommended.

• Internet: Stable connection for the initial 5–10 GB download.

Version History and Bug Fixes

Image State Management (v2.1.1)
- Issue: Users could not clear an uploaded image to start a new session.
- Fix: Added a "Remove" button that fully resets the image and result state.
Multi-Box Rendering (v2.1.1)
- Issue: Only one box would render even when multiple coordinates were provided.
- Fix: Updated parsing logic with ast.literal_eval to handle both single boxes and arrays.
Coordinate Scaling (v2.1)
- Issue: Bounding boxes were misaligned with the image.
- Fix: The model outputs normalized coordinates (0–999). The backend now scales these to actual pixels using: (normalized / 999) * image_dimension.
Rendering Logic (v2.1)
- Issue: HTML data (like tables) was incorrectly rendered as Markdown text.
- Fix: Implemented format detection and dangerouslySetInnerHTML for proper HTML rendering.
Nginx Payload Limits (v2.1)
- Issue: Large image uploads failed due to server-side restrictions.
- Fix: Set client_max_body_size to 100 MB, adjustable via .env.
Mode Optimization (v2.1.1)
- Change: Simplified to 4 core modes.
- Reason: Specialized modes (Layout, PII, etc.) are undergoing further validation and will be reintroduced later.

Technical Implementation

1. Coordinate System

DeepSeek OCR uses a normalized system (0–999) to ensure consistency. • Model outputs always fall within the [0, 999] range. • The backend scales these to the image's native resolution: pixel = (model_output / 999) * dimension.

2. Dynamic Tiling

To maintain accuracy on high-resolution images: • Small images (≤ 640x640) are processed as a single unit. • Larger images are split into tiles based on their aspect ratio. • The system combines a global overview (BASE_SIZE) with high-resolution local tiles (IMAGE_SIZE).

3. Output Formats

• Plain Text: Raw strings. • Table Mode: HTML or CSV representations. • JSON Mode: Structured data objects. • Localization: Tagged text using <|ref|> and <|det|> markers for object detection.

API Reference

Endpoint

POST /api/ocr

Parameters

Parameter	Type	Description
image	file	Required; image file up to 100 MB
mode	string	`plain_ocr`, `describe`, `find_ref`, or `freeform`
prompt	string	Custom prompt used for `freeform` mode
grounding	boolean	Enables bounding box data (forced for `find_ref`)
find_term	string	The specific term to locate in `find_ref` mode
base_size	integer	Base processing scale (default 1024)
image_size	integer	Tile size for cropping (default 640)
crop_mode	boolean	Toggles tiling logic (default true)

Sample Response

{
  "success": true,
  "text": "Extracted text content...",
  "boxes": [{"label": "term", "box": [x1, y1, x2, y2]}],
  "image_dims": {"w": 1920, "h": 1080},
  "metadata": {
    "mode": "find_ref",
    "grounding": true
  }
}

Use Case Examples

1. Complex Scene Analysis

Upload a photo of a crowded street. The app can identify specific text (like a "No Parking" sign) while simultaneously describing the scene: "A busy urban intersection at dusk with several vehicles and pedestrians."

2. Data Extraction

Upload a financial chart or a printed table. The system extracts the values and formats them as a clean HTML table, ready for use in spreadsheets or reports.

Troubleshooting

1. GPU Detection Issues

# Verify local drivers
nvidia-smi

# Test Docker GPU passthrough
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

2. Port Conflicts

If the application fails to start, check if ports 3000 or 8000 are in use:

sudo lsof -i :3000
sudo lsof -i :8000

3. Frontend Build Errors

If the UI doesn't update or load correctly:

cd frontend
rm -rf node_modules package-lock.json
docker compose build frontend

▶ Visit

Related Tools

DeepSeek OCR: Extract Text and Visual Data With This React FastAPI App

DeepSeek-OCR: High-Speed Visual Text Compression That Actually Works

Jules Extension for Gemini CLI: Asynchronous Tasks and Automated GitHub PRs

Earth Copilot: Query Geospatial Data Using Natural Language

YPrompt Review: Build Better AI Prompts With This Smart Tool

Smart-Admin Setup Guide: Environment, Backend, Frontend, and Mobile Deployment

Build AI Agent Interfaces Faster with agents-ui-kit

LandPPT: Create AI-Powered Presentations from Any Document

Flameshot CLI Guide: Capture, Edit, and Upload Screenshots Rapidly

Chinese Kinship Calculator: Instantly Decode Family Relationship Terms

Mantis: A Smarter Vision-Language-Action Model for Robots

OpenThoughts-Agent: Train Small AI Models with HPC Scale

ClipSketch AI: Frame-Accurate Video Tagging & AI Storyboard Generation

Tencent HunyuanVideo-1.5: 8.3B Video Model Runs on 14GB GPUs

HiChunk Review: Smarter Chunking for RAG Pipelines

Build Agent Kurama: A Private Local Research Assistant with LangChain & Ollama

GRAG: Continuous Image Editing Control for DiT Models

AI Multi-Agent Stock Trading System: GPT-5 and Claude 4.5 Sonnet

Wan2.2-Animate: Local Setup Guide for Image-to-Video and Character Consistency

ReCode: Recursive Code Generation for LLM Agents

Vision-Language-Action Models

HiChunk Review: Smarter Chunking for RAG Pipelines

GRAG: Continuous Image Editing Control for DiT Models

MiMo-Audio: 100M-Hour Pretrained Model for Few-Shot Speech Tasks

Web Codegen Scorer: Test AI-Generated Web Code Quality Before You Ship

Smart-Admin Setup Guide: Environment, Backend, Frontend, and Mobile Deployment

Space Adventure Story Voice Mode: Build an AI-Powered Voice Game

Strapi Setup Guide: Local Development & Cloud Deployment

Transformers Library: Installation, Pipeline API, and Model Examples

Gmail AutoAuth MCP Server: Control Gmail via Claude Desktop

Build Web Apps Using Only SQL: A Guide to SQLPage

Cuby Text: Open-Source Block-Based Knowledge Management

MCP SuperAssistant: Bring MCP Tools to ChatGPT, Gemini, and Beyond