DeepSeek OCR integrates a React frontend with a FastAPI backend to provide streamlined text extraction and image analysis. The application offers four primary ways to interact with images: raw text extraction, intelligent scene description, visual term localization, and custom prompting for specialized tasks. It manages large files via drag-and-drop, outputs clean HTML or Markdown, and utilizes dynamic image cropping to prevent performance bottlenecks on high-resolution photos. Coordinates remain accurate through a standardized scaling system, and most configurations are handled via a .env file for easy deployment.
git clone <repository-url>
cd deepseek_ocr_app
# Copy the example and edit the settings
cp .env.example .env
# Edit the .env file to configure ports, upload limits, and other flags
docker compose up --build
The initial run will download the model (approximately 5–10 GB). Download time depends on your network connection.
• Frontend: http://localhost:3000 (or your defined FRONTEND_PORT)
• Backend API: http://localhost:8000 (or your defined API_PORT)
• API Documentation: http://localhost:8000/docs
• Plain OCR: Extracts raw text from any image.
• Describe Mode: Generates a descriptive caption for the image content.
• Find Mode: Locates a specific term and highlights it with a bounding box.
• Freeform Mode: Processes custom prompts for unique user-defined tasks.
• Glassmorphism UI with dynamic gradients.
• Drag-and-drop uploads (100 MB default limit).
• Quick-action buttons to remove or re-upload images.
• Precise bounding box overlays using scaled coordinates.
• Fluid animations powered by Framer Motion.
• Options to copy results to the clipboard or download them.
• Advanced settings accessible via a dropdown menu.
• Native rendering for both HTML and Markdown outputs.
• Support for multiple bounding boxes when terms appear repeatedly.
The application is customized through the .env file. Key parameters include:
# API settings
API_HOST=0.0.0.0
API_PORT=8000
# Frontend port
FRONTEND_PORT=3000
# Model settings
MODEL_NAME=deepseek-ai/DeepSeek-OCR
HF_HOME=/models
# Upload limits
MAX_UPLOAD_SIZE_MB=100
# Processing parameters
BASE_SIZE=1024
IMAGE_SIZE=640
CROP_MODE=true
Parameter Details:
• API_HOST: Backend listener address (default: 0.0.0.0).
• API_PORT: Backend port (default: 8000).
• FRONTEND_PORT: Frontend port (default: 3000).
• MODEL_NAME: HuggingFace model identifier.
• HF_HOME: Directory for the model cache.
• MAX_UPLOAD_SIZE_MB: Maximum allowed file size in megabytes.
• BASE_SIZE: Base processing dimension (affects VRAM usage).
• IMAGE_SIZE: Tile size used for dynamic cropping.
• CROP_MODE: Toggles tiling for large images.
• GPU: NVIDIA GPU with CUDA support.
• Recommended: RTX 3090, RTX 4090, RTX 5090, or higher.
• Minimum: 8–12 GB VRAM to load the model.
• Note: Higher VRAM results in faster processing and better stability.
• Docker and Docker Compose (latest versions).
• NVIDIA Drivers.
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt remove --purge nvidia*
sudo nvidia-installer --uninstall # if removing an existing manual install
sudo apt autoremove
sudo apt install nvidia-driver-580-open
sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
sudo update-initramfs -u
sudo apt autoremove
sudo reboot
• Enter UEFI settings during boot (F2, Del, or F12).
• Enable "Resizable BAR" or "Smart Access Memory."
• This usually requires "Above 4G Decoding" to be enabled and "CSM" to be disabled.
• Save and exit.
nvidia-smi
The RTX 5090 should appear in the device list.
Essential for providing Docker with GPU access. Follow the official installation guide at: [docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
• Disk Space: ~20 GB for model weights and Docker images.
• System RAM: 16 GB or higher recommended.
• Internet: Stable connection for the initial 5–10 GB download.
Image State Management (v2.1.1)
Multi-Box Rendering (v2.1.1)
ast.literal_eval to handle both single boxes and arrays.Coordinate Scaling (v2.1)
(normalized / 999) * image_dimension.Rendering Logic (v2.1)
dangerouslySetInnerHTML for proper HTML rendering.Nginx Payload Limits (v2.1)
client_max_body_size to 100 MB, adjustable via .env.Mode Optimization (v2.1.1)
DeepSeek OCR uses a normalized system (0–999) to ensure consistency.
• Model outputs always fall within the [0, 999] range.
• The backend scales these to the image's native resolution: pixel = (model_output / 999) * dimension.
To maintain accuracy on high-resolution images:
• Small images (≤ 640x640) are processed as a single unit.
• Larger images are split into tiles based on their aspect ratio.
• The system combines a global overview (BASE_SIZE) with high-resolution local tiles (IMAGE_SIZE).
• Plain Text: Raw strings.
• Table Mode: HTML or CSV representations.
• JSON Mode: Structured data objects.
• Localization: Tagged text using <|ref|> and <|det|> markers for object detection.
POST /api/ocr
| Parameter | Type | Description |
|---|---|---|
| image | file | Required; image file up to 100 MB |
| mode | string | plain_ocr, describe, find_ref, or freeform |
| prompt | string | Custom prompt used for freeform mode |
| grounding | boolean | Enables bounding box data (forced for find_ref) |
| find_term | string | The specific term to locate in find_ref mode |
| base_size | integer | Base processing scale (default 1024) |
| image_size | integer | Tile size for cropping (default 640) |
| crop_mode | boolean | Toggles tiling logic (default true) |
{
"success": true,
"text": "Extracted text content...",
"boxes": [{"label": "term", "box": [x1, y1, x2, y2]}],
"image_dims": {"w": 1920, "h": 1080},
"metadata": {
"mode": "find_ref",
"grounding": true
}
}
Upload a photo of a crowded street. The app can identify specific text (like a "No Parking" sign) while simultaneously describing the scene: "A busy urban intersection at dusk with several vehicles and pedestrians."
Upload a financial chart or a printed table. The system extracts the values and formats them as a clean HTML table, ready for use in spreadsheets or reports.
# Verify local drivers
nvidia-smi
# Test Docker GPU passthrough
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
If the application fails to start, check if ports 3000 or 8000 are in use:
sudo lsof -i :3000
sudo lsof -i :8000
If the UI doesn't update or load correctly:
cd frontend
rm -rf node_modules package-lock.json
docker compose build frontend
Jules Extension for Gemini CLI: Asynchronous Tasks and Automated GitHub PRs
AI Podcast Transcriber Turns Audio Into Clean Text and Smart Summaries
Kodi Setup Guide: Building a Powerful Media Center on Any Device
Chatterbox TTS API: Open Source Text-to-Speech for Developers
Chinese Kinship Calculator: Instantly Decode Family Relationship Terms
Mevzuat MCP: Search Turkish Legislation Directly in Claude
Teable: The Self-Hosted, PostgreSQL-Based Airtable Alternative
Memvid: Store Millions of Text Chunks in a Single MP4 File
II-Agent Review: An Open-Source LLM Assistant Built for Autonomous Tasks
SuperCoder: A Terminal-Based Coding Assistant for Searching, Editing, and Debugging
How to Build a Meeting Prep Agent with Tavily and Google Calendar
ChatTTS: A Text-to-Speech Model Optimized for Dialogue