DeepSeek-OCR: High-Speed Visual Text Compression That Actually Works

Published October 23 in OCR Tools

DeepSeek-OCR is an open-source model designed for visual text compression. It manages OCR tasks across a diverse range of resolutions—from small thumbnails to high-detail dynamic scans—without compromising throughput. The system processes batches of PDFs or images to generate clean Markdown, structured chart data, detailed descriptions, or precise text coordinates.

The model supports both vLLM and Hugging Face Transformers. For applications where speed is critical, vLLM is the preferred choice; on an A100-40G, PDF processing averages roughly 2,500 tokens per second.

Performance Benchmarks

The Fox benchmark measures the average number of visual tokens required per image. While models like InternVL3-78B and OCRFlux-3B often consume 1,500 tokens or more for a single image, DeepSeek-OCR is significantly more efficient. The entire lineup—Tiny, Small, Base, Large, and Gundam—stays under 1,000 tokens, with some variants consuming far fewer. Lower token counts result in reduced compute requirements and faster processing.

In the OmniDocBench benchmark, the model maintains high accuracy even as text density increases. Across a range of 600 to 1,300 tokens per page, the various DeepSeek-OCR versions consistently deliver strong recognition rates without performance degradation.

Installation

Begin by configuring the environment. Requirements include CUDA 11.8 and PyTorch 2.6.0.

git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

Download the vLLM wheel file (vllm-0.8.5) before installing the remaining dependencies.

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation

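This setup lets the vLLM and Transformers code run in the same environment. The upstream README notes that pip may report a conflict between vllm 0.8.5+cu118 and the installed transformers version during this step, and that the message can be ignored.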
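A quick check after installation can catch a mismatched wheel before the first inference run. The sketch below compares installed package versions against the pins from the commands above, using the standard library's importlib.metadata. The names EXPECTED, base_version, and check_versions are illustrative helpers, not part of the DeepSeek-OCR repository.

```python
from importlib import metadata

# Version pins copied from the install commands above.
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu118'."""
    return version.split("+", 1)[0]


def check_versions(expected=EXPECTED, lookup=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = base_version(lookup(name))
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in check_versions().items():
        print(f"{name}: expected {want}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected base version. The check does not confirm that CUDA itself is usable.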

Inference Options

To use vLLM, update the INPUT_PATH and OUTPUT_PATH in DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py, then run the appropriate script:

  • Images: python run_dpsk_ocr_image.py (streaming output)
  • PDFs: python run_dpsk_ocr_pdf.py
  • Batch Evaluations: python run_dpsk_ocr_eval_batch.py
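When inputs are mixed, a small dispatcher can choose between the image and PDF scripts. The following is a minimal sketch: the script names come from the list above, while the function pick_script and the extension sets are assumptions made for illustration.

```python
from pathlib import Path

# Script names come from the list above; the extension sets are assumptions.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
PDF_EXTS = {".pdf"}


def pick_script(input_path: str) -> str:
    """Return the vLLM entry script that matches the input file type."""
    ext = Path(input_path).suffix.lower()
    if ext in PDF_EXTS:
        return "run_dpsk_ocr_pdf.py"
    if ext in IMAGE_EXTS:
        return "run_dpsk_ocr_image.py"
    raise ValueError(f"unsupported input type: {ext or 'no extension'}")
```

The helper only selects a script name. INPUT_PATH and OUTPUT_PATH still have to be set in config.py before the chosen script is launched.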

To use the Transformers library, integrate the following logic into your Python script:

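from transformers import AutoModel, AutoTokenizer
import torch
import os

# Restrict the process to GPU 0; set this before any CUDA call is made.
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

model_name = 'deepseek-ai/DeepSeek-OCR'
# trust_remote_code=True loads the custom model code published with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
# Switch to inference mode, move the model to the GPU, and cast weights to bfloat16.
model = model.eval().cuda().to(torch.bfloat16)

# Markdown conversion prompt; other prompts are listed in the Prompt Guide.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

# base_size=1024 with image_size=640 and crop_mode=True corresponds to the Gundam mode.
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)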

Alternatively, navigate to DeepSeek-OCR-master/DeepSeek-OCR-hf and execute python run_dpsk_ocr.py.
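To process a folder of images with the Transformers path, the single call above can be wrapped in a loop. This is a sketch under assumptions: ocr_directory is a hypothetical helper, it only picks up *.jpg files, and it stores whatever model.infer returns without interpreting it, since the return value is not documented here.

```python
from pathlib import Path


def ocr_directory(model, tokenizer, image_dir, output_root,
                  prompt="<image>\n<|grounding|>Convert the document to markdown."):
    """Call model.infer once per .jpg file, giving each image its own output folder."""
    results = {}
    for image_path in sorted(Path(image_dir).glob("*.jpg")):
        out_dir = Path(output_root) / image_path.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        # Same keyword arguments as the single-image example above.
        results[image_path.name] = model.infer(
            tokenizer,
            prompt=prompt,
            image_file=str(image_path),
            output_path=str(out_dir),
            base_size=1024,
            image_size=640,
            crop_mode=True,
            save_results=True,
            test_compress=True,
        )
    return results
```

Giving each image a separate output folder avoids one result overwriting another, which is a design choice of this sketch rather than a requirement of the model.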

Resolution Modes

  • Fixed Sizes: Tiny (512×512, 64 tokens), Small (640×640, 100 tokens), Base (1024×1024, 256 tokens), Large (1280×1280, 400 tokens).
  • Dynamic: Gundam scales based on content using an n×640×640 + 1×1024×1024 resolution scheme.
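The fixed-mode numbers above follow a simple pattern: each token count equals (side length / 64) squared. The sketch below encodes that observation and uses it to estimate a Gundam budget. The Gundam estimate is an assumption: it supposes each 640×640 tile costs as many tokens as Small mode and the 1024×1024 view as many as Base mode, which the list above does not state explicitly. The function names are illustrative.

```python
# Fixed modes as listed above: name -> (side length in pixels, visual tokens).
FIXED_MODES = {
    "Tiny": (512, 64),
    "Small": (640, 100),
    "Base": (1024, 256),
    "Large": (1280, 400),
}


def tokens_for_side(side: int) -> int:
    """Pattern observed in the listed modes: tokens == (side / 64) ** 2."""
    return (side // 64) ** 2


def gundam_tokens(n_tiles: int) -> int:
    """Assumed Gundam budget: n tiles at 640x640 plus one 1024x1024 view."""
    if n_tiles < 0:
        raise ValueError("n_tiles must be non-negative")
    return n_tiles * tokens_for_side(640) + tokens_for_side(1024)
```

Under these assumptions, a page split into four tiles would use 4 × 100 + 256 = 656 visual tokens.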

Prompt Guide

  • Markdown Conversion: <image>\n<|grounding|>Convert the document to markdown.
  • Standard OCR: <image>\n<|grounding|>OCR this image.
  • Raw OCR (No Layout Parsing): <image>\nFree OCR.
  • Data Extraction: <image>\nParse the figure.
  • Scene Analysis: <image>\nDescribe this image in detail.
  • Text Localization: <image>\nLocate <|ref|>xxxx<|/ref|> in the image.
  • Multilingual Support: <image>\n先天下之忧而忧 (Chinese text support)
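Keeping the prompts in one place makes it harder to mistype a special token. The sketch below copies the prompt strings from the guide above; the dictionary keys and the locate_prompt helper are illustrative names, not part of the model's API.

```python
# Prompt strings copied from the guide above; the dictionary keys are illustrative.
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr": "<image>\n<|grounding|>OCR this image.",
    "free_ocr": "<image>\nFree OCR.",
    "figure": "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
}


def locate_prompt(target: str) -> str:
    """Build the text-localization prompt for the given target string."""
    if not target.strip():
        raise ValueError("target text must not be empty")
    return f"<image>\nLocate <|ref|>{target}<|/ref|> in the image."
```

For example, locate_prompt("Net weight") fills the xxxx placeholder from the guide with the text to find.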

Practical Applications

DeepSeek-OCR excels in real-world scenarios. It can convert an 8th-grade geometry proof into clean, editable text or transform a dense economic chart with Eurozone projections into a sorted table, preserving all titles and sources. Bilingual educational materials are parsed with both text and context intact. Consumer product labels, such as a jar of doubanjiang, yield specific details like "Net weight: 500g." Even street scenes are handled with precision; the model can read text on a small sticker while simultaneously describing the surrounding environment and vehicles.

For literary analysis, the Standard OCR prompt (<image>\n<|grounding|>OCR this image.) extracts lines from classical Chinese poetry, such as Jiang Jin Jiu, accurately and without disrupting the traditional verse structure. By minimizing token waste while delivering high-fidelity results, DeepSeek-OCR provides a highly efficient path from raw imagery to structured data.