DeepSeek-OCR is an open-source model designed for visual text compression. It manages OCR tasks across a diverse range of resolutions—from small thumbnails to high-detail dynamic scans—without compromising throughput. The system processes batches of PDFs or images to generate clean Markdown, structured chart data, detailed descriptions, or precise text coordinates.
The model supports both vLLM and Hugging Face Transformers. For applications where speed is critical, vLLM is the preferred choice; on an A100-40G, PDF processing averages roughly 2,500 tokens per second.
Performance Benchmarks
The Fox benchmark measures the average number of visual tokens required per image. While models like InternVL3-78B and OCRFlux-3B often consume 1,500 tokens or more for a single image, DeepSeek-OCR is significantly more efficient. The entire lineup—Tiny, Small, Base, Large, and Gundam—stays under 1,000 tokens, with some variants consuming far fewer. Lower token counts result in reduced compute requirements and faster processing.
On the OmniDocBench benchmark, the model maintains high accuracy even as text density increases. Across a range of 600 to 1,300 tokens per page, the various DeepSeek-OCR versions consistently deliver strong recognition rates without performance degradation.
Installation
Begin by configuring the environment. Requirements include CUDA 11.8 and PyTorch 2.6.0.
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr
Download the vLLM wheel file (vllm-0.8.5) before installing the remaining dependencies.
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
This setup allows vLLM and Transformers code to run side-by-side without dependency conflicts or version mismatches.
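After installation, it can help to confirm that the pinned versions actually landed in the environment. The following is a minimal sketch rather than part of the DeepSeek-OCR repository: the `EXPECTED` table and the `find_mismatches` helper are hypothetical names, and the version numbers are simply the ones pinned in the commands above.

```python
from importlib import metadata

# Versions pinned by the installation commands above (assumed to be the targets).
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Compare only the public version, ignoring local tags such as "+cu118".
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the expected version.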
Inference Options
To use vLLM, update INPUT_PATH and OUTPUT_PATH in DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py, then run the appropriate script from that directory:
python run_dpsk_ocr_image.py  # streaming output
python run_dpsk_ocr_pdf.py
python run_dpsk_ocr_eval_batch.py
To use the Transformers library, integrate the following logic into your Python script:
from transformers import AutoModel, AutoTokenizer
import torch
import os
# Restrict the process to the first GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'
# trust_remote_code=True is required because the model ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
# Switch to inference mode, move to the GPU, and cast to bfloat16.
model = model.eval().cuda().to(torch.bfloat16)
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# base_size, image_size, and crop_mode control the input resolution.
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024, image_size=640, crop_mode=True,
    save_results=True, test_compress=True,
)
Alternatively, navigate to DeepSeek-OCR-master/DeepSeek-OCR-hf and execute python run_dpsk_ocr.py.
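The Transformers call above handles one image at a time. To process a whole folder, one option is to wrap it in a loop. The sketch below is an illustration under stated assumptions, not code from the repository: `infer_directory` and `IMAGE_SUFFIXES` are hypothetical names, the set of extensions is a guess at typical inputs, and the `model.infer` arguments are copied from the example above.

```python
from pathlib import Path

# Assumed set of input extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def infer_directory(model, tokenizer, image_dir, output_root, prompt):
    """Call model.infer on every image in image_dir, one output folder per image."""
    results = {}
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        out_dir = Path(output_root) / image_path.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        # Same keyword arguments as the single-image example.
        results[image_path.name] = model.infer(
            tokenizer,
            prompt=prompt,
            image_file=str(image_path),
            output_path=str(out_dir),
            base_size=1024, image_size=640, crop_mode=True,
            save_results=True, test_compress=True,
        )
    return results
```

Writing each image's results to its own subfolder keeps saved files from different inputs from overwriting one another. For large volumes of PDFs, the vLLM scripts remain the faster route.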
Resolution Modes
Prompt Guide
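Each prompt in this section starts with the <image> placeholder, followed by a newline and an instruction. One way to keep them in a single place is a small lookup table. This is a minimal sketch: the dictionary keys and the `locate_prompt` helper are hypothetical labels chosen for illustration, while the prompt strings themselves are the ones listed in this section.

```python
# Prompt strings as listed in this guide; the keys are hypothetical labels.
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "ocr_layout": "<image>\n<|grounding|>OCR this image.",
    "free_ocr": "<image>\nFree OCR.",
    "figure": "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
}


def locate_prompt(target):
    """Build the 'locate' prompt for a piece of text or an object name."""
    return f"<image>\nLocate <|ref|>{target}<|/ref|> in the image."
```

For example, `locate_prompt("the teacher")` fills the `xxxx` slot of the locate prompt with `the teacher`. The raw prompt strings are: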
<image>\n<|grounding|>Convert the document to markdown.
<image>\n<|grounding|>OCR this image.
<image>\nFree OCR.
<image>\nParse the figure.
<image>\nDescribe this image in detail.
<image>\nLocate <|ref|>xxxx<|/ref|> in the image.
<image>\n先天下之忧而忧 (Chinese text support)
Practical Applications
DeepSeek-OCR excels in real-world scenarios. It can convert an 8th-grade geometry proof into clean, editable text or transform a dense economic chart with Eurozone projections into a sorted table, preserving all titles and sources. Bilingual educational materials are parsed with both text and context intact. Consumer product labels, such as a jar of doubanjiang, yield specific details like "Net weight: 500g." Even street scenes are handled with precision; the model can read text on a small sticker while simultaneously describing the surrounding environment and vehicles.
For literary analysis, the prompt "<image>\n<|grounding|>OCR this image." extracts lines from classical Chinese poetry, such as Jiang Jin Jiu, while preserving the traditional verse structure. By minimizing token waste while delivering high-fidelity results, DeepSeek-OCR provides an efficient path from raw imagery to structured data.