DeepSeek AI has unveiled DeepSeek-OCR, a new approach to compressing long contexts via optical 2D mapping. The system shows that vision-based compression can handle text-heavy documents efficiently, with real implications for how large language models (LLMs) process long textual inputs.
The DeepSeek-OCR system consists of two primary components: the DeepEncoder and a DeepSeek3B-MoE-A570M decoder. Together they achieve roughly 97% OCR precision when the compression ratio stays below 10× (that is, about 10 text tokens represented by each vision token). Even at an aggressive 20× compression ratio, the system still retains approximately 60% accuracy.
The core innovation of DeepSeek-OCR is its ability to compress textual information dramatically while maintaining high accuracy.
These results demonstrate that compact language models can effectively decode compressed visual representations, suggesting that larger LLMs could readily acquire similar capabilities through appropriate pretraining design.
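The reported ratios can be made concrete with a little arithmetic. The helper below is purely illustrative (it is not part of DeepSeek-OCR's code); it estimates the vision-token budget for a document of a given text-token length:

```python
# Sketch: estimating vision-token budgets under optical compression.
# The 10x/20x ratios and accuracy figures come from the reported results;
# this helper is illustrative, not DeepSeek-OCR's actual code.

def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Vision tokens required to represent `text_tokens` at a given ratio."""
    return max(1, round(text_tokens / compression_ratio))

# A 5,000-token document at the ~10x ratio (~97% OCR precision):
print(vision_tokens_needed(5000, 10))   # 500
# The same document at the aggressive 20x ratio (~60% accuracy):
print(vision_tokens_needed(5000, 20))   # 250
```

The trade-off is explicit: halving the token budget (10× to 20×) costs roughly a third of the accuracy.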
DeepEncoder is a novel architecture that keeps activation memory low and emits only a small number of vision tokens, even for high-resolution inputs.
On the OmniDocBench benchmark, DeepSeek-OCR is remarkably efficient: it outperforms GOT-OCR2.0 (which uses 256 tokens per page) with only 100 vision tokens, and surpasses MinerU2.0 (which averages over 6,000 tokens per page) while using fewer than 800 vision tokens.
DeepSeek-OCR also holds up in real-world use: in production it can generate training data for LLMs and VLMs at a rate of more than 200,000 pages per day on a single A100-40G GPU.
Current open-source vision-language models (VLMs) employ three main types of vision encoders, each with distinct advantages and limitations.
DeepEncoder addresses these limitations by combining the best aspects of each approach while minimizing their drawbacks, achieving a balance between memory efficiency, token count, and processing capability.
DeepEncoder is designed to support multiple resolutions efficiently, enabling it to process documents of varying sizes and complexities without sacrificing performance or requiring excessive computational resources.
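To see why resolution support matters for the token budget, consider a back-of-envelope sketch. Assuming a 16-pixel patch embedding followed by a 16× convolutional token compressor (the downsampling scheme reported for DeepEncoder; both numbers are assumptions here, not taken from this article), the vision-token count scales with input resolution as follows:

```python
# Sketch: how vision-token count scales with input resolution,
# assuming 16-pixel patches and a 16x token compressor.
# These parameters are assumptions for illustration, not this
# article's stated configuration.

def vision_token_count(height: int, width: int,
                       patch: int = 16, compressor_ratio: int = 16) -> int:
    patch_tokens = (height // patch) * (width // patch)
    return patch_tokens // compressor_ratio

for side in (640, 1024, 1280):
    print(side, vision_token_count(side, side))
# 640  -> 100
# 1024 -> 256
# 1280 -> 400
```

Because the compressor shrinks the token stream before the expensive global-attention stage, higher resolutions raise the token count only modestly instead of quadratically blowing up activation memory.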
The decoder component utilizes DeepSeek3B-MoE-A570M, a mixture-of-experts architecture that provides efficient inference while maintaining high accuracy. This design enables the model to specialize in different aspects of OCR tasks while sharing knowledge across experts.
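The mixture-of-experts idea behind the decoder can be sketched with standard top-k routing. The dimensions and expert count below are illustrative stand-ins, not the actual DeepSeek3B-MoE-A570M configuration:

```python
import numpy as np

# Sketch: top-k expert routing, the core mechanism of a
# mixture-of-experts layer. All shapes are illustrative.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

x = rng.standard_normal(d_model)                      # one token's hidden state
router = rng.standard_normal((d_model, n_experts))    # router projection
# Each expert simplified to a single weight matrix:
experts = rng.standard_normal((n_experts, d_model, d_model))

logits = x @ router
chosen = np.argsort(logits)[-top_k:]                  # activate only the top-k experts
weights = np.exp(logits[chosen])
weights /= weights.sum()                              # softmax over the chosen experts

# Output is the weighted sum of the selected experts' transforms;
# the unselected experts' parameters are never touched for this token.
y = sum(w * (x @ experts[e]) for w, e in zip(weights, chosen))
print(y.shape)  # (8,)
```

This is why a 3B-parameter MoE model can run with only ~570M active parameters per token: each token pays for the router plus its chosen experts, while the remaining capacity lets different experts specialize.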