OCR / Document AI¶
Extracting text from images. In 2026 this subtask is bifurcating fast: classical OCR still wins on clean printed pages at scale, but VLMs have eaten most of the messy real-world work.
The field has three distinct use cases with different picks. Don't pick one tool and use it for all three.
Recommended picks¶
| Use case | Pick | When to use |
|---|---|---|
| Document parsing SOTA (open) | PaddleOCR-VL-1.5 (Baidu, 0.9B VLM) | 94.5% on OmniDocBench v1.5. Handles skew, warp, illumination, screen photos. 111 languages. |
| Messy real-world (handwriting, weird contexts) | A flagship VLM (Gemini 2.5, GPT-5, Claude 4) | When you want max accuracy and can afford per-page cost. Often more robust than traditional OCR on unstructured inputs. |
| Clean printed documents at scale | PP-OCRv5 (lightweight PaddleOCR) | Free, 1st on avg 1-edit-distance across OmniDocBench, beats GOT-OCR-0.5B / Qwen2.5-VL-72B at a fraction of size. |
| English-only + local + fast | Tesseract 5 | Zero-dependency batch processing of clean text. |
| Structured document understanding | Donut or LayoutLMv3 | Form key-value extraction with a defined schema. Training-required. |
[!WARNING] LayoutLMv3 is CC-BY-NC-SA-4.0 — non-commercial only. Microsoft released the weights under a non-commercial share-alike license. If you need LayoutLM-style structured extraction in a commercial product, use Donut (MIT), Pix2Struct (Apache-2.0), or a VLM with function calling instead. PaddleOCR and its VL variants are Apache-2.0 and commercial-safe.
The big shift (2024 → 2026)¶
Until 2023, OCR meant Tesseract/PaddleOCR for clean, or a dedicated pipeline (Donut, LayoutLM) for forms. Post-processing the model's output was where the engineering time went.
2024: vision-language models (GPT-4V, Gemini, Claude, Qwen-VL) became competitive with — and on messy inputs, often more robust than — classical OCR.
2025–2026: the two worlds merged. The current SOTA on OmniDocBench v1.5 (94.5%) is PaddleOCR-VL-1.5, a 0.9B VLM from Baidu — smaller than the classical pipelines, faster than flagship VLMs, and trained specifically for document parsing. For the first time there's a single open model that wins on both accuracy and deployability for documents.
Where each tier still wins: - PaddleOCR-VL-1.5 — the new default for document parsing (forms, receipts, PDFs, multilingual, 111 languages). - Flagship VLMs (Gemini 2.5, GPT-5, Claude 4) — for inputs with scene-understanding requirements beyond OCR (reason about what the document means, not just what text it contains). - PP-OCRv5 — scale, cost, and latency: ranks #1 on 1-edit distance on OmniDocBench and beats bigger VLMs at a fraction of the compute. - Tesseract 5 — offline, zero-network, English-only Latin script.
Why PaddleOCR-VL-1.5 is the default¶
Baidu's 2026 document-understanding model. 0.9B parameters (small for a VLM), trained specifically on document tasks. 111 languages. Handles PP-DocLayoutV3-flagged "tough scenarios": skew, warping, scanning artifacts, illumination variation, photos-of-screens. Adds Seal Recognition and Text Spotting.
Ships as a Hugging Face checkpoint: PaddlePaddle/PaddleOCR-VL-1.5. Integrates with vLLM for production serving.
When to pick something else¶
- You need offline, zero-network, low footprint → Tesseract 5. Still the champ for "just works, no deps, English-Latin-scripts."
- You need a specific key-value schema extracted (invoice number, amount, date) → Donut (document understanding transformer) or LayoutLMv3. Requires fine-tuning on labeled documents.
- Math / formulas → pix2tex or Nougat for academic papers.
- License plates, specific domain → YOLO + a small classifier is often better than a general OCR.
- Browser / JS-side → Tesseract.js. Slow but runs client-side.
- Indian scripts specifically → PaddleOCR has reasonable Indic support; EasyOCR covers Devanagari decently; Bhashini (India's govt NLP platform) has dedicated models for regional scripts.
The three questions to narrow¶
- Is the document clean or messy? Clean → PaddleOCR/Tesseract. Messy → VLM.
- Do you need a schema extracted, or just raw text? Raw → PaddleOCR. Schema → Donut/LayoutLM/VLM.
- Are you at scale (>100k pages/month)? Yes → classical OCR (cost). No → VLM is fine.
The Dump¶
Classical OCR¶
- Tesseract 5 (Google / open source) — the grandfather of open-source OCR. Still viable for clean Latin-script text. Minimal deps.
- PaddleOCR (Baidu) — multilingual. PP-OCRv3, v4, v5 lineage.
- PP-OCRv5 (Baidu, 2025) — lightweight; ranks 1st on OmniDocBench by 1-edit-distance. Beats GOT-OCR-0.5B, RolmOCR-7B, Qwen2.5-VL-72B, InternVL3-78B, Gemini 2.5 Pro at a fraction of size.
- EasyOCR — Python-friendly wrapper around PyTorch OCR models. Easier to start with than PaddleOCR but accuracy is slightly behind.
- MMOCR (OpenMMLab) — research-friendly OCR framework. Use if you're training custom models.
- TrOCR (Microsoft) — transformer-based, strong on handwritten.
Document-AI VLMs (the 2025–2026 wave)¶
- PaddleOCR-VL-1.5 (Baidu, 2025–2026) — 0.9B VLM, 94.5% OmniDocBench v1.5 SOTA. Current open document-parsing leader.
- GOT-OCR 2.0 (0.5B) — compact document VLM. Solid baseline.
- DeepSeek-OCR 2 — 2026 entrant, competitive on OmniDocBench.
- GLM-OCR — Zhipu's OCR VLM, competitive in Chinese + multilingual.
- RolmOCR (7B) — larger VLM OCR entrant.
Commercial APIs¶
- Google Cloud Vision Text Detection — the reference commercial OCR. Strong on most inputs.
- AWS Textract — specializes in forms + tables. Strong on clean documents.
- Azure Document Intelligence (formerly Form Recognizer) — closest competitor to Textract.
- Mindee — dev-friendly document API, good key-value extraction.
Document understanding (OCR + layout)¶
- Donut (Naver, 2022) — no separate OCR step, directly trains image → structured output. Fine-tune on your docs.
- LayoutLMv3 (Microsoft, 2022) — pre-trained on document layouts; strong for form understanding. License: CC-BY-NC-SA-4.0 — non-commercial only.
- Pix2Struct (Google, 2023) — same family, different architecture.
- Nougat (Meta, 2023) — academic papers specifically.
- Marker (VikParuchuri) — PDF → markdown, uses a stack of OCR + layout models.
VLM-based OCR (modern messy default)¶
- GPT-4o / GPT-4-turbo-vision — OpenAI. Strong on messy inputs. Expensive.
- Gemini 2.5 Flash / Pro — Google. Competitive with GPT-4, often cheaper.
- Claude 3.5/4 Sonnet — Anthropic. Strong, strict about factuality.
- Qwen2-VL-72B — strongest open VLM at this scale. Self-hostable.
- InternVL2 — another open VLM competitor.
Specialized¶
- pix2tex / LaTeX-OCR — math equations to LaTeX.
- paddledetection + custom CRNN — train your own if you have labeled data in a specific domain.
- CRAFT — text detection only, strong at finding text regions.
Graveyard¶
- OCR as a pure-CNN pipeline — retired as transformers + end-to-end training became mainstream around 2021.
- Tesseract 3/4 as a default — use Tesseract 5 (LSTM engine) or something modern.
- Proprietary ABBYY as the default open comparison — no longer necessary; open options caught up.
Last reviewed¶
2026-04-22.