Document OCR & Parsing: Docling, dots.ocr, and Alternatives

⬅️ Back to Tools

Feeding documents into an AI pipeline should be simple, but parsing PDFs with tables, formulas, and multi-column layouts is a notorious bottleneck. This guide covers the best open-source tools, from multi-format parsers to state-of-the-art OCR models.


Docling: Multi-Format Document Parser

Docling solves the frustration of document parsing by converting diverse formats into a unified, expressive representation. Developed by IBM, it handles formats most tools can’t touch.

GitHub: github.com/DS4SD/docling · License: MIT

Key Features

Format Support: PDF, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, and images (PNG, TIFF, JPEG).

Advanced Understanding:

  • Layout analysis respecting reading order and multi-column layouts
  • Accurate table structure reconstruction
  • Formula and code recognition
  • OCR for scanned PDFs and images

Seamless Integration: LangChain, LlamaIndex, Crew AI, Haystack, MCP Server.

Get Started

pip install docling
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
docling https://arxiv.org/pdf/2206.01062

dots.ocr / dots.mocr: AI-Powered Multilingual OCR

dots.ocr is a state-of-the-art 1.7B vision-language model for document OCR across 100+ languages. It unifies layout detection and content recognition in a single model, outperforming GPT-4o and Gemini on document benchmarks. Now rebranded as dots.mocr.

GitHub: github.com/rednote-hilab/dots.ocr · Hugging Face: rednote-hilab/dots.ocr · License: MIT

Key Features

  • 100+ languages including English, Chinese, Arabic, Hindi
  • Handles text, tables, mathematical formulas, and reading order
  • 10x faster than traditional OCR tools
  • Outperforms GPT-4o, Gemini, Marker on benchmarks

How to Use

Web UI (easiest): dotsocr.net, drop a file, get markdown.

Local with vLLM (v0.11.0+ has official integration):

docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.11.0
python3 dots_mocr/parser.py demo/demo_image1.jpg
python3 dots_mocr/parser.py demo/demo_pdf1.pdf --num_thread 64

Local with Transformers:

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr && pip install -r requirements.txt
python3 dots_mocr/parser.py demo/demo_image1.jpg --use_hf true

Performance (OmniDocBench)

Metricdots.ocrvs GPT-4ovs Marker
Overall Edit Distance0.125--
Text Recognition Error0.032-60% better
Table TEDS88.6%46% better-
Reading Order Error0.040BestBest

OCR Alternatives Comparison

ToolLicenseLanguagesApproachGitHub
TesseractApache 2.0100+LSTM-basedtesseract-ocr/tesseract
SuryaGPL 3.090+TransformerVikParuchuri/surya
PaddleOCRApache 2.080+Deep LearningPaddlePaddle/PaddleOCR
EasyOCRApache 2.080+CNNJaidedAI/EasyOCR
NougatMITAcademicTransformerfacebookresearch/nougat

When to Use Which

  • Docling: Best for structured document pipelines, handles PDF, DOCX, PPTX, XLSX out of the box with Gen AI integrations
  • dots.ocr: Best OCR accuracy on complex multilingual documents, SOTA benchmarks, 100+ languages, free web UI
  • Tesseract: Mature, fast on CPU, 100+ languages, works fully offline
  • Surya: Modern architecture with layout analysis, best for challenging scripts and handwritten text
  • PaddleOCR: Strongest for Chinese, Japanese, Korean
  • EasyOCR: Easiest Python API for quick prototyping

Crepi il lupo! 🐺