LiteParse: Fast Local PDF Parsing with OCR and Bounding Boxes
TL;DR
- LiteParse is a local-first document parser for PDFs, Office files, and images.
- It extracts text with layout, bounding boxes, OCR, and page screenshots.
- It runs entirely on your machine with no cloud dependency, no LLMs, and no API keys.
- You can use it from the CLI, TypeScript, or Python.
- It is a good fit for RAG pipelines, coding agents, and local document workflows.
If you want clean text and structure without sending files to a cloud parser, LiteParse is worth a look.
What LiteParse Is
LiteParse is the open-source, local parser from LlamaIndex. The docs describe it as fast local PDF parsing with spatial text parsing, OCR, and bounding boxes.
That is the core value:
- preserve page layout
- keep bounding boxes for downstream processing
- work offline
- stay lightweight
- avoid cloud calls for routine parsing
What It Does Well
LiteParse is built for document workflows where plain text extraction is not enough.
Spatial text parsing
It keeps text tied to its position on the page, which matters when you care about tables, columns, headers, or visual grouping.
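To make the idea concrete, here is a minimal sketch of how positioned text can be regrouped into reading order. The `TextItem` shape and its field names are assumptions for illustration, not LiteParse's documented output schema, and the grouping heuristic is a generic one rather than what LiteParse does internally.

```typescript
// Hypothetical shape for one positioned text fragment. The field names are
// assumptions for illustration, not LiteParse's documented JSON schema.
interface TextItem {
  text: string;
  x: number; // left edge
  y: number; // top edge (origin at top-left, y grows downward)
  width: number;
  height: number;
}

// Group items into visual lines. An item joins the current line when its
// vertical center is within half the height of the shorter of the two items,
// measured against the first item of that line.
function groupIntoLines(items: TextItem[]): TextItem[][] {
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const lines: TextItem[][] = [];
  for (const item of sorted) {
    const center = item.y + item.height / 2;
    const last = lines[lines.length - 1];
    if (last) {
      const ref = last[0];
      const refCenter = ref.y + ref.height / 2;
      const tolerance = Math.min(ref.height, item.height) / 2;
      if (Math.abs(center - refCenter) <= tolerance) {
        last.push(item);
        continue;
      }
    }
    lines.push([item]);
  }
  // Within each line, read left to right.
  return lines.map((line) => line.sort((a, b) => a.x - b.x));
}

// Join each line's fragments with spaces and separate lines with newlines.
function linesToText(lines: TextItem[][]): string {
  return lines.map((line) => line.map((i) => i.text).join(" ")).join("\n");
}

const sample: TextItem[] = [
  { text: "42.00", x: 200, y: 101, width: 30, height: 10 },
  { text: "Invoice", x: 10, y: 20, width: 60, height: 12 },
  { text: "Total", x: 10, y: 100, width: 40, height: 10 },
];
console.log(linesToText(groupIntoLines(sample)));
// Invoice
// Total 42.00
```

The label and the amount end up on the same line even though their y values differ by a point, which is the kind of grouping that plain text extraction tends to lose.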
OCR support
It can OCR scanned documents with built-in Tesseract.js, or you can plug in your own OCR server.
Structured output
LiteParse can return:
- plain text
- JSON
- bounding boxes
- page screenshots
Multi-format support
Beyond PDF, the docs call out support for:
- DOCX
- XLSX
- PPTX
- PNG
- JPG
- and other formats via automatic conversion
Local execution
Everything runs on your machine. That makes it a better fit than cloud parsers when privacy, latency, or offline work matters.
Why It Stands Out
The big differentiator is not just that LiteParse is open source.
It is that LiteParse is intentionally narrow and fast:
- it is not trying to be a full document intelligence platform
- it is not bundling proprietary LLM features
- it is not forcing a cloud workflow
- it is optimized for quick parsing and downstream use
That makes it a good default when you need a parser that is easy to embed into tools, scripts, or agent workflows.
Quick Start
The docs keep installation simple.
Global CLI install
npm i -g @llamaindex/liteparse
On macOS and Linux, you can also install via Homebrew:
brew tap run-llama/liteparse
brew install llamaindex-liteparse
Parse a document
lit parse document.pdf
lit parse document.pdf -o output.txt
lit parse document.pdf --format json -o output.json
lit parse document.pdf --target-pages "1-5,10,15-20"
Batch parse a directory
lit batch-parse ./pdfs ./outputs
Generate screenshots
lit screenshot document.pdf -o ./screenshots
Library Use
LiteParse is also available as a library.
TypeScript
npm install @llamaindex/liteparse
import { LiteParse } from "@llamaindex/liteparse";
// Turn on OCR so scanned pages are read as well
const parser = new LiteParse({ ocrEnabled: true });
// Parse the file and print the extracted text
const result = await parser.parse("document.pdf");
console.log(result.text);
Python
Python support is covered in the official library usage guide. The docs position LiteParse as usable from TypeScript, Python, or the CLI depending on your stack.
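On the TypeScript side, parsing several files in one go can be sketched with a small wrapper. The sketch relies only on what the example in the TypeScript section shows: a `parse(path)` call that resolves to an object with a `text` field. The `TextParser` interface, the `ParseOutcome` type, and `parseAll` are hypothetical names introduced here, and whether a `LiteParse` instance fits this interface as-is is an assumption to verify against the library's types.

```typescript
// Minimal interface covering only the call shown in the TypeScript example:
// parse(path) resolves to an object with a `text` field. Nothing else about
// the library's API is assumed here.
interface TextParser {
  parse(path: string): Promise<{ text: string }>;
}

// Hypothetical result record: either the extracted text or an error message.
type ParseOutcome =
  | { path: string; ok: true; text: string }
  | { path: string; ok: false; error: string };

// Parse files one at a time so a failure on one file does not abort the rest.
async function parseAll(
  parser: TextParser,
  paths: string[],
): Promise<ParseOutcome[]> {
  const outcomes: ParseOutcome[] = [];
  for (const path of paths) {
    try {
      const result = await parser.parse(path);
      outcomes.push({ path, ok: true, text: result.text });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ path, ok: false, error: message });
    }
  }
  return outcomes;
}

// Possible usage, assuming a LiteParse instance satisfies TextParser:
//   const outcomes = await parseAll(new LiteParse({ ocrEnabled: true }), ["a.pdf", "b.pdf"]);
```

Processing files sequentially keeps memory use predictable. For whole directories, the `lit batch-parse` command is the more direct route.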
Best Use Cases
LiteParse makes the most sense if you:
- build RAG pipelines
- work with local or sensitive documents
- need OCR without shipping files to a cloud service
- want screenshots for agent workflows
- prefer CLI or library integration over a web UI
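For the RAG case, one common pattern is to split parsed text into chunks while keeping the page number so answers can cite their source. The sketch below assumes per-page text is already available in a hypothetical `PageText` shape. That shape, the `Chunk` type, and `chunkPages` are illustrative names, not part of LiteParse.

```typescript
// Hypothetical per-page input and chunk output; the field names are assumptions.
interface PageText {
  page: number;
  text: string;
}

interface Chunk {
  page: number; // page the chunk came from, kept for citations
  index: number; // position of the chunk within its page
  text: string;
}

// Greedily pack whitespace-separated words into chunks of at most maxChars.
// Chunks never cross page boundaries. A single word longer than maxChars is
// kept whole as its own oversized chunk. Whitespace is normalized to single
// spaces, so page layout is not preserved in the chunk text.
function chunkPages(pages: PageText[], maxChars: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (const { page, text } of pages) {
    const words = text.split(/\s+/).filter((w) => w.length > 0);
    let current = "";
    let index = 0;
    for (const word of words) {
      const candidate = current ? current + " " + word : word;
      if (candidate.length > maxChars && current) {
        chunks.push({ page, index: index++, text: current });
        current = word;
      } else {
        current = candidate;
      }
    }
    if (current) chunks.push({ page, index, text: current });
  }
  return chunks;
}

const chunks = chunkPages(
  [
    { page: 1, text: "alpha beta gamma" },
    { page: 2, text: "delta" },
  ],
  10,
);
console.log(chunks.map((c) => `p${c.page}#${c.index}: ${c.text}`).join("\n"));
// p1#0: alpha beta
// p1#1: gamma
// p2#0: delta
```

Keeping chunks within a single page is a deliberate simplification: it makes citations trivial at the cost of occasionally splitting a paragraph that runs across a page break.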
Tradeoffs
LiteParse is intentionally not the heavy-duty cloud answer for every document problem.
If you are dealing with:
- dense tables
- multi-column layouts
- charts
- handwriting
- messy scanned PDFs at scale
the project’s docs point out that LlamaParse, LlamaIndex’s cloud parser, may be the better fit.
Final Take
LiteParse is a strong example of a tool doing one thing well: fast local parsing with enough structure to be useful downstream.
If you want a parser that feels practical instead of platform-heavy, this is a good one to keep around.