LiteParse: Fast Local PDF Parsing with OCR and Bounding Boxes

TL;DR

LiteParse is a local-first document parser for PDFs, Office files, and images.
It extracts text with layout, bounding boxes, OCR, and page screenshots.
It runs entirely on your machine with no cloud dependency, no LLMs, and no API keys.
You can use it from the CLI, TypeScript, or Python.
It is a good fit for RAG pipelines, coding agents, and local document workflows.

If you want clean text and structure without sending files to a cloud parser, LiteParse is worth a look.

What LiteParse Is

LiteParse is the open-source, local parser from LlamaIndex. The docs describe it as fast local PDF parsing with spatial text parsing, OCR, and bounding boxes.

That is the core value:

preserve page layout
keep bounding boxes for downstream processing
work offline
stay lightweight
avoid cloud calls for routine parsing

What It Does Well

LiteParse is built for document workflows where plain text extraction is not enough.

Spatial text parsing

It keeps text tied to its position on the page, which matters when you care about tables, columns, headers, or visual grouping.
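
To make the idea concrete, here is a minimal sketch of what position-aware text lets you do downstream: rebuilding visual lines from loose text fragments. The TextItem shape and the groupIntoLines helper are hypothetical and written only for illustration; LiteParse's actual JSON schema may use different field names, and this is not how LiteParse itself orders text.

```typescript
// Hypothetical shape for a positioned text fragment (assumption, not
// LiteParse's documented schema). Origin is the top-left of the page.
interface TextItem {
  text: string;
  x: number; // left edge, in page units
  y: number; // top edge, in page units
  width: number;
  height: number;
}

// Group fragments into visual lines. A fragment joins the current line when
// its top edge is within `tolerance` of that line's first fragment; each
// line is then read left to right.
function groupIntoLines(items: TextItem[], tolerance = 2): string[] {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: TextItem[][] = [];
  for (const item of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current[0].y - item.y) < tolerance) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }
  return lines.map((line) =>
    line
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(" ")
  );
}

// Fragments arrive out of order, as they often do in PDF content streams.
const sampleItems: TextItem[] = [
  { text: "Total", x: 50, y: 100, width: 40, height: 10 },
  { text: "Invoice", x: 50, y: 20, width: 60, height: 12 },
  { text: "$42.00", x: 300, y: 100.5, width: 45, height: 10 },
  { text: "#1001", x: 300, y: 20, width: 40, height: 12 },
];

const sampleLines = groupIntoLines(sampleItems);
// sampleLines is ["Invoice #1001", "Total $42.00"]
```

Without coordinates, the label "Total" and its amount would have no reliable link to each other. With them, a few lines of code recover the pairing.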

OCR support

It can OCR scanned documents with built-in Tesseract.js, or you can plug in your own OCR server.

Structured output

LiteParse can return:

plain text
JSON
bounding boxes
page screenshots
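
As a rough illustration of why bounding boxes are useful downstream, the sketch below picks out the text that falls inside a region of the page, such as a header strip. The Box and BoxedText types and both helper functions are assumptions made up for this example; check the LiteParse JSON output for the real field names before adapting it.

```typescript
// Hypothetical bounding-box shape (assumption): top-left origin, page units.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical pairing of a text fragment with its box.
interface BoxedText {
  text: string;
  box: Box;
}

// True when `inner` lies entirely within `outer`.
function contains(outer: Box, inner: Box): boolean {
  return (
    inner.x >= outer.x &&
    inner.y >= outer.y &&
    inner.x + inner.width <= outer.x + outer.width &&
    inner.y + inner.height <= outer.y + outer.height
  );
}

// Return the text of every fragment fully inside `region`, in input order.
function textInRegion(items: BoxedText[], region: Box): string[] {
  return items
    .filter((item) => contains(region, item.box))
    .map((item) => item.text);
}

const pageItems: BoxedText[] = [
  { text: "ACME Corp", box: { x: 40, y: 30, width: 120, height: 14 } },
  { text: "Page 1", box: { x: 500, y: 760, width: 40, height: 10 } },
  { text: "Quarterly report", box: { x: 40, y: 50, width: 160, height: 14 } },
];

// Keep only text in the top 100 units of a 612-unit-wide page.
const headerText = textInRegion(pageItems, { x: 0, y: 0, width: 612, height: 100 });
// headerText is ["ACME Corp", "Quarterly report"]
```

The same containment check works for cropping footers, isolating a form field, or mapping a retrieved chunk back to the spot on the page it came from.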

Multi-format support

The docs call out support for:

PDF
DOCX
XLSX
PPTX
PNG
JPG
and other formats via automatic conversion

Local execution

Everything runs on your machine. That makes it a better fit than cloud parsers when privacy, latency, or offline work matters.

Why It Stands Out

The big differentiator is not just that LiteParse is open source.

It is that LiteParse is intentionally narrow and fast:

it is not trying to be a full document intelligence platform
it is not bundling proprietary LLM features
it is not forcing a cloud workflow
it is optimized for quick parsing and downstream use

That makes it a good default when you need a parser that is easy to embed into tools, scripts, or agent workflows.

Quick Start

The docs keep installation simple.

Global CLI install

npm i -g @llamaindex/liteparse

On macOS and Linux, you can also install via Homebrew:

brew tap run-llama/liteparse
brew install llamaindex-liteparse

Parse a document

lit parse document.pdf
lit parse document.pdf -o output.txt
lit parse document.pdf --format json -o output.json
lit parse document.pdf --target-pages "1-5,10,15-20"
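
The page specification in the last command mixes single pages and ranges. The sketch below shows one way such a string could expand into a page list. It assumes pages are 1-based and ranges are inclusive; expandPageRanges is a hypothetical helper written for this post, not LiteParse's own parsing logic.

```typescript
// Expand a page-range string such as "1-5,10,15-20" into sorted, unique
// page numbers. Assumes 1-based pages and inclusive ranges.
// Illustrative only; this is not LiteParse's implementation.
function expandPageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    // Accept "N" or "N-M", with optional surrounding whitespace.
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

const selectedPages = expandPageRanges("1-5,10,15-20");
// selectedPages is [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```

So the example command would touch twelve pages under these assumptions, which is a quick sanity check before running a long parse on a large file.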

Batch parse a directory

lit batch-parse ./pdfs ./outputs

Generate screenshots

lit screenshot document.pdf -o ./screenshots

Library Use

LiteParse is also available as a library.

TypeScript

npm install @llamaindex/liteparse

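import { LiteParse } from "@llamaindex/liteparse";

// Turn on OCR so scanned pages also produce text.
const parser = new LiteParse({ ocrEnabled: true });

// Top-level await needs an ES module context; otherwise wrap this in an async function.
const result = await parser.parse("document.pdf");

// Print the extracted text.
console.log(result.text);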

Python

Python support is covered in the official library usage guide. The docs position LiteParse as usable from TypeScript, Python, or the CLI depending on your stack.

Best Use Cases

LiteParse makes the most sense if you:

build RAG pipelines
work with local or sensitive documents
need OCR without shipping files to a cloud service
want screenshots for agent workflows
prefer CLI or library integration over a web UI

Tradeoffs

LiteParse is deliberately not a heavy-duty cloud service, and it is not meant to solve every document problem.

If you are dealing with:

dense tables
multi-column layouts
charts
handwriting
messy scanned PDFs at scale

the project’s docs point out that LlamaParse, LlamaIndex’s cloud-based parser, may be the better fit.

Final Take

LiteParse is a strong example of a tool doing one thing well: fast local parsing with enough structure to be useful downstream.

If you want a parser that feels practical instead of platform-heavy, this is a good one to keep around.