Markdown Comes to LiteParse
A few weeks ago, we launched LiteParse 2.0 as the fastest tool for converting PDFs to text. However, a few questions kept coming up again and again: Where are the benchmarks? Does it output markdown?
LiteParse v2.1 answers both by delivering the fastest open-source, model-free PDF-to-markdown pipeline. We measured it on three standard benchmarks and achieved the top overall score among model-free approaches on each: opendataloader-bench at 0.875, olmOCR-bench at 0.391, and ParseBench at 0.3279.
Visit the demo site (running in-browser with WASM) or install the latest version today!
$ pip install liteparse
$ lit parse doc.pdf --format markdown

from liteparse import LiteParse

lp = LiteParse(output_format="markdown")
result = lp.parse("doc.pdf")
print(result.text)
How Does it Work?
Building a heuristic pipeline for markdown boils down to two parts: the signals you can detect, and the types of output elements that listen to those signals. As with any machine-learning model, it comes down to inputs, weights, and activations!
PDFs carry a ton of data: font family, font size, text location, and more. All of it is treated as input signals for classifying text into specific markdown elements like paragraphs, tables, lists, and headers.
LiteParse uses a custom PDFium fork to capture as much signal as possible, then combines it with signals from our existing grid-projection algorithm to produce the best markdown output we can with a purely heuristic, rule-based approach.
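To make the "signals in, markdown elements out" idea concrete, here is a minimal sketch of a rule-based classifier. It is not LiteParse's implementation: the `Span` shape, the thresholds (1.6 and 1.2), and the bullet prefixes are all assumptions chosen for illustration, and it only looks at font size, boldness, and leading characters.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """One line of text plus layout signals (hypothetical shape, not a LiteParse type)."""
    text: str
    font_size: float
    bold: bool = False


# Assumed bullet markers; real PDFs use many more glyphs.
BULLET_PREFIXES = ("• ", "- ", "* ")


def body_font_size(spans):
    """Estimate the body size as the font size that covers the most characters."""
    weight = Counter()
    for span in spans:
        weight[span.font_size] += len(span.text)
    return weight.most_common(1)[0][0]


def to_markdown(spans):
    """Map each span to a markdown element using simple, hand-picked rules."""
    body = body_font_size(spans)
    lines = []
    for span in spans:
        text = span.text.strip()
        if not text:
            continue
        ratio = span.font_size / body
        if text[:2] in BULLET_PREFIXES:
            lines.append("- " + text[2:].strip())   # list item
        elif ratio >= 1.6:
            lines.append("# " + text)               # much larger than body: title
        elif ratio >= 1.2 or (span.bold and len(text) < 60):
            lines.append("## " + text)              # larger or short bold line: section
        else:
            lines.append(text)                      # everything else: paragraph
    return "\n\n".join(lines)


if __name__ == "__main__":
    page = [
        Span("Quarterly Report", 24.0),
        Span("Overview", 14.0, bold=True),
        Span("Revenue grew in every region this quarter.", 10.0),
        Span("• North America", 10.0),
        Span("• Europe", 10.0),
    ]
    print(to_markdown(page))
```

Each rule here plays the role of a hand-set weight: the signal (font-size ratio, bold flag, bullet prefix) is the input, and the threshold decides which element "activates". A real pipeline would need far more signals, such as position and spacing, to handle tables and multi-column layouts.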
As time goes on, we expect this mode to get even better. There’s an extremely long tail of PDFs, and we can adapt the heuristics to more of them over time.
Measuring Markdown Performance
It turns out that markdown is not only a highly requested output option; it's also very hard to benchmark PDF parsing tools without it.
All existing benchmarks (ParseBench, olmOCR-bench, opendataloader-bench) are built around scoring markdown output. By building this markdown pipeline, we delivered an entirely new output mode and gained a way to measure and improve our overall extraction quality.
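As a rough illustration of why markdown output matters for measurement, the sketch below compares a parser's output with a reference markdown string. This is not the metric used by ParseBench, olmOCR-bench, or opendataloader-bench; it is an assumed stand-in built on Python's `difflib`, and the helper names (`normalize`, `similarity`, `headings`) are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(markdown: str) -> str:
    """Collapse runs of whitespace so layout-only differences do not affect the score."""
    return re.sub(r"\s+", " ", markdown).strip()


def similarity(predicted: str, reference: str) -> float:
    """Return a 0..1 character-level similarity between two markdown strings."""
    matcher = SequenceMatcher(None, normalize(predicted), normalize(reference), autojunk=False)
    return matcher.ratio()


def headings(markdown: str) -> list[str]:
    """Extract ATX heading lines such as '# Title', keeping their level markers."""
    return [m.group(0).strip() for m in re.finditer(r"^#{1,6} .+$", markdown, re.MULTILINE)]


if __name__ == "__main__":
    reference = "# Title\n\nBody text.\n\n## Details\n\nMore text."
    plain_text = "Title\n\nBody text.\n\nDetails\n\nMore text."

    # The words match, but the plain-text output has no structure left to score.
    print(similarity(plain_text, reference))
    print(headings(reference))
    print(headings(plain_text))
```

Plain-text output can still score well on raw text similarity, but it exposes no headings, lists, or tables for a structure-aware metric to check, which is the gap a markdown mode closes.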
In the spirit of “Lite”-ness, we built the markdown mode in LiteParse to be as light and fast as possible. This approach prioritizes speed, but it also accepts an upper bound on accuracy (we aren’t going to do better than LlamaParse with this approach).
To compare fairly, we scoped our comparisons to open-source tools that do not rely on larger AI models for parsing. This means OCR and other model integrations are disabled when benchmarking.
Benchmark Results
ParseBench
We’ve written a lot about ParseBench already: 2000+ documents measured across 5 key metrics that end users actually care about. These are intentionally hard documents, so for tools that don't use larger AI models, these scores are quite impressive.
LiteParse leads Overall. The Charts and Visual Grounding columns are effectively noise for every model-free tool here. ParseBench scores charts (and parts of its layout/visual-grounding metrics) by comparing structured data extracted from the chart, which fundamentally requires an ML model to recover. A heuristic engine has nothing to emit there, so all model-free tools cluster near zero. We're reporting those columns for completeness...