Markdown Comes to LiteParse

A few weeks ago, we launched LiteParse 2.0 as the fastest tool for converting PDFs to text. Since then, two questions have kept coming up: Where are the benchmarks? Does it output markdown?

LiteParse v2.1 answers both by delivering the fastest open-source, model-free PDF-to-markdown pipeline. We measured its performance on three standard benchmarks and achieved the top overall score among model-free approaches on all three: opendataloader-bench at 0.875, olmOCR-bench at 0.391, and ParseBench at 0.3279.

Visit the demo site (running in-browser with WASM) or install the latest version today!

$ pip install liteparse
$ lit parse doc.pdf --format markdown

from liteparse import LiteParse

lp = LiteParse(output_format="markdown")
result = lp.parse("doc.pdf")
print(result.text)

How Does it Work?

Building a heuristic pipeline for markdown boils down to two parts: the signals you can detect, and the types of output elements that listen to those signals. Much like a machine-learning model, it comes down to inputs, weights, and activations!
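To make the inputs/weights/activations analogy concrete, here is a minimal sketch of how a rule-based "is this a heading?" decision could be scored. Everything in it is an assumption for illustration: the `Span` type, the signal names, the weights, and the threshold are hypothetical and are not taken from LiteParse's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Span:
    """A hypothetical run of text with the font data a PDF exposes."""
    text: str
    font_size: float
    bold: bool


# Hand-picked, illustrative weights (assumed, not LiteParse's real values).
HEADING_WEIGHTS = {"large_font": 0.6, "bold": 0.3, "short_line": 0.2}
HEADING_THRESHOLD = 0.7


def heading_signals(span: Span, body_size: float) -> dict[str, bool]:
    """Inputs: boolean signals detected for a single span."""
    return {
        "large_font": span.font_size >= body_size * 1.2,
        "bold": span.bold,
        "short_line": len(span.text.split()) <= 8,
    }


def heading_score(span: Span, body_size: float) -> float:
    """Weights: sum the weight of every signal that fired."""
    signals = heading_signals(span, body_size)
    return sum(HEADING_WEIGHTS[name] for name, fired in signals.items() if fired)


def is_heading(span: Span, body_size: float) -> bool:
    """Activation: the element 'listens' by thresholding the score."""
    return heading_score(span, body_size) >= HEADING_THRESHOLD


if __name__ == "__main__":
    title = Span("Benchmark Results", font_size=18.0, bold=True)
    label = Span("Note:", font_size=10.0, bold=True)
    print(is_heading(title, body_size=10.0))
    print(is_heading(label, body_size=10.0))
```

In this sketch a bold, short label at body size does not clear the threshold on its own; it takes a larger font (or a different weighting) to tip the decision.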

PDFs carry a ton of data: font family, font size, text location, and more. All of these are treated as input signals for classifying text into specific markdown elements like paragraphs, tables, lists, and headers.

LiteParse uses a custom PDFium fork to capture as much signal as possible, then combines it with signals from our existing grid-projection algorithm to deliver the best markdown output we can with a purely heuristic, rule-based approach.

As time goes on, we expect this mode to get even better. There’s an extremely long tail of PDFs out there, and we can keep adapting to it over time.
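As one illustration of font size acting as an input signal, the sketch below treats the most common font size as body text and maps the larger sizes to heading levels. This is a common heuristic and only an assumption about how such a rule might look; the function names and the `(text, font_size)` span shape are hypothetical, not LiteParse's API.

```python
from __future__ import annotations

from collections import Counter


def body_font_size(sizes: list[float]) -> float:
    """Assume the most frequent font size belongs to body text."""
    return Counter(sizes).most_common(1)[0][0]


def heading_levels(sizes: list[float], max_level: int = 3) -> dict[float, int]:
    """Map each distinct size above body size to a level (largest = 1)."""
    body = body_font_size(sizes)
    larger = sorted({s for s in sizes if s > body}, reverse=True)
    # Only the largest `max_level` sizes get a level; any other size
    # falls through and is emitted as plain text.
    return {size: level for level, size in enumerate(larger[:max_level], start=1)}


def to_markdown(spans: list[tuple[str, float]]) -> str:
    """Render (text, font_size) spans as headings or plain paragraphs."""
    levels = heading_levels([size for _, size in spans])
    lines = []
    for text, size in spans:
        level = levels.get(size)
        lines.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(lines)


if __name__ == "__main__":
    spans = [
        ("Title", 24.0),
        ("Intro text.", 10.0),
        ("Section", 16.0),
        ("More text.", 10.0),
        ("Even more.", 10.0),
    ]
    print(to_markdown(spans))
```

A real pipeline would combine this with other signals (font family, position, list markers) rather than rely on size alone, which is exactly why the long tail of PDFs is hard.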
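The post does not describe the grid-projection algorithm itself, so the following is only a guess at the general idea: snap each positioned text item onto a fixed character grid so that column alignment survives in plain text. The function name, the `(x, y, text)` item shape, and the cell dimensions are all assumptions made for this sketch.

```python
from __future__ import annotations


def project_to_grid(
    items: list[tuple[float, float, str]],
    cell_width: float = 6.0,
    line_height: float = 12.0,
) -> str:
    """Place (x, y, text) items on a character grid; y grows downward."""
    rows: dict[int, list[tuple[int, str]]] = {}
    for x, y, text in items:
        row = round(y / line_height)   # which text line the item lands on
        col = round(x / cell_width)    # which character column it starts at
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for col, text in sorted(rows[row]):
            if line and col <= len(line):
                # Items would collide: keep a single separating space.
                line += " "
            else:
                # Pad with spaces so the item starts at its grid column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    items = [
        (0.0, 0.0, "Name"), (60.0, 0.0, "Score"),
        (0.0, 12.0, "A"), (60.0, 12.0, "0.9"),
    ]
    print(project_to_grid(items))
```

Once text sits on a grid like this, repeated column positions across rows become a usable signal for spotting table-like regions.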

Measuring Markdown Performance

It turns out that markdown is not only a highly requested output option; it is also very hard to benchmark PDF parsing tools without it.

All existing benchmarks (ParseBench, olmOCR-bench, opendataloader-bench) are strongly geared toward measuring markdown. By building this markdown pipeline, we delivered an entirely new output mode and gained a way to measure and improve our overall extraction quality.

In the spirit of “Lite”-ness, we built the markdown mode in LiteParse to be as light and fast as possible. This approach prioritizes speed, but it also means accepting an upper bound on accuracy (we aren’t going to do better than LlamaParse with this approach).

To compare fairly, we scoped our comparisons to open-source tools that do not leverage larger AI models for parsing. This means OCR and other model integrations are disabled when benchmarking.

Benchmark Results

ParseBench

We’ve written a lot about ParseBench already: 2000+ documents measured across 5 key metrics that end-users actually care about. These are intentionally hard documents, so for tools running without larger AI models, these scores are actually quite impressive.

LiteParse leads Overall. The Charts and Visual Grounding columns are effectively noise for every model-free tool here. ParseBench scores charts (and parts of its layout/visual-grounding metrics) by comparing structured data extracted from the chart, which fundamentally requires an ML model to recover. A heuristic engine has nothing to emit there, so all model-free tools cluster near zero. We're reporting those columns for completeness...
