OCR'ing 100k pages with open-source VLMs on Modal

We wanted to answer a simple question: what does it actually take to OCR a large document corpus using open-source vision-language models?

We picked a workload of 100,000 pages. That number was mostly arbitrary, but it was large enough to be interesting, and large enough that we had to figure out how to keep the run from becoming financially ruinous.

tl;dr - three things that surprised us:

Self-hosting open-source OCR is cheaper and far less painful than you might think. Our full 100k-page run with dots.ocr-1.5 finished in 56.5 minutes for $223 — about $2.27 per 1,000 pages. Staging weights, renting GPUs, and standing up the serving stack took hours, not days; Modal did most of the heavy lifting, and we never touched a CUDA driver or a container registry.

The GPU that is cheapest per second can also be the slowest to finish, which may make it unacceptable at scale. An L4 can land in the same dollars-per-page band as an H100 while taking 5–6× longer to drain the queue. Your SLA matters as much as the spreadsheet.

Benchmark your workload, not the leaderboard. Two public-ranking favorites lost once we measured them on the shape of our data.

This post is for engineers, technical founders, and anyone who has mostly lived on hosted APIs from the big model providers and wants a concrete sense of what self-hosting actually looks like in practice.

We also built a side-by-side viewer for comparing every model's OCR output against the rendered original.

Why open-source models?

Control, cost, and quality that is good enough for the vast majority of document workflows. We focused on VLMs (rather than text-only OCR) because real documents are messy: handwritten forms, scanned PDFs, awkward multi-column layouts, and text baked into images all break naive text extraction.

Running your own model turns a fixed price sheet into a set of knobs. The weights, serving engine, batch strategy, GPU type, and quantization are all yours to tune. You stop paying a per-token markup and start paying for GPU-seconds, which flips the economics once volume is non-trivial.
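To make the "flips the economics" point concrete, here is a minimal break-even sketch. It assumes a simple linear cost model that is ours, not something measured in this post: a hosted API charges a flat price per page, while self-hosting carries a one-off setup cost plus a GPU cost per page. The names `api_price_per_page`, `gpu_cost_per_page`, and `fixed_setup_cost`, and every number below, are hypothetical placeholders.

```python
import math


def break_even_pages(api_price_per_page: float,
                     gpu_cost_per_page: float,
                     fixed_setup_cost: float) -> float:
    """Return the page volume above which self-hosting is cheaper.

    Hosted cost:      api_price_per_page * pages
    Self-hosted cost: fixed_setup_cost + gpu_cost_per_page * pages
    """
    saving_per_page = api_price_per_page - gpu_cost_per_page
    if saving_per_page <= 0:
        # Self-hosting costs at least as much per page, so it never catches up.
        return math.inf
    return fixed_setup_cost / saving_per_page


# Hypothetical inputs, for illustration only (not real prices).
pages = break_even_pages(api_price_per_page=0.01,
                         gpu_cost_per_page=0.002,
                         fixed_setup_cost=400.0)
print(f"break-even at roughly {pages:,.0f} pages")
```

Below the break-even volume the hosted API wins on cost alone; above it, each extra page widens the gap in favor of self-hosting. Real setups also carry engineering time and idle-GPU costs, which this sketch folds into the single fixed term.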

There’s a strategic angle too, and it’s the one that’s easy to underweight until it bites you. If a hosted API is the foundation of your product, you don’t own the cost curve, the model lifecycle, or the deployment path. Prices move, default models change underneath you, and the exact model you tuned around can be deprecated on the vendor’s schedule, not yours. Self-hosting trades convenience for control over the variables that matter in production.

The three axes that matter

Everything below comes back to three measurements:

Throughput — pages per second per GPU. Sets wall-clock time and how many GPUs you need.

Cost — dollars per page. Sets whether the project is viable at scale.

Quality — does the model actually read the page right? A page OCR’d cheaply is worthless if it silently garbles the text. We treat this qualitatively (more on why in Model Selection) and ship a viewer so you can check our work.
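The first two measurements are linked by simple arithmetic, sketched below. The `GpuProfile` class, the two GPU names, and all prices and throughputs are hypothetical placeholders chosen for illustration; they are not the figures we measured. The sketch shows how two GPUs can share the same cost per page while differing several-fold in wall-clock time.

```python
import math
from dataclasses import dataclass


@dataclass
class GpuProfile:
    name: str
    price_per_hour: float    # USD per GPU-hour (hypothetical)
    pages_per_second: float  # throughput per GPU (hypothetical)

    def cost_per_page(self) -> float:
        # dollars per page = (dollars per second) / (pages per second)
        return (self.price_per_hour / 3600.0) / self.pages_per_second

    def drain_minutes(self, total_pages: int, num_gpus: int) -> float:
        # Wall-clock minutes to process the whole queue on num_gpus GPUs,
        # assuming throughput scales linearly with GPU count.
        return total_pages / (self.pages_per_second * num_gpus) / 60.0

    def gpus_for_deadline(self, total_pages: int, deadline_minutes: float) -> int:
        # Smallest GPU count that finishes within the deadline.
        pages_per_gpu = self.pages_per_second * deadline_minutes * 60.0
        return math.ceil(total_pages / pages_per_gpu)


# Placeholder numbers only: a cheap, slow GPU and a pricey, fast one.
slow = GpuProfile("slow-gpu", price_per_hour=0.80, pages_per_second=0.5)
fast = GpuProfile("fast-gpu", price_per_hour=4.00, pages_per_second=2.5)

for gpu in (slow, fast):
    print(
        gpu.name,
        f"${gpu.cost_per_page() * 1000:.2f} per 1,000 pages,",
        f"{gpu.drain_minutes(100_000, num_gpus=8):.0f} min on 8 GPUs,",
        f"{gpu.gpus_for_deadline(100_000, deadline_minutes=60)} GPUs for a 1-hour deadline",
    )
```

With these made-up inputs both profiles cost the same per page, yet the slow one needs about five times as long on the same GPU count, or about five times as many GPUs to hit the same deadline. That is why cost per page on its own does not settle the choice of hardware.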

We’ll take them roughly in the order they mattered to us: which model, then how fast and how cheap, then how close to correct, and finally how all of that stacks up against the proprietary APIs.

Who we are

We’re the founders of Redspring, an AI and product-development consultancy. We build applied ML systems, including large-scale document and data processing, so this was equal parts client-relevant and an excuse to satisfy our own curiosity.

Overview

Model and engine selection

Modal basics

Benchmark methodology

Results

What we learned

Conclusion

Model Selection

We picked models based largely on how they performed on OCR-oriented public benchmarks, including olmOCR-bench, OCRBench v2, and the OCR-related leaderboards collected by LLM-Stats (OmniDocBench 1.5, OCRBench). Because these benchmarks cover different tasks and not every model appears on every leaderboard, we treated them as directional rather than definitive.

Here’s the rough ranking we used to decide what to test:

| Composite rank | Model | Benchmark evidence used |
| --- | --- | --- |
| 1 | datalab-to/chandra-ocr-2 | olmOCR-bench |
| 2 | rednote-dots-ocr-community/dots.ocr-1.5 | olmOCR-bench |
| 3 | NVIDIA Nemotron Nano V2 VL | OCRBench v2 |
| 4 | Qwen3.5-122B-A10B | LLM-Stats OCRBench |
| 5 | Qwen3.5-35B-A3B | LLM-Stats OCRBench |

This is not meant to be a universal ranking. It ignores cost, latency, serving complexity, and the specifics of the workload. It was just a sensible starting point.

Why we didn’t compute a single accuracy number

The obvious move here is to report CER/WER against ground truth and crown a winner. We deliberately didn’t, for two reasons. First, that work already exists and is done well — the public benchmarks above are the right place for a rigorous, apples-to-apples score, and we’d only be reproducing them worse. Second, a single aggregate number hides exactly the failures that matter in production. A model can post an excellent average CER while occasionally and confidently rewriting a clause, and a quiet hallucination...
