AI Is Starving for PDFs

Your AI Is Starving For PDFs

Karpathy wants images in. Anthropic wants HTML out. One old format does both.

Michael Kotlikov May 17, 2026

Two recent AI infrastructure arguments point to the same problem: plain text loses information that matters.

Andrej Karpathy’s version came through DeepSeek’s OCR paper, which showed that rendering a page as an image could preserve the original with high fidelity while using far fewer tokens than raw text. Karpathy’s takeaway was not “better OCR.” It was more radical: maybe text tokens are wasteful historical baggage, and the image is a better memory substrate.

Thariq Shihipar, engineering lead for Claude Code at Anthropic, made the opposite-end argument in The Unreasonable Effectiveness of HTML. For agent outputs (plans, reports, code reviews), markdown is too thin. HTML carries layout, hierarchy, emphasis, and structure that plain text cannot.

Different ends of the pipe. Same complaint: flat text throws away meaning.

The format that solves both already exists.

The Format Is PDF

Yes, PDF. The thing your bank sends you. The thing governments, lawyers, accountants, academics, and companies have standardized on for decades. The format running global civilization while software people made fun of it.

Its advantage is not just that it supports images. It preserves the author’s arrangement. A document is not only words. The diagram next to the paragraph is information. The caption under the figure is information. The number in a table cell under a particular header is information.

Convert that document to markdown, scrape it into text, chunk it into 512-token fragments, and you destroy much of what made it intelligible. Then you hand the wreckage to a model and act surprised when it hallucinates the thing your pipeline deleted.

A PDF page is closer to source truth. The model sees what the human sees: text, layout, tables, captions, images, hierarchy. That is exactly what Karpathy wants on input and what Shihipar wants on output.

A Page Is the Natural Chunk

RAG systems spent years debating chunk sizes: 256 tokens, 512 tokens, semantic windows, overlapping windows. But the natural unit already exists. It is a page.

Humans organized knowledge into pages because a page is roughly the amount a person can absorb and reason about at once. Modern frontier models now accept PDFs natively, which means we no longer need to flatten documents into fragments and hope the structure survives.

The page is the chunk. The document is the source. The vector database should be the index, not the payload.

Keep the Vector DB. Change What It Points To.
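
A minimal sketch of an index that points at pages instead of carrying them as payload. Everything here is illustrative: `PageRef`, `PageIndex`, and the toy bag-of-words `embed` function are hypothetical names, and a real system would swap in an actual embedding model and vector database. Note that `search` returns page pointers, never the matched text.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRef:
    """Pointer from an indexed passage back to its source page (hypothetical schema)."""
    doc_id: str
    page: int  # 1-based page number


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PageIndex:
    """Stores passage vectors for search, but hands back only page pointers."""

    def __init__(self) -> None:
        self._entries: list[tuple[Counter, PageRef]] = []

    def add(self, passage: str, ref: PageRef) -> None:
        self._entries.append((embed(passage), ref))

    def search(self, query: str, k: int = 3) -> list[PageRef]:
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        seen: set[PageRef] = set()
        pages: list[PageRef] = []
        for _, ref in ranked:
            if ref not in seen:  # several passages may live on the same page
                seen.add(ref)
                pages.append(ref)
            if len(pages) == k:
                break
        return pages


if __name__ == "__main__":
    index = PageIndex()
    index.add("quarterly revenue by region table", PageRef("annual-report.pdf", 12))
    index.add("risk factors and legal disclaimers", PageRef("annual-report.pdf", 40))
    print(index.search("revenue by region", k=1))
```

The step this sketch leaves out is turning each returned `PageRef` into the actual PDF page handed to the model. That would be done with whatever PDF library the pipeline already uses.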

Semantic search is still useful. Use embeddings to find relevant passages across a corpus. But do not feed the retrieved text fragment to the model as if it were the document. Feed the page.

Each embedded chunk should carry metadata: document ID, page number, coordinates if needed. Search finds the match; the system resolves it back to the original page; the model gets the actual PDF page, untouched.

The vector DB finds the evidence. The PDF preserves the evidence.

Conversion Already Works
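
For office documents, conversion can usually be scripted. The sketch below assumes LibreOffice is installed with its `soffice` binary on the PATH; the helper names `soffice_pdf_command` and `convert_to_pdf` are hypothetical. It is one possible route, not the only one.

```python
import subprocess
from pathlib import Path


def soffice_pdf_command(source: Path, out_dir: Path) -> list[str]:
    """Build the LibreOffice headless command that exports `source` as a PDF."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path) -> Path:
    """Run the conversion (assumes `soffice` is on the PATH) and return the PDF path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(soffice_pdf_command(source, out_dir), check=True)
    # LibreOffice writes the output using the source file's base name.
    return out_dir / (source.stem + ".pdf")
```

Keeping the command builder separate from the call that runs it makes the conversion step easy to check without LibreOffice installed.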

The operational case is boring, which is why it matters. Word, PowerPoint, Excel, Pages, Google Docs, emails, screenshots, scans, web pages, and images all export or print cleanly to PDF. Every operating system knows how to make one. Every device knows how to open one.

Markdown has no comparable universal converter. Try converting a complex Word document with charts, tables, footnotes, and captions into clean markdown. Something will vanish, flatten, or land in the wrong place.

PDF is not elegant. It is universal.

The Objections Are Real, but Smaller Than They Look

PDFs can be heavier than markdown. But DeepSeek’s result points in the other direction: a rendered page can be a compressed representation of its own text, not an inflated one. And the old reason markdown won (tiny context windows and token rationing) is fading.

PDFs are not hand-editable. Fine. You edit the source document and export again. You do not edit memory; you edit source.

PDF diffs are ugly. True. So are diffs for every binary format. Version the artifact and let the model compare versions when needed.

Scanned faxes are messy. So is the real world. That is exactly why a format that preserves visual context matters.

The Point

Karpathy and Shihipar are making the same argument from opposite directions. Input should preserve the page. Output should preserve structure. Plain text is too thin for both. The principle is simple: distance from the source is the enemy. Every transformation between author and model is a place where truth leaks out. PDF minimizes that distance. It preserves layout, survives transmission, renders everywhere, and gives models the same unit humans already use:...
