PDFs Don't Have One Meaning: Measuring Semantic Drift Across 24,824 Files



Security Research: Immutable Bytes, Mutable Meaning

A PDF isn't one document. It is a rendering program over an object graph, and each parser, renderer, signature validator, or AI pipeline decides for itself what the file says. This is the single entry point to one argument, built in layers and measured across 24,824 real PDFs in three separate corpora.

Document AI assumes semantic determinism. The PDF format never provided it.

Ground truth, retrieval correctness, reproducible evaluation, hallucination control: every one quietly assumes a document means the same thing to every reader. For PDFs that assumption is false, and now that machines do the reading it is a measurable failure, not a curiosity. You don't have to care about PDFs to care about that.

Immutable bytes do not guarantee immutable meaning.

Imagine an LLM trained on a document no human has ever seen. Nothing was hidden and nothing was hacked: the PDF simply held several valid machine-readable realities, and the ingestion pipeline picked one and enshrined it as ground truth. That is not a thought experiment; it is what the studies below measure.

PDF was engineered to guarantee one thing: visual fidelity — that a page looks the same on every screen and printer. It never promised semantic determinism — that every system reading the file extracts the same meaning from it. For thirty years that gap was invisible, because humans read the pixels and the assumption “one parser, one truth” was never tested.

Machines changed the stakes. RAG knowledge bases, LLM training corpora, compliance pipelines and legal-discovery systems now read the object graph, not the page, and the gap becomes a security and correctness problem: forms signed while their value and their appearance disagree, knowledge bases that silently enshrine the wrong reality, evidence that reads one way to a person and another to a tool. This program names that property, measures how often it occurs, and shows what it enables.

Why this surfaced now

[Figure: timeline, "AI ingestion at scale". 1993 – 2022: humans read the page; the gap is invisible. 2023 →: machines read the structure; the gap becomes a risk.]

The flaw isn't new; the readers are. For thirty years a human read the rendered page and the gap between pixel and payload never mattered. Now machines read the structure at scale, and a latent property of the format becomes an active security and correctness problem.

For thirty years we assumed a document had one meaning, because a human read the page. Now machines read the structure — and many PDFs never had a single meaning at all.

One file, as many documents as there are readers

[Figure: one PDF (same bytes, one hash, valid signature) handed to five readers. Human reader: the page as drawn. Text extraction: remapped characters. AI / RAG ingestion: text no human sees. Signature validator: reports VALID ✓. Accessibility tree: a different order.]

The same bytes, the same hash, the same valid signature, handed to five systems that each return a different document. None is malfunctioning; the PDF format guarantees they'll agree on how the page looks, never on what it says.
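As a rough illustration of that picture, the sketch below hashes a file once and then scores how far the readers' outputs diverge. Everything in it is an assumption made for illustration: the reader names, the sample strings, the placeholder bytes, and the use of `difflib` similarity as a drift score. It is not the metric used in the studies, only one simple way to make "same hash, different documents" concrete.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import combinations


def fingerprint(pdf_bytes: bytes) -> str:
    """SHA-256 of the raw file: identical no matter which reader opens it."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def pairwise_disagreement(views):
    """Return 1 - similarity for every pair of reader outputs (0.0 means identical)."""
    scores = {}
    for a, b in combinations(sorted(views), 2):
        similarity = SequenceMatcher(None, views[a], views[b]).ratio()
        scores[(a, b)] = 1.0 - similarity
    return scores


# Hypothetical data: placeholder bytes and made-up reader outputs for one file.
# In practice each string would come from a real renderer, extractor, or tag-tree walker.
pdf_bytes = b"%PDF-1.7 placeholder bytes, not a real file"
views = {
    "human_page": "Total due: 9",
    "text_extraction": "Total due: 1",
    "accessibility_tree": "9 Total due:",
}

file_hash = fingerprint(pdf_bytes)          # one hash for the one file
scores = pairwise_disagreement(views)       # one score per pair of readers
drifting = [pair for pair, score in scores.items() if score > 0.0]

print(file_hash[:16], "...")
for pair in drifting:
    print(pair, "disagree")
```

The hash is computed over the bytes alone, so it cannot register any of the disagreement that the second function reports.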

See the gap in one file

The claim sounds abstract until you watch a single file say two things at once. None of the files below is corrupt. None trips a malware alert. Each is simply a different document depending on who — or what — is reading it.

Exhibit A: The character that isn't the glyph

What the page shows a human: 9 (rendered on screen and in print)

What text extraction returns: 1 (copied, indexed, fed to the model)

The font's ToUnicode map points the glyph...
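A minimal sketch of the mechanism behind Exhibit A, under stated assumptions: the renderer resolves a character code to a glyph through the font's encoding, while text extraction resolves the same code through the ToUnicode CMap, and nothing forces the two to agree. The font data below is hypothetical and hand-written for illustration (it is not taken from any file in the study), and the parser handles only the `beginbfchar` form of a CMap, not ranges.

```python
import re

# Hypothetical font data, invented for this example.
# Renderer's view: character code -> glyph name, as an /Encoding /Differences array would give.
GLYPH_FOR_CODE = {0x01: "nine"}

# Extractor's view: a ToUnicode CMap fragment mapping the same code to U+0031 ("1").
TO_UNICODE_CMAP = """
1 beginbfchar
<01> <0031>
endbfchar
"""

# What each glyph looks like once drawn (illustrative stand-in for the glyph outlines).
GLYPH_NAME_TO_CHAR = {"nine": "9", "one": "1"}


def parse_bfchar(cmap):
    """Parse 'beginbfchar ... endbfchar' blocks into {character code: Unicode string}."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap, flags=re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values in a ToUnicode CMap are UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def rendered_text(codes):
    """What a human sees: each code drawn as the glyph the font selects for it."""
    return "".join(GLYPH_NAME_TO_CHAR[GLYPH_FOR_CODE[c]] for c in codes)


def extracted_text(codes, cmap):
    """What copy, search, and ingestion see: each code looked up in ToUnicode."""
    table = parse_bfchar(cmap)
    return "".join(table.get(c, "\ufffd") for c in codes)


codes_in_content_stream = [0x01]  # the single character code the page draws
print("rendered: ", rendered_text(codes_in_content_stream))
print("extracted:", extracted_text(codes_in_content_stream, TO_UNICODE_CMAP))
```

Both lookups are valid readings of the same bytes; the sketch only shows that they consult different tables.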
