Semantic Nondeterminism: PDF Is the Proof | PQ PDF
Security Research
Immutable Bytes, Mutable Meaning
A PDF isn't one document: it's a rendering program over an object graph, and each parser, renderer, signature validator, or AI pipeline decides for itself what it says. This is the single entry point to one argument, built in layers and measured across 24,824 real PDFs in three separate corpora.
Document AI assumes semantic determinism. The PDF format never provided it.
Ground truth, retrieval correctness, reproducible evaluation, hallucination control: every one quietly assumes a document means the same thing to every reader. For PDFs that assumption is false, and now that machines do the reading it is a measurable failure, not a curiosity. You don't have to care about PDFs to care about that.
Immutable bytes do not guarantee immutable meaning.
Imagine an LLM trained on a document no human has ever seen. Nothing was hidden and nothing was hacked: the PDF simply held several valid machine-readable realities, and the ingestion pipeline picked one and enshrined it as ground truth. That is not a thought experiment; it is what the studies below measure.
PDF was engineered to guarantee one thing: visual fidelity, meaning that a page looks the same on every screen and printer. It never promised semantic determinism, meaning that every system reading the file extracts the same meaning from it. For thirty years that gap was invisible, because humans read the pixels and the assumption "one parser, one truth" was never tested.
Machines changed the stakes. RAG knowledge bases, LLM training corpora, compliance pipelines and legal-discovery systems now read the object graph, not the page, and the gap becomes a security and correctness problem: forms signed while their value and their appearance disagree, knowledge bases that silently enshrine the wrong reality, evidence that reads one way to a person and another to a tool. This program names that property, measures how often it occurs, and shows what it enables.
Why this surfaced now
[Figure: timeline. 1993 to 2022: humans read the page; the gap is invisible. 2023 onward (AI ingestion at scale): machines read the structure; the gap becomes a risk.]
The flaw isn't new; the readers are. For thirty years a human read the rendered page and the gap between pixel and payload never mattered. Now machines read the structure at scale, and a latent property of the format becomes an active security and correctness problem.
For thirty years we assumed a document had one meaning, because a human read the page. Now machines read the structure, and many PDFs never had a single meaning at all.
One file, as many documents as there are readers
[Figure: one PDF (same bytes, one hash, valid signature) handed to five readers. Human reader: the page as drawn. Text extraction: remapped characters. AI / RAG ingestion: text no human sees. Signature validator: reports VALID. Accessibility tree: a different order.]
The same bytes, the same hash, the same valid signature, handed to five systems that each return a different document. None is malfunctioning; the PDF format guarantees they'll agree on how the page looks, never on what it says.
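The "one hash, many documents" idea can be sketched as a small disagreement check. This is a toy illustration only: the byte string below is not a real PDF, and `reader_disagreement`, `_field`, and the two reader functions are hypothetical names invented for the example. It shows the shape of the test (hash the bytes once, run every reader, count distinct answers), not how any real parser works.

```python
import hashlib
from typing import Callable, Dict, Tuple


def reader_disagreement(
    data: bytes, readers: Dict[str, Callable[[bytes], str]]
) -> Tuple[str, Dict[str, str], int]:
    """Hash the bytes once, run every reader, and count distinct outputs."""
    digest = hashlib.sha256(data).hexdigest()
    outputs = {name: read(data) for name, read in readers.items()}
    return digest, outputs, len(set(outputs.values()))


# Toy stand-in for a file that carries a drawn layer and a separate text layer.
# (Assumption: a real PDF is far more complex; this only mimics the two layers.)
toy_file = b"page=Total due 900;text=Total due 100"


def _field(data: bytes, key: str) -> str:
    """Return one named layer from the toy format above."""
    fields = dict(part.split("=", 1) for part in data.decode("ascii").split(";"))
    return fields[key]


readers = {
    "human_reader": lambda d: _field(d, "page"),      # what is drawn
    "text_extraction": lambda d: _field(d, "text"),   # what is extracted
}

digest, outputs, n_distinct = reader_disagreement(toy_file, readers)
print(digest[:16], outputs, n_distinct)
```

A single digest paired with more than one distinct output is the property in question: the bytes are fixed, the meaning is not.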
See the gap in one file
The claim sounds abstract until you watch a single file say two things at once. None of the files below is corrupt. None trips a malware alert. Each is simply a different document depending on who, or what, is reading it.
Exhibit A: The character that isn't the glyph
What the page shows a human: 9 (rendered on screen and in print)
What text extraction returns: 1 (copied, indexed, fed to the model)
The font's ToUnicode map points the glyph...
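Exhibit A can be modelled in a few lines, under heavy simplification. In a PDF the content stream refers to glyphs by character code; a renderer draws the glyph's outline, while a text extractor typically looks the code up in the font's ToUnicode map. The `ToyFont` class, its two dictionaries, and the code `0x01` are hypothetical and exist only for this sketch; no real PDF library is involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyFont:
    # Character code -> the shape a renderer would draw (stand-in for an outline).
    glyph_shapes: Dict[int, str]
    # Character code -> the Unicode text an extractor would report (ToUnicode).
    to_unicode: Dict[int, str]


def render(font: ToyFont, codes: List[int]) -> str:
    """What a human sees: the drawn shapes, in order."""
    return "".join(font.glyph_shapes[c] for c in codes)


def extract_text(font: ToyFont, codes: List[int]) -> str:
    """What a machine reads: the ToUnicode mapping, in order."""
    return "".join(font.to_unicode[c] for c in codes)


# One character code whose outline looks like "9" but whose mapping says "1".
font = ToyFont(glyph_shapes={0x01: "9"}, to_unicode={0x01: "1"})
codes = [0x01]

shown = render(font, codes)
extracted = extract_text(font, codes)
print(f"rendered: {shown!r}  extracted: {extracted!r}")
```

Neither function is wrong here; they consult different tables, and nothing in this toy model forces the tables to agree.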