Show HN: A PDF analysis tool for parser and representation differences

PDF Forensics Scanner — 47 Engines, AI Report | PQ PDF

NIST Listed

New Official Recognition · May 2026

PQ PDF Forensic Scanner is now listed in the NIST Forensic Tool Catalog

One of only 2 tools in the U.S. government’s authoritative Open Source Intelligence forensics registry — alongside CAINE.


PDF Forensics Scanner

Full document forensics, not just malware. Findings are graded across four axes:

Exploit: dynamic behavioural sandbox, YARA, ClamAV, CVE patterns, polyglot and embedded-executable detection, JS AST deobfuscation.

Document-integrity tampering: signature forensics, shadow documents, DocMDP/FieldMDP, trailer-chain and XRef integrity.

Content integrity / semantic determinism: value/appearance (V/AP) divergence, font glyph remapping, OCR text-layer poisoning, /Alt and /ActualText AI-prompt injection, and reading-order ambiguity, where a file shows one thing to a human and another to a parser or LLM.

Neutral structure: including PDF 2.0 (ISO 32000-2) constructs such as Associated Files (/AF), unencrypted-wrapper / encrypted-payload detection, the document-part hierarchy (/DPartRoot), and tagged-PDF namespaces.

These are backed by LightGBM + SHAP ML anomaly detection, six-parser differential comparison, fully offline threat intelligence (URLhaus · MalwareBazaar · ThreatFox · FeodoTracker · OpenPhish; 6.4M+ indicators, zero external API calls), and TLSH campaign attribution. The scanner runs 47 independent forensic passes with MITRE ATT&CK mapping and provides a 24-tab browser that includes the AI Forensic Report (Qwen 2.5 1.5B, self-hosted, zero third-party AI) and a raw forensics view.

Further reading: the full scanner description, methodology and comparisons; the research on Semantic Nondeterminism, proven across 24,824 PDFs; and a case study covering all 16,971 DOJ Epstein PDFs.


25,426 PDFs put through the scanner to date

Multi-Modal Forensic Architecture

47 orthogonal passes (47 engines): structural · behavioural · XFA FormCalc · action graph · OCG cloaking · invisible text · trailer chain · codec exploits · entropy topology · image stego · compliance fraud · JS emulation · font CharString · XRef integrity · provenance · probabilistic · comparative · semantic · threat intelligence

Structure Validator

Inspects PDF header position, counts %%EOF markers (exploit PDFs often carry multiple), and audits cross-reference table depth, linearization flags, and excessive filter chains used for obfuscation.

Linearized first-page object overrides: detects incremental updates that redefine an existing Page 1 object (same object ID) to inject JavaScript or actions. Renderers that fast-path Page 1 via the linearization hint table never re-evaluate the override on initial render, so the injected content stays invisible until the page is refreshed.

File-level polyglot detection: checks whether a recognised format magic signature (JPEG FF D8 FF, ZIP PK\x03\x04, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes before the %PDF- header. ISO 32000 §7.5.2 NOTE 1 permits arbitrary pre-header bytes for legitimate reasons (e.g., a PostScript DSC prefix), but a recognised format signature at byte 0 is characteristic of deliberate polyglot construction, intended to bypass format-based email security gateways that classify files by their first bytes.

PDF 2.0 (ISO 32000-2) structures: records the /DPartRoot document-part hierarchy (§14.12, PDF/VT) and tagged-PDF /Namespaces (§14.7.4). Both are neutral structural features; the latter is part of the accessibility/semantic layer that reality-drift attacks target.
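As a rough illustration of the header, %%EOF, and polyglot checks described above, here is a minimal Python sketch. It is not the scanner's code: the function name `inspect_structure`, the signature table, and the returned keys are assumptions made for this example, and it only tests for a known magic signature at byte 0 when the %PDF- header is not at the start of the file.

```python
# Hypothetical sketch; names and return shape are assumptions, not the scanner's API.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"PK\x03\x04": "ZIP",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\x1f\x8b": "Gzip",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE",
    b"RIFF": "RIFF",
}


def inspect_structure(data: bytes) -> dict:
    """Report header offset, %%EOF count, and any foreign magic at byte 0."""
    header_offset = data.find(b"%PDF-")  # -1 if no header is present
    eof_count = data.count(b"%%EOF")     # more than 1 suggests incremental updates

    polyglot_format = None
    if header_offset > 0:
        # Pre-header bytes exist; check whether they start with a known format.
        for magic, name in MAGIC_SIGNATURES.items():
            if data.startswith(magic):
                polyglot_format = name
                break

    return {
        "header_offset": header_offset,
        "eof_count": eof_count,
        "polyglot_format": polyglot_format,
    }
```

A file that begins with JPEG magic and carries two %%EOF markers would report a non-zero header offset, an EOF count of 2, and "JPEG" as the polyglot format. Multiple %%EOF markers also occur in benign, incrementally saved documents, so the count is a signal rather than a verdict.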

Pattern Scanner

Matches 45+ byte-level signatures: the names /JavaScript, /Launch, /OpenAction, /EmbeddedFile, /JBIG2Decode, /XFA, and /RichMedia; NOP sleds (%u9090, %u4141); heap-spray fills; and dangerous JS APIs such as eval(), unescape(), collab.getIcon(), and util.printf().
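A byte-level signature pass of this kind can be sketched in a few lines of Python. The token lists below are taken from the description above; the function name `scan_patterns`, the four-repeat threshold for a sled, and the case-insensitive matching of JS API names are assumptions for illustration only.

```python
import re

# Tokens named in the description; the real scanner is said to carry 45+.
SUSPICIOUS_NAMES = [
    b"/JavaScript", b"/Launch", b"/OpenAction", b"/EmbeddedFile",
    b"/JBIG2Decode", b"/XFA", b"/RichMedia",
]
SUSPICIOUS_JS_APIS = [b"eval(", b"unescape(", b"collab.getIcon(", b"util.printf("]

# Assumption: four or more consecutive %u9090 / %u4141 units count as a sled.
NOP_SLED_RE = re.compile(rb"(?:%u9090|%u4141){4,}", re.IGNORECASE)


def scan_patterns(data: bytes) -> dict[str, int]:
    """Count occurrences of each signature in the raw bytes."""
    hits: dict[str, int] = {}
    for name in SUSPICIOUS_NAMES:
        count = data.count(name)
        if count:
            hits[name.decode("ascii")] = count

    lowered = data.lower()  # JS API names are matched case-insensitively
    for api in SUSPICIOUS_JS_APIS:
        count = lowered.count(api.lower())
        if count:
            hits[api.decode("ascii")] = count

    sleds = len(NOP_SLED_RE.findall(data))
    if sleds:
        hits["nop_sled"] = sleds
    return hits
```

Plain substring matching misses PDF names written with #xx hex escapes and anything inside compressed streams, which is why a pass like this is normally paired with stream decompression.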

Stream Inspector

Decompresses every FlateDecode stream via PyMuPDF and re-scans the decoded content, catching JavaScript and shellcode hidden inside compressed objects that raw-byte scanners miss entirely. Calculates Shannon entropy per stream; values above 7.2 bits per byte flag encrypted or packed payloads.
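The entropy check can be illustrated with the standard library alone. This sketch uses `zlib` as a stand-in for PyMuPDF's stream decoding and assumes the entropy is measured on the decoded bytes; the function names and the returned dictionary are hypothetical, while the 7.2 threshold comes from the description above.

```python
import math
import zlib
from collections import Counter

ENTROPY_THRESHOLD = 7.2  # bits per byte, as stated in the description


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def inspect_flate_stream(raw: bytes) -> dict:
    """Inflate one FlateDecode stream body and score the decoded bytes."""
    decoded = zlib.decompress(raw)  # FlateDecode data is zlib-wrapped deflate
    entropy = shannon_entropy(decoded)
    return {
        "decoded_size": len(decoded),
        "entropy": entropy,
        "high_entropy": entropy > ENTROPY_THRESHOLD,
    }
```

Measuring after decompression matters because well-compressed data is already close to 8 bits per byte; only content that stays near-random once inflated points to encryption or packing.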

Object Analyzer

Walks the full cross-reference object graph, resolving indirect references and checking every object dictionary for dangerous action-type combinations (/S /Launch, /S /JavaScript, /RichMedia, /XFA). Reports exact xref numbers of suspicious objects.
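As a simplified picture of the per-object check, the sketch below scans uncompressed `N G obj ... endobj` bodies with regular expressions and reports the object number of each hit. It does not walk the cross-reference table or resolve indirect references as the description says the real pass does; the function name and the output format are assumptions.

```python
import re

# Matches "N G obj ... endobj" in plain (non-object-stream) PDF bytes.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
ACTION_RE = re.compile(rb"/S\s*/(Launch|JavaScript)\b")
FEATURE_RE = re.compile(rb"/(RichMedia|XFA)\b")


def find_suspicious_objects(data: bytes) -> list[tuple[int, str]]:
    """Return (object number, finding) pairs for risky dictionary entries."""
    findings: list[tuple[int, str]] = []
    for num, _gen, body in OBJ_RE.findall(data):
        for action in ACTION_RE.findall(body):
            findings.append((int(num), f"/S /{action.decode('ascii')}"))
        for feature in FEATURE_RE.findall(body):
            findings.append((int(num), f"/{feature.decode('ascii')}"))
    return findings
```

A regex pass like this cannot see objects stored inside object streams (/ObjStm) and can be confused by stream data that happens to contain "obj" markers, so a real implementation would parse through a PDF library instead.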

URL Extractor

Extracts all HTTP/HTTPS URLs from raw bytes and decompressed streams, de-duplicates them, and lists them so you can assess every domain the PDF may attempt to contact, whether for phoning home, tracking pixels, or C2 beaconing.
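The extract-and-de-duplicate step might look like the following sketch. The regular expression, the trailing-punctuation trimming, and the function names `extract_urls` and `domains` are illustrative assumptions; the caller is expected to pass both the raw file bytes and any decompressed stream contents.

```python
import re
from urllib.parse import urlsplit

# Stop at whitespace, quotes, and the delimiters PDF uses around strings.
URL_RE = re.compile(rb"https?://[^\s<>()\[\]\"']+", re.IGNORECASE)


def extract_urls(*blobs: bytes) -> list[str]:
    """Collect unique URLs from each byte blob, preserving first-seen order."""
    seen: dict[str, None] = {}
    for blob in blobs:
        for match in URL_RE.findall(blob):
            url = match.decode("latin-1").rstrip(".,;")
            seen.setdefault(url, None)
    return list(seen)


def domains(urls: list[str]) -> list[str]:
    """Return the sorted set of hostnames found in the URLs."""
    hosts = {urlsplit(url).hostname for url in urls}
    return sorted(host for host in hosts if host)
```

For example, `extract_urls(raw_bytes, decoded_stream)` yields each URL once even when it appears in both inputs, and `domains(...)` reduces the list to the hosts worth reviewing.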

Metadata Analyzer

Inspects Producer and Creator fields for known exploit-tool strings (Metasploit, Canvas, Core Impact), flags missing metadata — a hallmark of crafted exploits — and scans XMP streams for embedded script references.
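A minimal sketch of the Producer/Creator check follows. It only reads literal-string values from uncompressed bytes and uses the three tool names mentioned above; the function name, the regular expression, and the wording of the findings are assumptions, and XMP scanning is left out.

```python
import re

# Tool names from the description, compared in lowercase.
EXPLOIT_TOOL_STRINGS = ("metasploit", "canvas", "core impact")

# Literal-string form only, e.g. /Producer (Some Tool 1.0); no nested parens.
INFO_FIELD_RE = re.compile(rb"/(Producer|Creator)\s*\(([^)]*)\)")


def analyze_metadata(data: bytes) -> dict:
    """Flag missing or suspicious Producer and Creator fields."""
    fields = {
        name.decode("ascii"): value.decode("latin-1")
        for name, value in INFO_FIELD_RE.findall(data)
    }
    findings = []
    for name in ("Producer", "Creator"):
        value = fields.get(name)
        if value is None:
            findings.append(f"missing {name}")
        elif any(tool in value.lower() for tool in EXPLOIT_TOOL_STRINGS):
            findings.append(f"suspicious {name}: {value}")
    return {"fields": fields, "findings": findings}
```

Substring matching on a word as common as "canvas" will produce false positives, so a finding here is best treated as a prompt for manual review rather than proof of an exploit kit.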

Font Analyzer

Checks every font object for /JBIG2Decode usage — the codec exploited in CVE-2009-0658 and CVE-2010-0188 — and for abnormally large /Widths arrays used in historic heap-overflow...
