PDF Forensics Scanner — 47 Engines, AI Report | PQ PDF
NIST Listed
New · Official Recognition · May 2026
PQ PDF Forensic Scanner is now listed in the NIST Forensic Tool Catalog
One of only 2 tools in the U.S. government’s authoritative Open Source Intelligence forensics registry — alongside CAINE.
PDF Forensics Scanner
Full document forensics, not just malware. Findings are graded across four axes:
Exploit: dynamic behavioural sandbox, YARA, ClamAV, CVE patterns, polyglot and embedded-executable detection, JS AST deobfuscation.
Document-integrity tampering: signature forensics, shadow documents, DocMDP/FieldMDP, trailer-chain and XRef integrity.
Content integrity / semantic determinism: value/appearance (V/AP) divergence, font glyph remapping, OCR text-layer poisoning, /Alt and /ActualText AI-prompt injection, and reading-order ambiguity. These are cases where a file shows one thing to a human and another to a parser or LLM.
Neutral structure: including PDF 2.0 / ISO 32000-2 constructs such as Associated Files (/AF), unencrypted-wrapper / encrypted-payload detection, the document-part hierarchy (/DPartRoot), and tagged-PDF namespaces.
The grading is backed by LightGBM + SHAP ML anomaly detection, six-parser differential comparison, fully offline threat intelligence (URLhaus · MalwareBazaar · ThreatFox · FeodoTracker · OpenPhish; 6.4M+ indicators, zero external API calls), and TLSH campaign attribution. The scanner runs 47 independent forensic passes with MITRE ATT&CK mapping and offers a 24-tab browser that includes the 🤖 AI Forensic Report (Qwen 2.5 1.5B, self-hosted, zero third-party AI) and a raw forensics view.
Full scanner description, methodology & comparisons →
The research: Semantic Nondeterminism, proven across 24,824 PDFs →
Case study: all 16,971 DOJ Epstein PDFs →
25,426 PDFs put through the scanner to date
Multi-Modal Forensic Architecture
47 orthogonal passes · structural · behavioural · XFA FormCalc · action graph · OCG cloaking · invisible text · trailer chain · codec exploits · entropy topology · image stego · compliance fraud · JS emulation · font CharString · XRef integrity · provenance · probabilistic · comparative · semantic · threat intelligence
Structure Validator
Inspects PDF header position, counts %%EOF markers (exploit PDFs often carry multiple), and audits cross-reference table depth, linearisation flags, and excessive filter chains used for obfuscation. Also detects linearised first-page object overrides: incremental updates that redefine an existing Page 1 object (same object ID) to inject JavaScript or actions. Renderers that fast-path Page 1 via the linearisation hint table never re-evaluate the override on initial render, so the injected content stays invisible until the page is refreshed. File-level polyglot detection checks whether a recognised format magic signature (JPEG FF D8 FF, ZIP PK\x03\x04, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes before the %PDF- header. ISO 32000 §7.5.2 NOTE 1 permits arbitrary pre-header bytes for legitimate reasons (e.g., a PostScript DSC prefix), but a recognised format signature at byte 0 is characteristic of deliberate polyglot construction meant to bypass format-based email security gateways that classify files by their first bytes. PDF 2.0 (ISO 32000-2) structures are also recorded: the /DPartRoot document-part hierarchy (§14.12, PDF/VT) and tagged-PDF /Namespaces (§14.7.4). Both are neutral structural features; the latter is part of the accessibility/semantic layer that reality-drift attacks target.
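The header, %%EOF, and polyglot checks can be illustrated with a minimal Python sketch. It is not the scanner's actual code: the function name `inspect_header`, the signature table, and the result keys are assumptions made for illustration, and the real signature list may differ.

```python
# Hypothetical signature table: magic bytes of formats named in the text.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"PK\x03\x04": "ZIP",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\x1f\x8b": "Gzip",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE",
    b"RIFF": "RIFF",
}


def inspect_header(data: bytes) -> dict:
    """Report header offset, %%EOF count, and any foreign magic at byte 0."""
    header_offset = data.find(b"%PDF-")  # -1 if no header is present
    polyglot_format = None
    if header_offset > 0:
        # Bytes precede the header: check whether they open with a known magic.
        for magic, name in MAGIC_SIGNATURES.items():
            if data.startswith(magic):
                polyglot_format = name
                break
    return {
        "header_offset": header_offset,
        "eof_markers": data.count(b"%%EOF"),
        "polyglot_format": polyglot_format,
    }


sample = b"\xff\xd8\xff\xe0junk%PDF-1.4\n%%EOF\n%%EOF\n"
print(inspect_header(sample))
```

A non-zero header offset alone is not treated as suspicious here, since pre-header bytes can be legitimate; only a recognised signature at byte 0 sets `polyglot_format`.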
Pattern Scanner
Matches 45+ byte-level signatures: /JavaScript, /Launch, /OpenAction, /EmbeddedFile, /JBIG2Decode, /XFA, /RichMedia, NOP sleds (%u9090, %u4141), heap-spray fills, and dangerous JS APIs: eval(), unescape(), collab.getIcon(), util.printf().
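A byte-level signature pass of this kind might look like the sketch below. The signature lists contain only the examples named above, not the full set, and the `#xx` name-escape normalisation is an added assumption (PDF names may hex-escape characters, so `/J#61vaScript` is equivalent to `/JavaScript`). Function names are hypothetical.

```python
import re

# Only the examples named in the text; the real scanner has 45+ signatures.
NAME_SIGNATURES = [
    b"/JavaScript", b"/Launch", b"/OpenAction", b"/EmbeddedFile",
    b"/JBIG2Decode", b"/XFA", b"/RichMedia",
]
CASE_INSENSITIVE_SIGNATURES = [
    b"eval(", b"unescape(", b"collab.getIcon(", b"util.printf(",
    b"%u9090", b"%u4141",
]

_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name_escapes(data: bytes) -> bytes:
    """Crudely undo #xx escapes everywhere (a real parser would limit this to names)."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


def scan_patterns(data: bytes) -> dict[str, int]:
    """Count occurrences of each signature in the normalised bytes."""
    normalised = decode_name_escapes(data)
    lowered = normalised.lower()
    hits: dict[str, int] = {}
    for sig in NAME_SIGNATURES:  # PDF names are case-sensitive
        count = normalised.count(sig)
        if count:
            hits[sig.decode()] = count
    for sig in CASE_INSENSITIVE_SIGNATURES:
        count = lowered.count(sig.lower())
        if count:
            hits[sig.decode()] = count
    return hits


print(scan_patterns(b"<< /S /J#61vaScript /JS (eval(unescape('%u9090%u9090'))) >>"))
```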
Stream Inspector
Decompresses every FlateDecode stream via PyMuPDF and re-scans the raw content, catching JavaScript and shellcode hidden inside compressed objects that raw-byte scanners miss entirely. Calculates Shannon entropy per stream; values above 7.2 bits per byte flag encrypted or packed payloads.
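The entropy check can be sketched as follows. This is an illustration under stated assumptions: it uses the standard-library `zlib` rather than PyMuPDF so that it stays self-contained, it assumes entropy is measured on the decompressed content, and the names `shannon_entropy` and `inspect_stream` are hypothetical. Only the 7.2 threshold comes from the text.

```python
import math
import zlib
from collections import Counter

ENTROPY_THRESHOLD = 7.2  # bits per byte, as stated in the text


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def inspect_stream(raw: bytes) -> dict:
    """Inflate a Flate (zlib) stream if possible, then measure its entropy."""
    try:
        content = zlib.decompress(raw)
    except zlib.error:
        content = raw  # not valid zlib data: measure the bytes as stored
    entropy = shannon_entropy(content)
    return {
        "size": len(content),
        "entropy": entropy,
        "flagged": entropy > ENTROPY_THRESHOLD,
    }


script = zlib.compress(b"app.alert(1);" * 50)
print(inspect_stream(script)["flagged"])
```

Repetitive script text sits well below the threshold once inflated, whereas content that is still encrypted or packed after decompression stays close to 8 bits per byte.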
Object Analyzer
Walks the full cross-reference object graph, resolving indirect references and checking every object dictionary for dangerous action-type combinations (/S /Launch, /S /JavaScript, /RichMedia, /XFA). Reports exact xref numbers of suspicious objects.
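As a much-simplified stand-in for the object walk, the sketch below scans the bodies of uncompressed `N G obj ... endobj` objects for the action-type combinations listed above and reports the object numbers. It does not resolve indirect references or look inside object streams, which the scanner described here does; the function name and regular expressions are assumptions for illustration.

```python
import re

# Matches "<num> <gen> obj ... endobj" for objects stored uncompressed.
_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
_ACTION = re.compile(rb"/S\s*/(Launch|JavaScript)\b")
_FEATURE = re.compile(rb"/(RichMedia|XFA)\b")


def find_suspicious_objects(data: bytes) -> list[tuple[int, str]]:
    """Return (object number, finding) pairs for dangerous dictionary entries."""
    findings = []
    for obj in _OBJECT.finditer(data):
        number = int(obj.group(1))
        body = obj.group(3)
        for action in _ACTION.finditer(body):
            findings.append((number, "/S /" + action.group(1).decode()))
        for feature in _FEATURE.finditer(body):
            findings.append((number, "/" + feature.group(1).decode()))
    return findings


pdf = (
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"3 0 obj\n<< /S/Launch /F (cmd.exe) >>\nendobj\n"
)
print(find_suspicious_objects(pdf))
```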
URL Extractor
Extracts all HTTP/HTTPS URLs from raw bytes and decompressed streams, de-duplicates, and lists them so you can assess every domain the PDF attempts to contact — phoning home, tracking pixels, and C2 beaconing.
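URL extraction with de-duplication can be sketched in a few lines. The regular expression and the function names are assumptions; the character class stops at whitespace, brackets, and quotes because PDF URIs usually sit inside parenthesised literal strings.

```python
import re
from urllib.parse import urlsplit

_URL = re.compile(rb"https?://[^\s<>()\[\]\"']+", re.IGNORECASE)


def extract_urls(*blobs: bytes) -> list[str]:
    """Collect unique URLs from raw bytes and decompressed streams, in first-seen order."""
    seen: dict[str, None] = {}
    for blob in blobs:
        for match in _URL.finditer(blob):
            url = match.group(0).decode("latin-1").rstrip(".,;")
            seen.setdefault(url, None)
    return list(seen)


def contacted_domains(urls: list[str]) -> list[str]:
    """Reduce URLs to a sorted list of distinct hostnames."""
    hosts = {urlsplit(url).hostname for url in urls}
    return sorted(host for host in hosts if host)


raw = b"<< /URI (http://evil.example/a?b=1) >> /URI (https://cdn.example.org/x.png)"
stream = b"fetch http://evil.example/a?b=1 now"
urls = extract_urls(raw, stream)
print(urls, contacted_domains(urls))
```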
Metadata Analyzer
Inspects Producer and Creator fields for known exploit-tool strings (Metasploit, Canvas, Core Impact), flags missing metadata — a hallmark of crafted exploits — and scans XMP streams for embedded script references.
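The Producer/Creator checks can be illustrated with a small sketch. It assumes the metadata has already been read into a plain dictionary; the function name, the finding strings, and the simple substring matching are hypothetical, and the XMP scan is omitted.

```python
# Tool names taken from the text; matching is a plain case-insensitive substring test.
EXPLOIT_TOOL_STRINGS = ("metasploit", "canvas", "core impact")


def analyze_metadata(info: dict) -> list[str]:
    """Flag missing Producer/Creator fields and known exploit-tool strings."""
    findings = []
    for field in ("Producer", "Creator"):
        value = (info.get(field) or "").strip()
        if not value:
            findings.append(f"{field} missing")
            continue
        lowered = value.lower()
        for tool in EXPLOIT_TOOL_STRINGS:
            if tool in lowered:
                findings.append(f"{field} mentions {tool}")
    return findings


print(analyze_metadata({"Producer": "Metasploit Framework", "Creator": ""}))
```

Substring matching on a short word such as "canvas" can produce false positives, so a real implementation would likely use more specific patterns.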
Font Analyzer
Checks every font object for /JBIG2Decode usage — the codec exploited in CVE-2009-0658 and CVE-2010-0188 — and for abnormally large /Widths arrays used in historic heap-overflow...