MarkSentry – zero-trust document-to-Markdown for RAG pipelines

MarkSentry -- Zero-Trust Document-to-Markdown Conversion

Open Source · Zero-Trust Input · Multi-Column PDF · PII Redaction · MIT License

Document conversion with zero-trust security

MarkItDown converts documents. MarkSentry converts them securely. Path traversal jailing, SSRF blocking, VBA macro stripping, zip-bomb detection, multi-column PDF layout reconstruction, OMML-to-LaTeX math, and PII redaction -- all local, no cloud dependencies.


Built with

Python 3.10+

pdfminer.six

python-docx

lxml

click + rich

17 Tests Passing (pytest)

10 PII Pattern Categories

0/25 Security Bypasses in Evaluation

90% Multi-Column Layout Accuracy

The problem

What MarkItDown ignores entirely

Microsoft MarkItDown converts documents to Markdown quickly -- and that is where the safety story ends. Every path it trusts, every URL it leaves embedded, every macro container it skips becomes an attack surface in RAG pipelines and document processing workflows. MarkSentry was built to close those gaps.

No path traversal protection

MarkItDown resolves file paths without checking for null-byte injection, UNC paths (\\server\share), or symlink traversal. Any filename from untrusted input is a potential directory traversal attack against the converting server.

No SSRF blocking

DOCX and PDF files can embed URLs that trigger outbound requests when processed. MarkItDown performs no IP-range validation. RFC-1918 addresses, 169.254.169.254 (cloud metadata), and file:// URIs are all reachable.
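The kind of check that is missing here can be sketched with the standard library alone. This is an illustrative sketch, not MarkSentry's code: the name `is_blocked_url` is hypothetical, the block lists are copied from the ranges named on this page, and hostnames that need DNS resolution are deliberately left out of scope.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical block lists, taken from the ranges and schemes named on this page.
BLOCKED_SCHEMES = {"file", "smb", "ldap"}
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
        "127.0.0.0/8",                                     # loopback
        "169.254.0.0/16",                                  # link-local, cloud metadata
        "::1/128", "fe80::/10", "fc00::/7",                # IPv6 equivalents
    )
]


def is_blocked_url(url: str) -> bool:
    """Return True if the URL uses a blocked scheme or a literal blocked IP."""
    parts = urlsplit(url)
    if parts.scheme.lower() in BLOCKED_SCHEMES:
        return True
    host = parts.hostname
    if host is None:
        return True  # no host to validate, so fail closed
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; resolving it is out of scope for this sketch
    return any(address in network for network in BLOCKED_NETWORKS)
```

A production check would also have to resolve hostnames and re-validate the resulting addresses, since a DNS name can point at any of these ranges.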

No macro stripping

OOXML containers (.docx, .xlsx, .pptx) can carry VBA macros in vbaProject.bin. MarkItDown extracts and renders these files without removing the macro payload, leaving it available to any downstream processor that re-opens the document.
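Removing the macro payload amounts to rewriting the ZIP without that member. A minimal sketch, assuming the whole file fits in memory; `strip_vba` is a hypothetical name, not MarkSentry's API:

```python
import io
import zipfile


def strip_vba(ooxml_bytes: bytes) -> bytes:
    """Return a copy of an OOXML archive without any vbaProject.bin member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Compare only the final path component, case-insensitively.
            if info.filename.rsplit("/", 1)[-1].lower() == "vbaproject.bin":
                continue
            dst.writestr(info, src.read(info.filename))
    return out.getvalue()
```

This sketch only drops the binary. Entries in [Content_Types].xml and the relationship files that still point at the removed part are left untouched, which a full sanitizer would likely also want to clean up.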

No zip-bomb detection

ZIP-based formats can nest hundreds of layers of compressed content, and a small archive can expand to gigabytes once decompressed. MarkItDown imposes no ratio or nesting-depth limit -- a single crafted .docx can exhaust server memory.
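A ratio limit can be enforced before anything is decompressed, because the ZIP central directory declares both sizes for every member. A rough sketch with hypothetical names; the 100:1 threshold is the one quoted for MarkSentry's sanitizer on this page:

```python
import io
import zipfile

MAX_RATIO = 100  # assumed threshold, matching the 100:1 figure on this page


def compression_ratio(data: bytes) -> float:
    """Declared uncompressed size divided by compressed size for the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = archive.infolist()
        compressed = sum(member.compress_size for member in members)
        expanded = sum(member.file_size for member in members)
    return expanded / max(compressed, 1)  # avoid division by zero on empty archives


def looks_like_zip_bomb(data: bytes) -> bool:
    return compression_ratio(data) > MAX_RATIO
```

The declared sizes come from archive headers, which an attacker controls, so a stricter implementation would also cap the number of bytes actually read while extracting.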

No magic-byte validation

Renaming a PE executable to report.pdf bypasses MarkItDown's extension-based dispatch. MarkSentry reads the file header and rejects any file whose magic bytes do not match the claimed extension before any parser is invoked.
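The header check can be illustrated in a few lines. This is a sketch under stated assumptions: the signature table below is a small hypothetical subset, and `header_matches_extension` is not a name taken from MarkSentry.

```python
from pathlib import Path

# Hypothetical signature table: extension -> accepted leading bytes.
MAGIC_BYTES = {
    ".pdf": (b"%PDF-",),
    ".zip": (b"PK\x03\x04",),
    ".docx": (b"PK\x03\x04",),  # OOXML files are ZIP containers
    ".xlsx": (b"PK\x03\x04",),
    ".pptx": (b"PK\x03\x04",),
}


def header_matches_extension(path) -> bool:
    """Return True only if the file starts with a signature for its extension."""
    path = Path(path)
    signatures = MAGIC_BYTES.get(path.suffix.lower())
    if signatures is None:
        return False  # unknown extension: reject rather than guess
    with path.open("rb") as handle:
        header = handle.read(8)
    return header.startswith(signatures)
```

A PE executable begins with the bytes `MZ`, so a renamed `report.pdf` fails the comparison before any PDF parser sees it.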

No PII redaction

Extracted text from medical records, financial filings, and legal documents lands verbatim in Markdown. SSNs, credit card numbers, PEM private keys, AWS credentials, and JWT tokens flow unmodified into RAG vector stores and log files.

The four-stage pipeline

Zero-Trust Sanitizer -- Reject Before Parsing

Every input passes through the sanitizer before any parser is invoked. Paths that contain null bytes, UNC prefixes, or URI schemes, or that point at symlinks, are rejected immediately. The path is then resolved and jailed to allowed_base using pathlib's Path.relative_to(). The file header is read and matched against the magic bytes for the claimed extension. ZIP containers are rejected when the compression ratio exceeds 100:1 or the nesting depth exceeds 3. OOXML files have vbaProject.bin stripped from the ZIP before any parser touches them. All embedded URLs are scanned against blocked schemes and IP ranges: RFC-1918, 127.0.0.0/8, 169.254.0.0/16, file://, smb://, ldap://, and IPv6 equivalents.
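The path-jailing step can be sketched as follows. Only `allowed_base` and `relative_to()` come from the description above; `jail_path` and `UnsafePathError` are hypothetical names, and the order of checks is an assumption.

```python
from pathlib import Path


class UnsafePathError(ValueError):
    """Raised when a user-supplied path fails a sanitizer check."""


def jail_path(user_path: str, allowed_base: Path) -> Path:
    """Resolve user_path and confirm it stays inside allowed_base."""
    if "\x00" in user_path:
        raise UnsafePathError("null byte in path")
    if user_path.startswith(("\\\\", "//")):
        raise UnsafePathError("UNC path")
    if "://" in user_path:
        raise UnsafePathError("URI scheme")

    base = allowed_base.resolve()
    candidate = base / user_path
    if candidate.is_symlink():
        raise UnsafePathError("symlink")

    resolved = candidate.resolve()  # collapses ".." and follows parent symlinks
    try:
        resolved.relative_to(base)  # raises ValueError if outside the jail
    except ValueError:
        raise UnsafePathError("path escapes allowed_base") from None
    return resolved
```

Resolving before the `relative_to()` comparison is what makes the jail meaningful: `../` segments and symlinked parent directories are flattened first, so the comparison runs on the real target path.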

Format Parser Registry -- PDF, DOCX, ZIP

After sanitization, input is dispatched to the appropriate parser by extension. PdfParser uses pdfminer.six with per-page bounding-box extraction. DocxParser iterates body elements via python-docx and lxml, rendering headings, lists, tables, and hyperlinks to GFM Markdown. ZipParser recursively extracts members up to MAX_NEST_DEPTH=2 and dispatches each to the registry, with a second path-traversal fence on every extracted member path. OMML equations are extracted from m:oMath elements and converted to LaTeX via recursive XML descent over the OMML namespace.
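To illustrate the recursive descent over OMML, here is a deliberately small sketch that handles only fractions, superscripts, subscripts, and text runs. It uses the standard library's ElementTree instead of lxml to stay self-contained, and `omml_to_latex` is a hypothetical name; the element coverage in MarkSentry itself is not documented on this page.

```python
import xml.etree.ElementTree as ET

OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def _tag(name: str) -> str:
    """Build the namespace-qualified tag ElementTree uses for m:<name>."""
    return f"{{{OMML_NS}}}{name}"


def omml_to_latex(node: ET.Element) -> str:
    """Convert a (small) subset of OMML to LaTeX by recursive descent."""

    def children(element: ET.Element) -> str:
        return "".join(omml_to_latex(child) for child in element)

    def part(name: str) -> str:
        child = node.find(_tag(name))
        return children(child) if child is not None else ""

    if node.tag == _tag("t"):      # text run content
        return node.text or ""
    if node.tag == _tag("f"):      # fraction: m:num over m:den
        return f"\\frac{{{part('num')}}}{{{part('den')}}}"
    if node.tag == _tag("sSup"):   # superscript: m:e raised to m:sup
        return f"{{{part('e')}}}^{{{part('sup')}}}"
    if node.tag == _tag("sSub"):   # subscript: m:e with m:sub
        return f"{{{part('e')}}}_{{{part('sub')}}}"
    return children(node)          # containers such as m:oMath and m:r
```

Unknown elements fall through to the generic branch, so unsupported constructs degrade to their plain text content instead of raising.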

Multi-Column Layout Processor -- Gap-Analysis Algorithm

PDF documents from academic papers, technical reports, and scanned journals frequently use two-column layouts. Naive line-by-line extraction interleaves both columns. MarkSentry builds a 1-point-resolution horizontal coverage histogram from all text bounding boxes on the page, finds the widest gap between non-zero regions, and uses that gap to define column bands. Blocks that span more than 70% of page width are classified as full-width (titles, captions, equations) and placed outside the column reconstruction. The final reading order is: full-width header blocks, then the column blocks in column-major order (the whole left column top to bottom, then the right column), then full-width footer blocks.
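The gap analysis can be sketched in plain Python. This is a simplified reading of the description above, not MarkSentry's implementation: the function names are hypothetical, boxes are assumed to be `(x0, y0, x1, y1)` tuples in points with the origin at the bottom-left (as pdfminer.six reports them), and any full-width block that is not above the columns is treated as a footer.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def is_full_width(box: Box, page_width: float, ratio: float = 0.7) -> bool:
    return (box[2] - box[0]) > ratio * page_width


def find_column_gap(boxes: List[Box], page_width: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the widest empty band between text regions."""
    coverage = [0] * int(page_width)  # one bucket per point of page width
    for box in boxes:
        if is_full_width(box, page_width):
            continue  # titles and captions would hide the gap
        for x in range(max(int(box[0]), 0), min(int(box[2]), len(coverage))):
            coverage[x] += 1

    best, gap_start, seen_text = None, None, False
    for x, count in enumerate(coverage):
        if count == 0:
            if seen_text and gap_start is None:
                gap_start = x  # a gap only counts once text lies to its left
        else:
            if gap_start is not None:
                if best is None or x - gap_start > best[1] - best[0]:
                    best = (gap_start, x)
                gap_start = None
            seen_text = True
    return best  # page margins are never closed by text, so they are ignored


def reading_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Header blocks, then left column, then right column, then footer blocks."""
    def top_first(box: Box) -> float:
        return -box[3]  # larger y1 means higher on the page

    gap = find_column_gap(boxes, page_width)
    if gap is None:
        return sorted(boxes, key=top_first)  # single-column page

    middle = (gap[0] + gap[1]) / 2
    narrow = [b for b in boxes if not is_full_width(b, page_width)]
    wide = [b for b in boxes if is_full_width(b, page_width)]
    columns_top = max(b[3] for b in narrow)

    header = [b for b in wide if b[1] >= columns_top]
    footer = [b for b in wide if b[1] < columns_top]
    left = [b for b in narrow if (b[0] + b[2]) / 2 < middle]
    right = [b for b in narrow if (b[0] + b[2]) / 2 >= middle]
    return [*sorted(header, key=top_first), *sorted(left, key=top_first),
            *sorted(right, key=top_first), *sorted(footer, key=top_first)]
```

Excluding full-width blocks before building the histogram matters: a single title line spanning the page would otherwise cover the gutter and make the page look single-column.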

Evaluated on 50 IEEE double-column PDFs: 90% column boundary accuracy.

PII Filter -- 10-Category Regex Engine with Luhn Validation

After Markdown is produced, the PII filter scans the text using compiled regex patterns for: PEM private keys, AWS access keys, AWS secret keys, JWT tokens, US SSNs, email addresses, credit card numbers (Luhn-validated to eliminate 16-digit false positives), phone numbers, IPv4 addresses, password assignment expressions, and...
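The Luhn step for credit card candidates can be sketched like this. The regex, the placeholder text, and the function names are assumptions for illustration; only the idea of validating regex matches with the Luhn checksum comes from the description above.

```python
import re

# Assumed candidate pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_cards(text: str, placeholder: str = "[REDACTED:CARD]") -> str:
    """Replace only those digit runs that pass the Luhn check."""
    def replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return placeholder if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE.sub(replace, text)
```

Digit runs that fail the checksum, such as order numbers or arbitrary 16-digit identifiers, are left in place, which is how the Luhn step cuts false positives.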
