MarkSentry – zero-trust document-to-Markdown for RAG pipelines

sunilgentyala1 pts0 comments

MarkSentry -- Zero-Trust Document-to-Markdown Conversion

Open Source | Zero-Trust Input | Multi-Column PDF | PII Redaction | MIT License

Document conversion with zero-trust security

MarkItDown converts documents. MarkSentry converts them securely. Path traversal jailing, SSRF blocking, VBA macro stripping, zip-bomb detection, multi-column PDF layout reconstruction, OMML-to-LaTeX math, and PII redaction -- all local, no cloud dependencies.


Built with

Python 3.10+

pdfminer.six

python-docx

lxml

click + rich

17 Tests Passing (pytest)

10 PII Pattern Categories

0/25 Security Bypasses in Evaluation

90% Multi-Column Layout Accuracy

The problem

What MarkItDown ignores entirely

Microsoft MarkItDown converts documents to Markdown quickly -- and that is where the safety story ends. Every path it trusts, every URL it leaves embedded, every macro container it skips becomes an attack surface in RAG pipelines and document processing workflows. MarkSentry was built to close those gaps.

No path traversal protection

MarkItDown resolves file paths without checking for null-byte injection, UNC paths (\\server\share), or symlink traversal. Any filename from untrusted input is a potential directory traversal attack against the converting server.

No SSRF blocking

DOCX and PDF files can embed URLs that trigger outbound requests when processed. MarkItDown performs no IP-range validation. RFC-1918 addresses, 169.254.169.254 (cloud metadata), and file:// URIs are all reachable.
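The kind of IP-range validation this gap calls for can be sketched with the Python standard library alone. The function name `is_blocked_url` and the allowed-scheme set are assumptions made for illustration, not MarkSentry's actual API. The sketch only inspects literal IP hosts; a real check would also need to resolve hostnames before deciding.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: only plain web schemes are acceptable; everything else is blocked.
ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_url(url: str) -> bool:
    """Return True if the URL uses a disallowed scheme or a non-public literal IP."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return True  # malformed URL (e.g. broken IPv6 literal): refuse it
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return True  # file://, smb://, ldap://, ...
    host = parts.hostname
    if host is None:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Host is a DNS name, not an IP literal. This sketch does not
        # resolve it, so it cannot tell where the name points.
        return False
    # Covers RFC-1918, 127.0.0.0/8, 169.254.0.0/16 and their IPv6 equivalents.
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

For example, `is_blocked_url("http://169.254.169.254/latest/meta-data/")` returns `True`, while a public literal such as `https://8.8.8.8/` passes.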

No macro stripping

OOXML containers (.docx, .xlsx, .pptx) can carry VBA macros in vbaProject.bin. MarkItDown extracts and renders these files without removing the macro payload, leaving it available to any downstream processor that re-opens the document.
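Removing the macro payload amounts to rewriting the ZIP container without the vbaProject.bin member. The following is a minimal sketch under that assumption; `strip_vba` is a hypothetical name, and a production version would likely also clean up the relationship and content-type entries that reference the removed part.

```python
import io
import zipfile


def strip_vba(ooxml_bytes: bytes) -> bytes:
    """Copy an OOXML ZIP container, skipping any vbaProject.bin member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Match regardless of folder (word/, xl/, ppt/) and letter case.
            if info.filename.lower().endswith("vbaproject.bin"):
                continue
            dst.writestr(info.filename, src.read(info.filename))
    return out.getvalue()
```

The function works on bytes in memory, so the original file on disk is never modified and no parser sees the macro container.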

No zip-bomb detection

ZIP-based formats can nest hundreds of layers of compressed content with compression ratios that expand to gigabytes. MarkItDown imposes no ratio or nesting-depth limit -- a single crafted .docx can exhaust server memory.
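A ratio limit can be checked from the archive's central directory before anything is decompressed. This is a minimal sketch: `exceeds_ratio` is a hypothetical name, the default of 100 mirrors the 100:1 threshold the sanitizer uses, and the sizes read here are the ones the archive declares, so a hardened version would also cap bytes actually read during extraction.

```python
import io
import zipfile

MAX_RATIO = 100  # uncompressed : compressed, matching the 100:1 sanitizer limit


def exceeds_ratio(data: bytes, max_ratio: float = MAX_RATIO) -> bool:
    """Return True if the declared expansion ratio of a ZIP exceeds max_ratio."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        compressed = sum(i.compress_size for i in infos)
        uncompressed = sum(i.file_size for i in infos)
    if compressed == 0:
        # Nothing stored but content claimed: treat as suspicious.
        return uncompressed > 0
    return uncompressed / compressed > max_ratio
```

A megabyte of zero bytes deflates to roughly a kilobyte, so an archive containing it trips the check, while ordinary text stays far below the limit.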

No magic-byte validation

Renaming a PE executable to report.pdf bypasses MarkItDown's extension-based dispatch. MarkSentry reads the file header and rejects any file whose magic bytes do not match the claimed extension before any parser is invoked.
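A header check of this kind can be sketched as a small lookup table. The table contents and the name `header_matches_extension` are illustrative assumptions, not MarkSentry's actual signature list; the signatures themselves (`%PDF-` for PDF, `PK\x03\x04` for ZIP-based formats) are the standard ones.

```python
from pathlib import Path

# Assumed signature table for illustration; OOXML files are ZIP containers.
MAGIC = {
    ".pdf": (b"%PDF-",),
    ".docx": (b"PK\x03\x04",),
    ".xlsx": (b"PK\x03\x04",),
    ".pptx": (b"PK\x03\x04",),
    ".zip": (b"PK\x03\x04",),
}


def header_matches_extension(path: str) -> bool:
    """Return True only if the file's first bytes match its claimed extension."""
    signatures = MAGIC.get(Path(path).suffix.lower())
    if signatures is None:
        return False  # unknown extension: reject rather than guess
    with open(path, "rb") as fh:
        head = fh.read(8)
    return head.startswith(signatures)
```

A PE executable begins with the bytes `MZ`, so a renamed report.pdf fails the comparison before any PDF parser is reached.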

No PII redaction

Extracted text from medical records, financial filings, and legal documents lands verbatim in Markdown. SSNs, credit card numbers, PEM private keys, AWS credentials, and JWT tokens flow unmodified into RAG vector stores and log files.

The four-stage pipeline

Zero-Trust Sanitizer -- Reject Before Parsing

Every input passes through the sanitizer before any parser is invoked. Null bytes, UNC paths, URI schemes, and symlinks are rejected immediately. The path is resolved and jailed to allowed_base using pathlib's Path.relative_to(). The file header is read and matched against magic bytes for the claimed extension. ZIP containers are checked for ratio (>100:1) and nesting depth (>3). OOXML files have vbaProject.bin stripped from the ZIP before any parser touches them. All embedded URLs are scanned against blocked IP ranges and schemes: RFC-1918, 127.0.0.0/8, 169.254.0.0/16, file://, smb://, ldap://, and IPv6 equivalents.
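The path-jailing step can be sketched as follows. Only `allowed_base` and `relative_to()` come from the description above; `jail_path`, `UnsafePathError`, and the exact order of checks are assumptions made for illustration.

```python
from pathlib import Path


class UnsafePathError(ValueError):
    """Raised when an input path fails a sanitizer check (hypothetical name)."""


def jail_path(user_path: str, allowed_base: str) -> Path:
    """Resolve user_path and confirm it stays inside allowed_base."""
    if "\x00" in user_path:
        raise UnsafePathError("null byte in path")
    if user_path.startswith(("\\\\", "//")):
        raise UnsafePathError("UNC path")
    if "://" in user_path:
        raise UnsafePathError("URI scheme")

    base = Path(allowed_base).resolve()
    candidate = base / user_path
    if candidate.is_symlink():
        raise UnsafePathError("symlink")

    # resolve() collapses ".." and follows any symlinked parent directories,
    # so the containment test below sees the real target location.
    resolved = candidate.resolve()
    try:
        resolved.relative_to(base)
    except ValueError:
        raise UnsafePathError("path escapes allowed_base") from None
    return resolved
```

Because `relative_to()` raises `ValueError` when the resolved path is not under the base, an input such as `../etc/passwd` is rejected without any string matching on `..`.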

Format Parser Registry -- PDF, DOCX, ZIP

After sanitization, input is dispatched to the appropriate parser by extension. PdfParser uses pdfminer.six with per-page bounding-box extraction. DocxParser iterates body elements via python-docx and lxml, rendering headings, lists, tables, and hyperlinks to GFM Markdown. ZipParser recursively extracts members up to MAX_NEST_DEPTH=2 and dispatches each to the registry, with a second path-traversal fence on every extracted member path. OMML equations are extracted from m:oMath elements and converted to LaTeX via recursive XML descent over the OMML namespace.

Multi-Column Layout Processor -- Gap-Analysis Algorithm

PDF documents from academic papers, technical reports, and scanned journals frequently use two-column layouts. Naive line-by-line extraction interleaves both columns. MarkSentry builds a 1-point-resolution horizontal coverage histogram from all text bounding boxes on the page, finds the widest gap between non-zero regions, and uses that gap to define column bands. Blocks that span more than 70% of page width are classified as full-width (titles, captions, equations) and placed outside the column reconstruction. The final reading order is: full-width header blocks, then column-major left-to-right interleave, then full-width footer blocks.
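The gap-analysis idea can be illustrated with a short sketch. It is not MarkSentry's implementation: `find_column_gap` is a hypothetical name, boxes are reduced to their horizontal extents `(x0, x1)`, and full-width blocks are assumed to be left out of the histogram so that they do not fill the gap between columns.

```python
import math
from typing import List, Optional, Tuple


def find_column_gap(
    boxes: List[Tuple[float, float]],
    page_width: float,
    full_width_ratio: float = 0.7,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the widest interior gap in points, or None."""
    covered = [False] * int(math.ceil(page_width))
    for x0, x1 in boxes:
        # Assumption: blocks wider than 70% of the page are full-width
        # and are excluded from the coverage histogram.
        if (x1 - x0) > full_width_ratio * page_width:
            continue
        lo = max(0, int(math.floor(x0)))
        hi = min(len(covered), int(math.ceil(x1)))
        for x in range(lo, hi):
            covered[x] = True

    best = None
    gap_start = None
    seen_text = False
    for x, is_covered in enumerate(covered):
        if is_covered:
            # A gap counts only if text lies on both sides of it,
            # which excludes the left and right page margins.
            if gap_start is not None and seen_text:
                if best is None or (x - gap_start) > (best[1] - best[0]):
                    best = (gap_start, x)
            gap_start = None
            seen_text = True
        elif gap_start is None:
            gap_start = x
    return best
```

On a 612-point-wide page with column blocks at `(50, 290)` and `(320, 560)` plus a full-width title at `(50, 560)`, the function returns `(290, 320)`; blocks left of that band would go to the first column and blocks right of it to the second.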

Evaluated on 50 IEEE double-column PDFs: 90% column boundary accuracy.

PII Filter -- 10-Category Regex Engine with Luhn Validation

After Markdown is produced, the PII filter scans the text using compiled regex patterns for: PEM private keys, AWS access keys, AWS secret keys, JWT tokens, US SSNs, email addresses, credit card numbers (Luhn-validated to eliminate 16-digit false positives), phone numbers, IPv4 addresses, password assignment expressions, and...
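The credit card category shows why a regex alone is not enough: any 16-digit run matches, so the Luhn checksum is used to discard numbers that cannot be real card numbers. Below is a minimal sketch of that one category; the pattern, the function names, and the `[REDACTED:CARD]` placeholder are assumptions for illustration and not MarkSentry's actual patterns or output format.

```python
import re

# Assumed pattern: 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(text: str) -> str:
    """Replace Luhn-valid 16-digit sequences; leave other digit runs intact."""
    def _replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED:CARD]" if luhn_valid(digits) else match.group()

    return CARD_RE.sub(_replace, text)
```

With this sketch, the well-known test number `4111 1111 1111 1111` is redacted, while an order ID such as `1234567812345678` fails the checksum and is left untouched.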
