MarkSentry -- Zero-Trust Document-to-Markdown Conversion
Open Source<br>Zero-Trust Input<br>Multi-Column PDF<br>PII Redaction<br>MIT License
Document conversion with<br>zero-trust security
MarkItDown converts documents. MarkSentry converts them securely.<br>Path traversal jailing, SSRF blocking, VBA macro stripping, zip-bomb detection,<br>multi-column PDF layout reconstruction, OMML-to-LaTeX math, and PII redaction --<br>all local, no cloud dependencies.
Built with
Python 3.10+
pdfminer.six
python-docx
lxml
click + rich
17<br>Tests Passing (pytest)
10<br>PII Pattern Categories
0/25<br>Security Bypasses in Evaluation
90%<br>Multi-Column Layout Accuracy
The problem
What MarkItDown ignores entirely
Microsoft MarkItDown converts documents to Markdown quickly -- and that is where the safety story ends.<br>Every path it trusts, every URL it leaves embedded, every macro container it skips becomes an attack<br>surface in RAG pipelines and document processing workflows. MarkSentry was built to close those gaps.
No path traversal protection
MarkItDown resolves file paths without checking for null-byte injection, UNC paths<br>(\\server\share), or symlink traversal. Any filename from untrusted input<br>is a potential directory traversal attack against the converting server.
No SSRF blocking
DOCX and PDF files can embed URLs that trigger outbound requests when processed.<br>MarkItDown performs no IP-range validation. RFC-1918 addresses, 169.254.169.254<br>(cloud metadata), and file:// URIs are all reachable.
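A minimal sketch of the kind of URL check this gap calls for, using only the standard library. The function name `is_blocked_url` and the `ALLOWED_SCHEMES` policy are assumptions for illustration, not MarkSentry's actual API. The sketch only inspects literal IP addresses; hostnames that resolve to internal addresses would need a separate DNS-aware check.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical policy: only these schemes may survive in extracted links.
ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_url(url: str) -> bool:
    """Return True for disallowed schemes or literal non-public IP hosts."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return True  # malformed URL (for example a broken IPv6 literal)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return True  # file://, smb://, ldap://, and anything else unexpected
    if host is None:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Resolving the name and re-checking the result
        # is out of scope for this sketch.
        return False
    # is_private covers RFC 1918 ranges; loopback and link-local
    # (169.254.0.0/16, fe80::/10) are tested explicitly for clarity.
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

The same test handles IPv4 and IPv6 literals because `ipaddress.ip_address` returns the appropriate address type for either.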
No macro stripping
OOXML containers (.docx, .xlsx, .pptx) can carry VBA macros in vbaProject.bin.<br>MarkItDown extracts and renders these files without removing the macro payload,<br>leaving it available to any downstream processor that re-opens the document.
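One way to drop the macro payload is to copy the container member by member and skip `vbaProject.bin`. The sketch below assumes the whole file fits in memory, and `strip_vba` is a hypothetical name. It does not rewrite `[Content_Types].xml` or the relationship parts that may still point at the removed member, which a complete implementation would likely need to handle.

```python
import io
import zipfile


def strip_vba(data: bytes) -> bytes:
    """Return a copy of an OOXML/ZIP container without any vbaProject.bin member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Match word/vbaProject.bin, xl/vbaProject.bin, ppt/vbaProject.bin.
            if info.filename.lower().endswith("vbaproject.bin"):
                continue
            # Reusing the ZipInfo keeps the member name, timestamp,
            # and compression type of the original entry.
            dst.writestr(info, src.read(info))
    return out.getvalue()
```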
No zip-bomb detection
ZIP-based formats can nest hundreds of layers of compressed content with<br>compression ratios that expand to gigabytes. MarkItDown imposes no ratio or<br>nesting-depth limit -- a single crafted .docx can exhaust server memory.
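A ratio limit can be sketched by comparing the sizes declared in the ZIP central directory, without extracting anything. `exceeds_ratio` and `MAX_RATIO` are hypothetical names; the 100:1 default mirrors the limit MarkSentry's sanitizer is described as using. Declared sizes can be forged, so this is a cheap first filter rather than a complete defense, and nesting depth would need a separate recursive check.

```python
import io
import zipfile

MAX_RATIO = 100  # assumed limit, expressed as expanded bytes per compressed byte


def exceeds_ratio(data: bytes, max_ratio: int = MAX_RATIO) -> bool:
    """Return True if the archive's declared expansion ratio is above max_ratio."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        members = zf.infolist()
        compressed = sum(info.compress_size for info in members)
        expanded = sum(info.file_size for info in members)
    if compressed == 0:
        # Nothing stored: only suspicious if the headers still claim content.
        return expanded > 0
    return expanded / compressed > max_ratio
```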
No magic-byte validation
Renaming a PE executable to report.pdf bypasses MarkItDown's extension-based<br>dispatch. MarkSentry reads the file header and rejects any file whose magic bytes<br>do not match the claimed extension before any parser is invoked.
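The header check can be illustrated with a small signature table. The table contents and the name `header_matches_extension` are assumptions for this sketch; MarkSentry's real table may cover more formats and longer signatures.

```python
from pathlib import PurePath

# Hypothetical signature table: extension -> accepted leading bytes.
MAGIC = {
    ".pdf": (b"%PDF-",),
    ".docx": (b"PK\x03\x04",),
    ".xlsx": (b"PK\x03\x04",),
    ".pptx": (b"PK\x03\x04",),
    ".zip": (b"PK\x03\x04",),
}


def header_matches_extension(filename: str, head: bytes) -> bool:
    """Return True only if the first bytes match the claimed extension."""
    signatures = MAGIC.get(PurePath(filename).suffix.lower())
    if signatures is None:
        return False  # unknown extension: reject by default
    return head.startswith(signatures)
```

A caller would read the first few bytes of the file in binary mode and pass them as `head`, so no parser sees the file until the check passes.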
No PII redaction
Extracted text from medical records, financial filings, and legal documents lands<br>verbatim in Markdown. SSNs, credit card numbers, PEM private keys, AWS credentials,<br>and JWT tokens flow unmodified into RAG vector stores and log files.
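To make the redaction idea concrete, here is a minimal sketch covering two of the categories named above, US SSNs and credit card numbers. The patterns, placeholder strings, and function names are assumptions for illustration and are much simpler than a production pattern set. Card candidates are confirmed with the Luhn checksum so that arbitrary long digit runs are left alone.

```python
import re

# Simplified, hypothetical patterns.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace SSNs and Luhn-valid card numbers with placeholders."""
    text = SSN_RE.sub("[REDACTED:SSN]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        # Keep the original text when the checksum fails.
        return "[REDACTED:CARD]" if luhn_ok(digits) else match.group()

    return CARD_RE.sub(_card, text)
```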
The four-stage pipeline
Zero-Trust Sanitizer -- Reject Before Parsing
Every input passes through the sanitizer before any parser is invoked.<br>Null bytes, UNC paths, URI schemes, and symlinks are rejected immediately.<br>The path is resolved and jailed to allowed_base using pathlib's<br>Path.relative_to(). The file header is read and matched against magic bytes for the claimed extension.<br>ZIP containers are checked for compression ratio (>100:1) and nesting depth (>3).<br>OOXML files have vbaProject.bin stripped from the ZIP before any parser touches them.<br>All embedded URLs are scanned against blocked IP ranges (RFC-1918, 127.0.0.0/8, 169.254.0.0/16,<br>and their IPv6 equivalents) and blocked URI schemes (file://, smb://, ldap://).
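The path-jailing step can be sketched as follows. `jail_path` and `RejectedPath` are hypothetical names, and the order of checks is an assumption; only the use of `relative_to()` against `allowed_base` comes from the description above.

```python
import re
from pathlib import Path

_URI_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


class RejectedPath(ValueError):
    """Raised when an input path fails a sanitizer check."""


def jail_path(raw: str, allowed_base: Path) -> Path:
    """Resolve raw under allowed_base, rejecting anything that escapes it."""
    if "\x00" in raw:
        raise RejectedPath("null byte in path")
    if raw.startswith(("\\\\", "//")):
        raise RejectedPath("UNC path")
    if _URI_SCHEME.match(raw):
        raise RejectedPath("URI scheme in path")

    base = allowed_base.resolve()
    candidate = base / raw  # an absolute raw path replaces base here
    if candidate.is_symlink():
        raise RejectedPath("symlink")

    # resolve() collapses ".." and follows symlinks in parent directories,
    # so the containment test runs on the real target.
    resolved = candidate.resolve()
    try:
        resolved.relative_to(base)
    except ValueError:
        raise RejectedPath("path escapes allowed_base") from None
    return resolved
```

Running the containment test after `resolve()` matters: checking the raw string for `..` alone would miss symlinked parent directories that point outside the jail.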
Format Parser Registry -- PDF, DOCX, ZIP
After sanitization, input is dispatched to the appropriate parser by extension.<br>PdfParser uses pdfminer.six with per-page bounding-box extraction.<br>DocxParser iterates body elements via python-docx and lxml, rendering headings,<br>lists, tables, and hyperlinks to GFM Markdown. ZipParser recursively<br>extracts members up to MAX_NEST_DEPTH=2 and dispatches each to the registry,<br>with a second path-traversal fence on every extracted member path.<br>OMML equations are extracted from m:oMath elements and converted to LaTeX<br>via recursive XML descent over the OMML namespace.
Multi-Column Layout Processor -- Gap-Analysis Algorithm
PDF documents from academic papers, technical reports, and scanned journals frequently<br>use two-column layouts. Naive line-by-line extraction interleaves both columns.<br>MarkSentry builds a 1-point-resolution horizontal coverage histogram from all<br>text bounding boxes on the page, finds the widest gap between non-zero regions,<br>and uses that gap to define column bands. Blocks that span more than 70% of page<br>width are classified as full-width (titles, captions, equations) and placed outside<br>the column reconstruction. The final reading order is: full-width header blocks,<br>then the column bands from left to right, each read top to bottom, then full-width footer blocks.
Evaluated on 50 IEEE double-column PDFs: 90% column boundary accuracy.
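The gap analysis described above can be sketched in a few lines. The `Box` type and the function names are assumptions for this example, and coverage is tracked as a boolean per point rather than a count, which is enough to find empty regions. Page margins are ignored because only gaps with text on both sides are considered.

```python
import math
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Horizontal extent of a text block, in points (hypothetical type)."""
    x0: float
    x1: float


def find_column_gap(boxes, page_width: float,
                    full_width_frac: float = 0.70) -> Optional[tuple]:
    """Return (start, end) of the widest empty band between text, or None."""
    width = int(page_width)
    covered = [False] * width
    for b in boxes:
        if (b.x1 - b.x0) > full_width_frac * page_width:
            continue  # full-width blocks are kept out of the histogram
        for x in range(max(0, int(b.x0)), min(width, math.ceil(b.x1))):
            covered[x] = True

    best = None
    run_start = None
    seen_text = False
    for x, is_covered in enumerate(covered):
        if is_covered:
            # A run of empty points ends here; count it only if text preceded it.
            if seen_text and run_start is not None:
                if best is None or (x - run_start) > (best[1] - best[0]):
                    best = (run_start, x)
            run_start = None
            seen_text = True
        elif run_start is None:
            run_start = x
    return best


def column_index(box: Box, gap: tuple) -> int:
    """0 for the left band, 1 for the right, judged by the box centre."""
    centre = (box.x0 + box.x1) / 2
    return 0 if centre < (gap[0] + gap[1]) / 2 else 1
```

Sorting blocks by `(column_index, vertical position)` would then give the column-by-column reading order for the non-full-width blocks.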
PII Filter -- 10-Category Regex Engine with Luhn Validation
After Markdown is produced, the PII filter scans the text using compiled regex patterns<br>for: PEM private keys, AWS access keys, AWS secret keys, JWT tokens, US SSNs, email addresses,<br>credit card numbers (Luhn-validated to eliminate 16-digit false positives),<br>phone numbers, IPv4 addresses, password assignment expressions, and...