DocLang
Open standard · Joint Development Foundation Project
The AI-native document format.
PDF was built for print. DOCX was built for editors. DocLang is built for what comes next — a machine-readable document standard your models can actually trust.
Read the spec<br>View on GitHub →
Your documents are lying to your models.
The world's knowledge lives in formats designed for rendering, not understanding. Markdown was built for readers. HTML for browsers. LaTeX for typesetting. PDF for print. None were built for machines.
Modern AI pipelines assume clean, structured input. Real-world documents — contracts, invoices, research papers, regulatory filings — are none of those things. Parsers guess at reading order. Tables become flat text. Figures vanish. Metadata is stripped.
The result: your model's accuracy is bottlenecked by document quality, not model quality. You spend more engineering time wrangling pre-processing than building the product.
parse("quarterly_report.pdf")
reading_order
expected<br>sequential hierarchy
received<br>undefined
table_structure
expected<br>3×12 grid with merged cells
received<br>flat string (156 chars)
figure_references
expected<br>8 embedded figures
received<br>0 (omitted)
document_metadata
expected<br>{ author, created, lang }
received<br>null
4 failed<br>0 passed
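The four checks in the panel above can be read as plain assertions about a parser's output. A minimal sketch in Python, assuming a layout-unaware parser that returns the shape the panel reports; the `naive_result` dictionary and the check function names are illustrative stand-ins, not part of DocLang or of any real parser.

```python
# Hypothetical output of a layout-unaware parser (an assumption for illustration):
# structure is lost, so every structural field is missing or flattened.
naive_result = {
    "reading_order": None,   # order was never recorded ("undefined")
    "table": "x" * 156,      # table flattened into a 156-character string
    "figures": [],           # embedded figures omitted
    "metadata": None,        # author / created / lang stripped
}


def check_reading_order(result):
    # Expected: an explicit sequence of element positions.
    return isinstance(result["reading_order"], list)


def check_table_structure(result):
    # Expected: a grid (list of rows), not one flat string.
    return isinstance(result["table"], list)


def check_figure_references(result):
    # Expected: the 8 embedded figures are all present.
    return len(result["figures"]) == 8


def check_document_metadata(result):
    # Expected: a mapping carrying author, created and lang.
    meta = result["metadata"]
    return isinstance(meta, dict) and {"author", "created", "lang"} <= meta.keys()


checks = [
    check_reading_order,
    check_table_structure,
    check_figure_references,
    check_document_metadata,
]
outcomes = {check.__name__: check(naive_result) for check in checks}
failed = sum(1 for ok in outcomes.values() if not ok)
passed = len(outcomes) - failed
print(f"{failed} failed, {passed} passed")  # prints "4 failed, 0 passed"
```

Swapping in a structured result (a list for reading order, a list of rows for the table, eight figure entries, a metadata mapping) flips each check to a pass.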
A document representation built for how AI actually reads.
DocLang defines a structured, machine-readable format for documents of any type. Not a converter. Not an API. A standard — like JSON for data, like HTML for the web — that any tool can implement and any pipeline can consume.
Every component carries a semantic tag, bounding box coordinates, and reading order, natively encoded in a format LLM tokenizers can parse without translation overhead. A table encodes its full grid structure via OTSL. A heading carries its level and page position. Your model doesn't have to guess. Governance metadata (PII flags, RAG permissions, training constraints) lives inside the document itself, not in a sidecar file.
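A concrete picture of a component that carries its semantic tag, bounding box coordinates, and reading order. A minimal sketch using Python's standard `xml.etree.ElementTree`; the element and attribute names (`doc`, `heading`, `paragraph`, `order`, `bbox`) are placeholders invented for this example, since the actual DocLang vocabulary is not shown on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag and attribute names are assumptions for illustration only.
SAMPLE = """
<doc>
  <heading level="1" order="0" bbox="48 40 420 72">Q3 Financial Summary</heading>
  <paragraph order="1" bbox="48 88 420 168">Example body text.</paragraph>
</doc>
"""

root = ET.fromstring(SAMPLE)

components = []
for element in root:
    components.append({
        "role": element.tag,                                          # semantic tag
        "bbox": tuple(int(v) for v in element.get("bbox").split()),   # page coordinates
        "order": int(element.get("order")),                           # explicit reading order
        "text": element.text,
    })

# Reading order is read from the document, not inferred from layout.
components.sort(key=lambda c: c["order"])
for c in components:
    print(c["order"], c["role"], c["bbox"], c["text"])
```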
The same standard extends beyond text documents. Audio transcripts, images, and video segments are encoded as first-class elements, with speakers, timestamps, and scenes expressed through the same primitives as headings and tables.
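One way to picture shared primitives across modalities. A minimal sketch with Python `dataclasses`, assuming a single component type whose position is either a bounding box on a page or a time range in a recording; the class name, field names, and sample values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Component:
    # Hypothetical shared primitive (an assumption, not DocLang's actual schema).
    role: str                                           # semantic tag, e.g. "heading" or "utterance"
    order: int                                          # explicit reading / playback order
    text: str
    bbox: Optional[Tuple[int, int, int, int]] = None    # position on a page
    time_range: Optional[Tuple[float, float]] = None    # start / end in seconds
    speaker: Optional[str] = None                       # only meaningful for audio


heading = Component(role="heading", order=0, text="Q3 Financial Summary",
                    bbox=(48, 40, 420, 72))
utterance = Component(role="utterance", order=1, text="Welcome to the earnings call.",
                      time_range=(0.0, 2.5), speaker="Speaker 1")

# Both modalities flow through the same code path.
for component in sorted([utterance, heading], key=lambda c: c.order):
    position = component.bbox if component.bbox is not None else component.time_range
    print(component.order, component.role, position)
```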
What your parser returns
Q3 2024Financial Re<br>port Net Revenue42<br>M51M39M Figure3.2<br>omitted author:null
What DocLang returns
author: J. Smith<br>heading level="1": Q3 2024<br>table: Revenue | $42M<br>figure uri="fig-3.2.png"
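Because a heading carries its level, a document outline can be built with a simple stack instead of font-size heuristics. A minimal sketch in Python; the `(level, title)` input is a hypothetical simplification of whatever DocLang emits for headings, and the titles are illustrative.

```python
def build_outline(headings):
    """Nest (level, title) pairs into a tree using their explicit levels."""
    root = {"title": None, "level": 0, "children": []}
    stack = [root]
    for level, title in headings:
        node = {"title": title, "level": level, "children": []}
        # Pop until the top of the stack is a shallower heading (the parent).
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


# Hypothetical headings, listed in reading order (titles are illustrative).
headings = [
    (1, "Q3 Financial Summary"),
    (2, "Revenue"),
    (2, "Outlook"),
    (1, "Appendix"),
]
outline = build_outline(headings)
for top in outline:
    print(top["title"], [child["title"] for child in top["children"]])
```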
Six properties. No compromises.
AI-native
Every element maps directly to LLM tokens. No translation layers, no postprocessing, no structural guesswork.
Lossless
Tables keep their full grid structure. Figures keep their position. Reading order is preserved, not inferred.
Expressive
Semantic roles, bounding boxes, document hierarchy — all fully encoded. Your model stops hallucinating structure.
Beyond documents
Audio transcripts, images, video segments — same format, same primitives. Speakers, timestamps, and scenes are native elements.
Unambiguous
One canonical representation per content type. No parser-dependent variance. Every tool produces the same output.
Open
A Joint Development Foundation Project standard and an LF AI & Data project. Public spec, open working group, no lock-in.
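The Unambiguous property becomes testable once a canonical form exists: two serializations of the same content should hash identically. A minimal sketch using W3C Canonical XML from Python's standard library (`xml.etree.ElementTree.canonicalize`, Python 3.8+) as a stand-in; whether DocLang defines its own canonicalization is not stated on this page, and the `heading` tag and its attributes are placeholders.

```python
import hashlib
import xml.etree.ElementTree as ET

# Two hypothetical tools emit the same content with different formatting:
# attribute order and whitespace inside the tag differ.
tool_a = '<heading level="1" order="0">Q3 2024</heading>'
tool_b = '<heading   order="0"  level="1" >Q3 2024</heading>'


def fingerprint(xml_text):
    # Canonical XML (C14N 2.0) normalizes attribute order and tag spacing.
    canonical = ET.canonicalize(xml_data=xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


same = fingerprint(tool_a) == fingerprint(tool_b)
print(same)  # prints "True"
```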
The business context layer for enterprise AI.
AI is only as reliable as the context it receives. DocLang transforms documents into structured business context that can be trusted across AI agents, workflows, and enterprise systems.
Business context, preserved
Structure alone is not enough. DocLang preserves the meaning, relationships, and business context behind your documents so AI systems can act on knowledge, not just content.
Fewer errors, faster decisions
Reliable structure means fewer errors in automated document workflows — fewer manual reviews, lower compliance exposure, and faster time-to-decision.
Audit-ready by default
Compliance metadata travels with the document, not alongside it. Legal and compliance teams define rules once, and every downstream system reads them automatically.
No lock-in, ever
Swap components as the market evolves. Your documents stay portable because the specification is standardized and any vendor can implement it.
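How rules defined once inside a document might be read by a downstream retrieval pipeline, as the Audit-ready card describes. A minimal sketch in Python with the standard `xml.etree.ElementTree`; the `governance` element and its `pii` and `rag` attributes are hypothetical names invented for this example, not DocLang's actual governance vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the governance tag and its attributes are assumptions.
DOCS = {
    "press_release": """
<doc>
  <governance pii="false" rag="allow"/>
  <paragraph>Q3 2024 revenue was $42M.</paragraph>
</doc>""",
    "hr_file": """
<doc>
  <governance pii="true" rag="deny"/>
  <paragraph>Employee record.</paragraph>
</doc>""",
}


def may_index_for_rag(xml_text):
    """Read the rules embedded in the document itself; default to deny."""
    rules = ET.fromstring(xml_text).find("governance")
    if rules is None:
        return False  # no rules found: fail closed
    return rules.get("rag") == "allow" and rules.get("pii") != "true"


indexable = [name for name, xml_text in DOCS.items() if may_index_for_rag(xml_text)]
print(indexable)  # prints "['press_release']"
```

Failing closed when no rules are present is a design choice of this sketch; a real deployment would take its default from policy.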
AI-native document format specification
DocLang is a constrained XML format built from the ground up for LLM tokenizers: a 1-to-1 mapping between DocLang tokens and model tokens, with a minimal token count. Every component carries a semantic role, a geometric bounding box, and a reading order. Tables use OTSL: 5 structural tokens where HTML needs 28.
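How a handful of structural tokens can describe a full grid, including merged cells. A minimal sketch in Python of the OTSL idea, using the five tokens of the published OTSL scheme (`C` new cell, `L` merge with the cell to the left, `U` merge with the cell above, `X` merge with both, `NL` end of row); how DocLang spells these tokens inside its XML is not shown on this page, so the plain-text spelling here is an assumption.

```python
def decode_otsl(tokens):
    """Turn a well-formed, rectangular OTSL sequence into (row, col, rowspan, colspan) cells."""
    # Split the flat sequence into rows at each NL token.
    rows, current = [], []
    for token in tokens:
        if token == "NL":
            rows.append(current)
            current = []
        else:
            current.append(token)

    cells = []
    for r, row in enumerate(rows):
        for c, token in enumerate(row):
            if token != "C":
                continue  # L / U / X positions belong to an earlier cell
            # Extend right while the neighbor merges leftwards.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "L":
                colspan += 1
            # Extend down while the cell below merges upwards.
            rowspan = 1
            while r + rowspan < len(rows) and rows[r + rowspan][c] == "U":
                rowspan += 1
            cells.append((r, c, rowspan, colspan))
    return cells


# A 3x3 grid: the first header spans two columns, the last header spans two rows.
tokens = [
    "C", "L", "C", "NL",
    "C", "C", "U", "NL",
    "C", "C", "C", "NL",
]
cells = decode_otsl(tokens)
print(cells)
```

The decoded spans cover all nine grid positions with seven cells, so the merged structure survives without any per-cell span attributes.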
Full spec and reference implementation on GitHub →
DocLang<br>quarterly_report.dclg.xml
heading level="1"<br>bounding box: 48, 40, 420, 72<br>Q3 Financial Summary
table<br>bounding box: 48, 88, 420, 168<br>Quarter | Revenue | YoY<br>Q3 2024 | $42M | +18%
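For the table in the example above, structure and content can be paired by walking the structural tokens and consuming one cell text per `C` token. A minimal sketch in Python, assuming an OTSL grid of two rows and three columns with no merged cells; the token spelling and the flat list of cell texts are illustrative, since the page does not show how DocLang interleaves them.

```python
# Assumed OTSL grid for the example table: two rows, three columns, no merges.
structure = ["C", "C", "C", "NL",
             "C", "C", "C", "NL"]
# Cell texts in reading order, taken from the example above.
texts = ["Quarter", "Revenue", "YoY",
         "Q3 2024", "$42M", "+18%"]


def to_rows(structure, texts):
    """Rebuild rows by consuming one text per C token and closing a row at NL."""
    remaining = iter(texts)
    rows, current = [], []
    for token in structure:
        if token == "NL":
            rows.append(current)
            current = []
        elif token == "C":
            current.append(next(remaining))
        else:
            current.append(None)  # merged position: no text of its own
    return rows


rows = to_rows(structure, texts)
header, *body = rows
for row in body:
    print(dict(zip(header, row)))
```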
Join the working group.
The spec, the reference implementation, and the working group processes are all public. The standard improves when more perspectives are in the room.
Ready to get...