The AI-native document format

DocLang

Open standard · Joint Development Foundation Project


PDF was built for print. DOCX was built for editors. DocLang is built for what comes next — a machine-readable document standard your models can actually trust.


Your documents are lying to your models.

The world's knowledge lives in formats designed for rendering, not understanding. Markdown was built for readers. HTML for browsers. LaTeX for typesetting. PDF for print. None were built for machines.

Modern AI pipelines assume clean, structured input. Real-world documents (contracts, invoices, research papers, regulatory filings) are neither. Parsers guess at reading order. Tables become flat text. Figures vanish. Metadata is stripped.

The result: your model's accuracy is bottlenecked by document quality, not model quality. You spend more engineering time wrangling pre-processing than building the product.

parse("quarterly_report.pdf")

reading_order
  expected: sequential hierarchy
  received: undefined

table_structure
  expected: 3×12 grid with merged cells
  received: flat string (156 chars)

figure_references
  expected: 8 embedded figures
  received: 0 (omitted)

document_metadata
  expected: { author, created, lang }
  received: null

4 failed, 0 passed

A document representation built for how AI actually reads.

DocLang defines a structured, machine-readable format for documents of any type. Not a converter. Not an API. A standard — like JSON for data, like HTML for the web — that any tool can implement and any pipeline can consume.

Every component carries a semantic tag, bounding box coordinates, and a reading-order position, all natively encoded in a format LLM tokenizers can parse without translation overhead. A table encodes its full grid structure via OTSL. A heading carries its level and page position. Your model doesn't have to guess. Governance metadata (PII flags, RAG permissions, training constraints) lives inside the document itself, not in a sidecar file.
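To make the idea of a component that carries its own tag, bounding box, and reading order concrete, here is a minimal Python sketch of how a consumer might model such elements in memory. The class name, field names, and the (left, top, right, bottom) reading of the coordinates are all assumptions made for illustration; they are not taken from the DocLang spec. The numbers are borrowed from the sample values shown elsewhere on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    """One document component. All field names here are hypothetical."""
    role: str                        # semantic tag, e.g. "heading" or "table"
    bbox: Tuple[int, int, int, int]  # assumed (left, top, right, bottom)
    order: int                       # position in the encoded reading order
    text: str = ""


def in_reading_order(elements: List[Element]) -> List[Element]:
    """Sort by the encoded reading order; nothing is inferred from layout."""
    return sorted(elements, key=lambda e: e.order)


# Elements may arrive in any storage order; the encoded order decides.
page = [
    Element("table", (48, 88, 420, 168), order=1, text="Quarter | Revenue | YoY"),
    Element("heading", (48, 40, 420, 72), order=0, text="Q3 Financial Summary"),
]
roles = [e.role for e in in_reading_order(page)]
print(roles)
```

The point of the sketch is that the consumer sorts on a stored field instead of guessing order from geometry, which is the contrast the paragraph above draws with conventional parsers.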

The same standard extends beyond text documents. Audio transcripts, images, and video segments encode as first-class elements — speakers, timestamps, and scenes using the same primitives as headings and tables.

What your parser returns

Q3 2024Financial Re port Net Revenue42 M51M39M Figure3.2 omitted author:null

What DocLang returns

J. Smith level="1">Q3 2024 Revenue$42M uri="fig-3.2.png"/>

Six properties. No compromises.

AI-native

Every element maps directly to LLM tokens. No translation layers, no postprocessing, no structural guesswork.

Lossless

Tables keep their full grid structure. Figures keep their position. Reading order is preserved, not inferred.

Expressive

Semantic roles, bounding boxes, document hierarchy — all fully encoded. Your model stops hallucinating structure.

Beyond documents

Audio transcripts, images, video segments — same format, same primitives. Speakers, timestamps, and scenes are native elements.

Unambiguous

One canonical representation per content type. No parser-dependent variance. Every tool produces the same output.

Open

A Joint Development Foundation Projects standard and an LF AI & Data project. Public spec, open working group, no lock-in.

The business context layer for enterprise AI.

AI is only as reliable as the context it receives. DocLang transforms documents into structured business context that can be trusted across AI agents, workflows, and enterprise systems.

Business context, preserved

Structure alone is not enough. DocLang preserves the meaning, relationships, and business context behind your documents so AI systems can act on knowledge, not just content.

Fewer errors, faster decisions

Reliable structure means fewer errors in automated document workflows — fewer manual reviews, lower compliance exposure, and faster time-to-decision.

Audit-ready by default

Compliance metadata travels with the document, not alongside it. Legal and compliance teams define rules once, and every downstream system reads them automatically.

No lock-in, ever

Swap components as the market evolves. Your documents stay portable because the specification is standardized and any vendor can implement it.
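The "Audit-ready by default" claim above (rules defined once, read by every downstream system) might look like the following in a retrieval pipeline. This is a sketch under stated assumptions: the flag names `contains_pii`, `allow_rag`, and `allow_training` are hypothetical stand-ins for the PII flags, RAG permissions, and training constraints the page mentions, and the sample chunks are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Governance:
    """Hypothetical flags; the real DocLang field names may differ."""
    contains_pii: bool = False
    allow_rag: bool = True
    allow_training: bool = True


@dataclass(frozen=True)
class Chunk:
    text: str
    governance: Governance  # travels with the content, not in a sidecar


def rag_eligible(chunks: Iterable[Chunk]) -> List[Chunk]:
    """Keep only chunks whose embedded metadata permits retrieval use."""
    return [
        c for c in chunks
        if c.governance.allow_rag and not c.governance.contains_pii
    ]


# Made-up sample content for illustration only.
chunks = [
    Chunk("Q3 revenue grew 18% year over year.", Governance()),
    Chunk("Employee home address on file.", Governance(contains_pii=True)),
    Chunk("Draft under legal hold.", Governance(allow_rag=False)),
]
eligible = rag_eligible(chunks)
print([c.text for c in eligible])
```

Because the flags sit on the chunk itself, the same filter can run unchanged in any downstream system that receives the document.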

AI-native document format specification

DocLang is a constrained XML format built from the ground up for LLM tokenizers — a 1-to-1 mapping between DocLang tokens and model tokens, with minimal token count. Every component carries semantic role, geometric bounding box, and reading order. Tables use OTSL: 5 structural tokens where HTML needs 28.
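To show why five structural tokens are enough to describe a table grid, here is a small Python sketch that decodes an OTSL-style token stream into cells with row and column spans. It assumes the five-token vocabulary from the published OTSL description (`C`, `L`, `U`, `X`, `NL`); how DocLang actually serializes these tokens is not shown on this page, so the token spellings and the `decode_otsl` helper are illustrative only.

```python
from typing import Dict, List

# Assumed OTSL vocabulary:
#   C  = start of a new cell      L = merge with the cell to the left
#   U  = merge with the cell up   X = merge with both left and up
#   NL = end of row


def decode_otsl(tokens: List[str]) -> List[Dict[str, int]]:
    """Turn a flat OTSL token stream into cells with row/column spans."""
    rows: List[List[str]] = [[]]
    for tok in tokens:
        if tok == "NL":
            rows.append([])
        elif tok in {"C", "L", "U", "X"}:
            rows[-1].append(tok)
        else:
            raise ValueError(f"unknown OTSL token: {tok!r}")
    rows = [r for r in rows if r]  # drop the empty row after a trailing NL

    width = len(rows[0]) if rows else 0
    if any(len(r) != width for r in rows):
        raise ValueError("OTSL rows must all have the same length")

    cells: List[Dict[str, int]] = []
    for r, row in enumerate(rows):
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # L, U and X only extend a cell that started earlier
            colspan = 1
            while c + colspan < width and row[c + colspan] == "L":
                colspan += 1
            rowspan = 1
            while r + rowspan < len(rows) and rows[r + rowspan][c] == "U":
                rowspan += 1
            cells.append(
                {"row": r, "col": c, "rowspan": rowspan, "colspan": colspan}
            )
    return cells


# A 2x3 grid whose first row starts with a cell spanning two columns.
tokens = ["C", "L", "C", "NL",
          "C", "C", "C", "NL"]
cells = decode_otsl(tokens)
print(len(cells), cells[0])
```

Every row has the same number of tokens, so the grid shape is explicit in the stream and merged cells never have to be inferred from attributes such as HTML's `rowspan` and `colspan`.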

Full spec and reference implementation on GitHub →

DocLang quarterly_report.dclg.xml

level="1"> value="48"/> value="40"/> value="420"/> value="72"/> Q3 Financial Summary

value="48"/> value="88"/> value="420"/> value="168"/> QuarterRevenueYoY Q3 2024$42M+18%

Join the working group.

The spec, the reference implementation, and the working group processes are all public. The standard improves when more perspectives are in the room.

