Show HN: Udoc. Dependency-free document extraction in Rust

Dependency-free extraction from documents.

Extract text, tables, JSON, or rendered pages. Available as a CLI, as Python bindings, and as a pure Rust library. No external parsers, libraries, or system packages are required. Provides hooks for OCR, layout detection, and entity extraction. Dual-licensed under MIT / Apache-2.0.

Supports PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, ODT, ODS, ODP, RTF, and Markdown.

Try it out using uv, no install required:

curl -sL https://arxiv.org/pdf/1706.03762 \
  | uvx udoc - | grep -A 18 '^Abstract'

Installation

# uv
uv add udoc

# pip
pip install udoc

# cargo (coming soon)

To build from source, see Compiling from source.

Highlights

One Document model across formats. A content spine of Block and Inline nodes, plus optional presentation, relationships, and interactions overlays. Disable any overlay via Config.

Legacy binary Office. Native parsers for .doc, .xls, and .ppt. Per-format details in the format guides.

Streaming page-by-page. The Extractor defers per-page work. A 10 GB PDF does not have to fit in memory.

Typed diagnostics. Recoverable issues become structured warnings filterable by kind. Examples: font fallbacks, malformed xref, stream-length mismatches.

Hooks for OCR, layout, and annotation. JSONL protocol for Tesseract, cloud OCR APIs, DocLayout-YOLO, GLM-OCR, vision-language models, NER, or any subprocess that reads JSON line-by-line.

LLM tool use. The Agent instructions page is a paste-into-context description of udoc's CLI for assistants.
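The hooks highlight describes the protocol only as JSONL exchanged with a subprocess. As a rough sketch of what a line-oriented hook could look like, the following reads one JSON object per line and writes one JSON object per line. The field names (`page`, `text`, `annotations`) are hypothetical placeholders chosen for illustration, not udoc's actual hook schema; the format guides and reference are the place to check the real record shape.

```python
import json
import sys
from typing import Iterable, TextIO


def handle_request(request: dict) -> dict:
    """Build one response record for one request record.

    The field names used here ("page", "text", "annotations") are
    hypothetical placeholders, not udoc's actual hook schema.
    """
    text = request.get("text", "")
    return {
        "page": request.get("page"),
        # Stand-in for real work such as OCR or NER: count the words.
        "annotations": {"word_count": len(text.split())},
    }


def run_hook(lines: Iterable[str], out: TextIO) -> None:
    """Read one JSON object per line and write one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        response = handle_request(json.loads(line))
        out.write(json.dumps(response) + "\n")
        out.flush()  # flush per record so the parent process is not left waiting
```

A script used as a hook would call `run_hook(sys.stdin, sys.stdout)` from its entry point. Flushing after every record matters for any line-by-line protocol, because a buffered pipe can otherwise stall both sides.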

Usage

CLI

udoc paper.pdf                      # text to stdout
udoc -j paper.pdf                   # full document as JSON
udoc -J paper.pdf                   # streaming JSONL (one record per page)
udoc -t spreadsheet.xlsx            # tables only as TSV
udoc -p 1-5,10 paper.pdf            # page range
udoc render paper.pdf -o ./pages    # rasterise PDF pages to PNG
cat paper.pdf | udoc -              # read from stdin

A few real-world piping recipes:

curl -sL https://arxiv.org/pdf/1706.03762 | udoc - | head -40
udoc paper.pdf | grep -i 'attention'
udoc -J docs/*.pdf | jq '.metadata.title'

Plain text on stdout. Structured output on flags. Stderr is silent unless you pass -v. The full flag list lives in the CLI reference.
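The streaming JSONL mode (`udoc -J`) can also be consumed from a script instead of `jq`. The sketch below assumes only what is stated above: that `udoc -J <file>` prints one JSON record per line on stdout, and that the `udoc` executable is on `PATH`. The helper names `iter_records` and `stream_udoc` are made up for this example, and the keys inside each record are deliberately left uninspected because their shape is not described here.

```python
import json
import subprocess
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of JSONL input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def stream_udoc(path: str) -> Iterator[dict]:
    """Run `udoc -J <path>` and yield records as they arrive.

    Assumes the `udoc` executable is on PATH. The shape of each record
    is not specified here, so callers should inspect the keys they get.
    """
    with subprocess.Popen(
        ["udoc", "-J", path], stdout=subprocess.PIPE, text=True
    ) as proc:
        assert proc.stdout is not None
        # Records are parsed lazily, one line at a time, as udoc emits them.
        yield from iter_records(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError(f"udoc exited with status {proc.returncode}")
```

Something like `for record in stream_udoc("paper.pdf"): print(sorted(record))` would list the keys of each record, which is a reasonable first step before relying on any particular field.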

Python

import udoc

# One-shot extraction. Format detected from magic bytes.
doc = udoc.extract("paper.pdf")
print(doc.metadata.title)
for block in doc.blocks():
    print(block.text)

# Stream page by page; large documents do not have to fit in memory.
with udoc.stream("large.pdf") as ext:
    for i in range(len(ext)):
        print(f"page {i}: {ext.page_text(i)[:80]}")

# In-memory bytes with options.
with open("encrypted.pdf", "rb") as f:
    doc = udoc.extract_bytes(f.read(), password="secret")

PDF table detection and reading order are heuristic. Born-digital documents with clean ruling and standard column flow extract cleanly out of the box; the PDF format guide covers the failure modes and when to attach a layout-detection or OCR hook.

The Guide walks through configuration, overlays, diagnostics, chunking, and batch processing. The Python Library reference lists every function, class, and exception.
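The one-shot example notes that the format is detected from magic bytes. To illustrate the general idea (this is not udoc's own detection code), here is a minimal sketch that maps well-known leading byte signatures to a coarse container type. The names `SIGNATURES` and `sniff_container` are invented for this example.

```python
# Leading byte signatures for the container types behind the supported formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # DOCX, XLSX, PPTX, ODT, ODS, ODP are ZIP packages
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC, XLS, PPT
    (b"{\\rtf", "rtf"),
]


def sniff_container(data: bytes) -> str:
    """Return a coarse container label from the first bytes, or "unknown"."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

This is deliberately coarse. Telling DOCX from XLSX or ODT requires looking inside the ZIP package, telling DOC from XLS requires reading the OLE2 directory, and Markdown has no signature at all, so a real detector needs further steps beyond this prefix check.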

Rust

let doc = udoc::extract("paper.pdf")?;
println!("{:?}", doc.metadata.title);
for block in &doc.content {
    println!("{}", block.text());
}

The Rust facade mirrors the Python shape. Document is udoc_core::document::Document; iteration is by direct field access (doc.content, doc.metadata, doc.images). The Rust Library reference covers the facade, the per-format backends, configuration presets, diagnostics, and the trait that backends implement.
