There are only two file formats, txt and zip (explainer)


There are only two file formats: zip and txt - Parse for Artisans


Peter Van Dijck
parseforartisans.com/blog


JSON? Text.
EPUB? Zip.
CSV? Text.
.docx? Zip.
SVG? Text.
.jar? Zip.
YAML? Text.
.apk? Zip.

Strip the branding off a file and what you have left is usually one of two things: a text file you can open in any editor, or a ZIP archive full of smaller files. And those smaller files are very often text.

Where the line comes from

This is old programmer folklore. The usual phrasing is "there are only two file formats worth using: text files and zips of text files," and it has floated around forums for years.[1] Every so often someone reframes it as a punchy list and it goes viral again,[2] usually with a Calvin and Hobbes panel attached.[3]

The text half

A lot of what we treat as separate formats are plain text with rules bolted on.

JSON is defined by its spec as "a text format for the serialization of structured data."[4]

CSV is tabular data stored "in plain text," with its own registered text/csv media type.[5]

SVG is an XML vocabulary, so it is an image file you can read line by line.[6]

Add YAML, TOML, INI, Markdown, HTML, and every source file you have ever written.

Some of these hide better than others. A Jupyter notebook (.ipynb) is JSON. So are GeoJSON maps and glTF 3D scenes.

Surprisingly, also ZIP

A pile of formats that look proprietary are ZIP archives wearing a different extension. Change the extension of one to .zip, unzip it, and read what is inside.

Word, Excel, and PowerPoint files (.docx, .xlsx, .pptx) are ZIP packages defined by the Open Packaging Conventions in ECMA-376.[7]

OpenDocument files (.odt, .ods, .odp) are too. The OASIS spec states it plainly: "An OpenDocument Package shall be a Zip file."[8]

EPUB books are a ZIP container described by the W3C.[9]

A .jar is "a file format based on the popular ZIP file format," in Oracle's own words.[10]

Android apps (.apk) build on that JAR-and-ZIP structure, and iOS apps (.ipa) are ZIP archives of the app bundle.[11][12]

Python wheels (.whl) are "a ZIP-format archive with a specially formatted file name," per PEP 427.[13]

Apple's AR format .usdz is a ZIP too, with one twist: it is deliberately uncompressed so apps can read assets in place without unpacking.[14]
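To make the "it is just a ZIP" claim concrete, here is a minimal Python sketch using only the standard library. It does not open a real Word file. The `build_fake_docx` helper is a hypothetical stand-in that writes a tiny archive with two entry names a .docx package typically contains, and `list_entries` is an illustrative name as well. The point is only that a generic ZIP reader neither knows nor cares what extension the file had.

```python
import io
import zipfile


def build_fake_docx() -> bytes:
    """Build a tiny in-memory ZIP that mimics the layout of a .docx package.

    This is a simplified stand-in for illustration, not a valid Word document.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<w:document>Hello</w:document>")
    return buf.getvalue()


def list_entries(data: bytes) -> list[str]:
    """Return the entry names of a ZIP-based file, whatever its extension."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


package = build_fake_docx()

# A generic ZIP check succeeds even though nothing here is named ".zip".
print(zipfile.is_zipfile(io.BytesIO(package)))

# The "document" is a folder tree of small, mostly text, files.
for name in list_entries(package):
    print(name)
```

With a real file on disk, `zipfile.ZipFile("report.docx").namelist()` should behave the same way, where `report.docx` is whatever document you have to hand.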

Wait what?

The same is true of NuGet packages, VS Code extensions, Windows .appx installers, Apple iWork documents, Google Earth .kmz files, comic book .cbz archives, and 3D-printing .3mf models.

PK, as in Phil Katz

Open a .docx, an .epub, or a .jar in a text editor and the first two characters are PK. Those are the initials of Phil Katz, who created the ZIP format at PKWARE and released it in 1989.[15] He died in 2000, after years of struggle with alcoholism.[16] His initials now sit at the front of a large share of the files on every computer, phone, and e-reader on the planet.

There is a structural reason ZIP became the default container. A ZIP's index (the central directory) lives at the end of the file, so a reader starts from the back. That leaves the front of the file free, which lets a format pin one small uncompressed file at the very start for fast identification. EPUB and OpenDocument both use this trick: the first entry is an uncompressed mimetype file, so a reader can tell what a document is without unpacking the whole archive.[9][8]

What does that look like?

You can watch both ideas at once in the first bytes of an EPUB. The PK signature is the first thing in the file, and because the mimetype is stored uncompressed, it sits there in plain text right after it:

    $ xxd book.epub | head -3
    00000000: 504b 0304 0a00 0000 0000 8861 d55c 6f61  PK.........a.\oa
    00000010: ab2c 1400 0000 1400 0000 0800 0000 6d69  .,............mi
    00000020: 6d65 7479 7065 6170 706c 6963 6174 696f  metypeapplicatio

Read the right-hand column: PK, then mimetype, then applicatio(n/epub+zip).

Not all files

Plenty of formats are genuinely neither: PNG, JPEG, and GIF images; MP3, MP4, and WebM media; SQLite databases; Protocol Buffers; Parquet; WebAssembly modules; fonts. These are binary formats with their own layouts, and no amount of renaming turns them into text or a ZIP.

A few sit in the cracks. An .exe is normally a compiled binary, but a self-extracting archive is a valid .exe and a valid .zip at the same time: the executable loader reads from the front, the ZIP reader finds its index at the end, and so the same bytes can satisfy both.[15][17] PDF is the mirror image. It is a mostly text-based structure whose embedded streams are usually compressed with DEFLATE, the same algorithm ZIP uses.[18] PDF is the rare case of a text file containing a zip rather than the other way round.

But for the formats most of us create and parse on a normal day, the aphorism holds...
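The two structural points above, the mimetype pinned at the front and the index at the end, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real EPUB toolchain: `build_epub_like` and `sniff_mimetype` are hypothetical helper names, the archive holds only a mimetype entry plus one dummy file, and the "executable stub" is 64 placeholder bytes beginning with MZ rather than a working program. The offsets in `sniff_mimetype` follow the 30-byte ZIP local file header layout.

```python
import io
import struct
import zipfile
from typing import Optional


def build_epub_like() -> bytes:
    """Build an in-memory ZIP whose first entry is a stored 'mimetype' file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Stored (uncompressed) so the value is readable in the raw bytes.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.txt", "hello",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def sniff_mimetype(data: bytes) -> Optional[str]:
    """Read the mimetype from the front of the file without a ZIP library.

    Returns None unless the file starts with a local file header whose
    entry is an uncompressed file named 'mimetype'.
    """
    if data[:4] != b"PK\x03\x04":
        return None
    (method,) = struct.unpack_from("<H", data, 8)        # compression method
    (size,) = struct.unpack_from("<I", data, 18)         # compressed size
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    name = data[30:30 + name_len]
    if name != b"mimetype" or method != 0:
        return None
    start = 30 + name_len + extra_len
    return data[start:start + size].decode("ascii")


epub = build_epub_like()
print(sniff_mimetype(epub))

# Prepend placeholder bytes, as a self-extracting archive prepends a program.
stub = b"MZ" + b"\x00" * 62  # assumption: stand-in bytes, not a real executable
polyglot = stub + epub

# Sniffing from the front no longer sees a ZIP...
print(sniff_mimetype(polyglot))

# ...but a reader that locates the index from the end still opens it.
with zipfile.ZipFile(io.BytesIO(polyglot)) as zf:
    names = zf.namelist()
    mimetype = zf.read("mimetype")
print(names, mimetype)
```

The design point is that the two readers look at opposite ends of the file: front-sniffing is cheap but breaks as soon as anything is prepended, while the end-of-file index survives it.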
