Big Data File Formats - luminousmen
Big Data File Formats<br>Four ways to put bytes on disk, and three of them are mistakes
luminousmen<br>Jun 30, 2026
The file format wars are over. Parquet won. You can stop reading.<br>Fine, don’t — because I keep watching people who “know” Parquet won use it in ways that throw away the whole reason it won. And because the argument didn’t actually end. It moved one level up the stack, to where half of you are now quietly unsure whether Iceberg is a file format. (It’s not. We’ll get there.)<br>I wrote the first version of this post years ago, when this was still a live fight — real blog wars, real benchmarks, people with strong opinions about ORC. It needs an update. So here’s what these formats are, why they behave the way they do, and what changed.<br>One number first, to set the stakes. I once watched a one-line change — gzipped CSV to Parquet with zstd — cut a team’s S3 bill by 70% and drop query times by an order of magnitude. Nobody got promoted for it. Storage formats are plumbing and nobody thanks the plumber, but pick wrong and you pay for it on every query you ever run. A format decides three things for you:<br>how much space your data eats,
how fast you write it,
how fast you read it back.
You don’t get all three.<br>CSV
CSV will outlive us all. (c)
It’s plain text, every system can produce it, every system can read it, and when something breaks you can open the file and look at it with your eyes. That’s the whole appeal. For shoving a few thousand rows between two systems that share nothing else, CSV is fine.<br>For anything else it’s a nightmare.<br>CSV has no types — everything is a string until something downstream bets on what it means. And “bets” is the right word: Spark’s schema inference reads your data and guesses. I lost half a day once to a column of zip codes that inference decided were integers, dropped the leading zeros, and quietly broke a join three steps later. No error, no warning — just wrong numbers that three teams trusted for a week.<br># gambling<br>df = spark.read.csv("data.csv", header=True, inferSchema=True)
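# engineering<br># schema accepts a StructType or a DDL string; the column names here are illustrative<br>explicit_schema = "zip_code STRING, amount DOUBLE"<br>df = spark.read.csv("data.csv", header=True, schema=explicit_schema)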
💡 inferSchema isn’t free. It makes Spark read the whole file twice — one full pass to work out the types, another to actually load the data. On a big CSV that’s double the I/O before you’ve done anything useful. Pass an explicit schema and the first pass disappears.
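To make the zip-code failure concrete, here is a small sketch in plain Python, no Spark required. The sample rows are made up, and `naive_infer` is a hypothetical stand-in for type inference, not Spark's actual algorithm. It only shows how "this looks like an integer" quietly destroys leading zeros, and why a join against the string version then finds nothing.

```python
import csv
import io

# Made-up sample data: zip codes with leading zeros.
RAW = "zip_code,city\n02134,Boston\n10001,New York\n00501,Holtsville\n"


def naive_infer(value: str):
    """Hypothetical stand-in for schema inference: if it parses as an int, call it an int."""
    try:
        return int(value)
    except ValueError:
        return value


def read_rows(text: str, infer: bool):
    """Parse CSV text; optionally run every field through naive_infer."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if infer:
        rows = [{key: naive_infer(val) for key, val in row.items()} for row in rows]
    return rows


inferred = read_rows(RAW, infer=True)   # types guessed from the data
explicit = read_rows(RAW, infer=False)  # every field stays a string

inferred_zips = [row["zip_code"] for row in inferred]
explicit_zips = [row["zip_code"] for row in explicit]

# Join keys that survive a round trip back to text after inference.
matching = set(str(z) for z in inferred_zips) & set(explicit_zips)

print(inferred_zips)  # [2134, 10001, 501]
print(explicit_zips)  # ['02134', '10001', '00501']
print(matching)       # {'10001'}
```

Two of three keys no longer match, and nothing raised an exception along the way. That is the shape of the bug: the data is still well-formed, it is just no longer the data you were given.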
There’s no real standard either. Commas, semicolons, tabs, quotes inside quotes, newlines inside fields, a dozen escaping conventions that all disagree with each other. “CSV” isn’t a format, it’s a family of formats that happen to share a file extension. And because it’s row-based text, an analytical query has to read and parse every byte even when you wanted one column out of fifty.<br>The one thing CSV does well is write fast — there’s nothing to encode. Which is exactly why it keeps turning up where it shouldn’t. It’s the path of least resistance, and least resistance is how most bad data decisions get made.<br>JSON
JSON does what CSV can’t — nested structures, arrays, types that mostly survive the round trip. It’s the native language of REST APIs and event streams, and as a format for moving data over a wire, it earned its spot. As a format for storing data, it’s ridiculous.<br>Every row drags the full set of keys along with it. Store a billion events and you’ve written the string "user_id": a billion times. Compression hides some of that, but you’re still parsing text, still scanning whole rows, still working without indexes or statistics. If you’re stuck with JSON, at least use JSONL — one object per line — so Spark (or whatever else) can split the file and read it in parallel.<br>💡 This is the whole reason JSONL exists. A regular JSON file is one big array — [ {...}, {...}, ... ] — and you can’t split it, because a parser has to read the opening bracket and everything after it as a single document. One core, whole file, same trap as .csv.gz. JSONL drops the array and puts one object per line, so every newline is a clean split point and the cluster can finally parallelize. Same data, one structural change, night-and-day read performance.
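A minimal sketch of why the newline matters, in plain Python with made-up events. `split_jsonl` and `parse_chunk` are hypothetical helpers, not how Spark actually computes its input splits. The point is only that each JSONL chunk parses on its own, while a slice of a JSON array does not parse at all.

```python
import json

# Made-up events for illustration.
events = [{"user_id": i, "action": "click"} for i in range(6)]

as_array = json.dumps(events)                        # one document: [{...}, {...}, ...]
as_jsonl = "\n".join(json.dumps(e) for e in events)  # one object per line


def split_jsonl(text: str, parts: int):
    """Hypothetical splitter: cut JSONL text into chunks of whole lines."""
    lines = text.split("\n")
    size = -(-len(lines) // parts)  # ceiling division
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def parse_chunk(chunk: str):
    """Parse one chunk in isolation, the way an independent worker would."""
    return [json.loads(line) for line in chunk.split("\n") if line]


chunks = split_jsonl(as_jsonl, parts=3)
parsed = [row for chunk in chunks for row in parse_chunk(chunk)]

# Cut the array form in the middle: the fragment is not valid JSON,
# so no worker can start on it without the rest of the document.
half_of_array = as_array[: len(as_array) // 2]
try:
    json.loads(half_of_array)
    array_half_parses = True
except json.JSONDecodeError:
    array_half_parses = False

print(len(chunks))        # 3
print(parsed == events)   # True
print(array_half_parses)  # False
```

Real engines split on byte ranges and then scan forward to the next newline rather than counting lines up front, but the property they rely on is the same: in JSONL, a newline is always a record boundary.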
JSON shows up at the edge because that’s what the service upstream emits. Fine. It should die at the edge. Land it, parse it, write Parquet. When raw JSON travels three hops into your warehouse, someone is paying to parse the same text over and over, and that someone is you.<br>Avro
Avro is a row format someone actually sat down and designed, instead of one we inherited from spreadsheets. Binary, compact, splittable, and it carries its own schema — schema in JSON, data in binary, any reader can decode the file without knowing anything in advance.<br>Its best trick is schema evolution. Producer adds a field, old consumers keep working. Consumer expects a field with a default, old files still read fine. Backward and forward compatibility, with rules...