Benchmark object storage in objects/sec, not GB/s

Object storage is becoming primary storage. Stop benchmarking it only in GB/s. | FractalBits Blog

Object storage used to be the cold tier: backups, logs, the data lake you scanned once in a while. Two things mattered, price per gigabyte and how fast you could read it all back, and nothing that needed to be fast ran on it. That’s changing. Databases, streaming systems, and training pipelines now keep their live data in object storage, not just the backup copy. So the usual way of benchmarking it no longer tells you enough.

What’s actually happening

Streaming went first. The “diskless” Kafka systems (WarpStream, AutoMQ, Aiven Inkless, StreamNative Ursa, and now KIP-1150 in upstream Kafka) write the log straight to S3 instead of copying it across brokers, so the broker’s local disk is no longer the source of truth. Confluent bought WarpStream, which tells you where this is going.

Databases are doing the same. ClickHouse Cloud keeps its primary data in S3. Neon runs Postgres with its pages on S3. turbopuffer built search that lives on object storage, and Cursor and Notion use it. SlateDB is an LSM tree that sits in a bucket. Some now argue for building the database on object storage from the start.

Then there’s AI. A training run reads millions of small files. AWS built vector storage into S3 for semantic search, RAG, and agent memory, where each vector is a small record and the workload piles up a lot of them. Either way: huge numbers of small operations, exactly what a throughput number hides.

Why anyone would do this

The appeal is obvious if you’ve run stateful systems in the cloud. Once the object store holds the durable copy, your compute nodes don’t have to keep any state: they start in seconds, die without taking data with them, and you keep one copy instead of moving it around with ETL. It’s cheaper too: ClickHouse reports up to 65% lower TCO for the cached data it keeps this way.

What held it back was latency. Object storage is slow, tens to hundreds of milliseconds, fine for a backup but useless when something is waiting on it. The usual fix is a cache: turbopuffer puts an SSD cache in front of S3 because a cold read is about 870ms versus 14ms warm, but that cache is a whole extra layer to build and run. Or you just live with the latency, like WarpStream’s 400 to 600ms p99 writes on plain S3.

The latency has come down, though. S3 Express One Zone now answers in single-digit milliseconds, and newer object stores go lower still, so “just put it in S3” can work without bolting on a cache. Either way, you’re now counting on the object store to be fast.

And when a user is waiting, latency isn’t one number per request. A single action turns into many object reads (a query walking index and data blocks, a consumer pulling a run of segments), and it finishes only when the last one comes back. That’s where the rare slow reads stop being a footnote. The Tail at Scale put numbers on it: say a read is almost always quick, around 10ms, but one time in a hundred it is slow, a full second. One slow read you’d barely notice. But spread one action across 100 reads and the chance that all of them are quick is 0.99^100, about 37%, so the whole action is slow about 63% of the time. A slow case that’s rare for a single read is the common case for the request, and a throughput average hides it.
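The fan-out arithmetic is easy to check yourself. Here is a minimal sketch, assuming the reads are independent and the action waits for every one of them; the function name `p_action_slow` is made up for illustration, and the 1% slow rate is the example figure from the paragraph above, not a measurement of any real store.

```python
def p_action_slow(p_slow_read: float, fanout: int) -> float:
    """Chance that at least one of `fanout` independent reads is slow.

    Assumption: reads are independent and the action finishes only
    when the last read returns, so one slow read makes the action slow.
    """
    p_all_quick = (1.0 - p_slow_read) ** fanout
    return 1.0 - p_all_quick


if __name__ == "__main__":
    # 1% of reads are slow; vary how many reads one action needs.
    for fanout in (1, 10, 100):
        share = p_action_slow(0.01, fanout)
        print(f"{fanout:>3} reads -> action slow {share:.1%} of the time")
    #   1 reads -> action slow 1.0% of the time
    #  10 reads -> action slow 9.6% of the time
    # 100 reads -> action slow 63.4% of the time
```

Real reads are rarely fully independent (a slow node or a hot shard slows many at once), so treat this as the shape of the effect rather than a prediction.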

GB/s measures the easy thing

GB/s became the headline back when object storage was a cold tier and throughput was all it needed to prove. The habit stuck, and not just for object stores. MinIO, an S3-compatible object store, leads with GiB/s; the wider storage-for-AI field does the same, the caching layer Alluxio with 11.5 GiB/s per node and the file system WEKA with a terabyte a second.

But throughput is just object rate times object size, so big objects make a big GB/s easy. MinIO’s docs run 64 MiB GETs at 3.2 GiB/s, which is only 51 objects a second. Hit the same 3.2 GiB/s with 4 KiB objects and you need about 840,000 a second, sixteen thousand times the metadata work behind the same headline.
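To see how far apart the two cases are, divide the throughput by the object size. This is a small sketch using only the figures quoted above (3.2 GiB/s, 64 MiB, 4 KiB); the helper name `objects_per_second` is hypothetical, and the numbers say nothing about what any particular store can actually sustain.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def objects_per_second(throughput_bytes_per_s: float, object_size_bytes: int) -> float:
    """Object rate implied by a throughput figure: rate = throughput / size."""
    return throughput_bytes_per_s / object_size_bytes


throughput = 3.2 * GIB  # the headline figure from the paragraph above
large = objects_per_second(throughput, 64 * MIB)
small = objects_per_second(throughput, 4 * KIB)

if __name__ == "__main__":
    print(f"64 MiB objects: {large:,.0f} objects/s")
    print(f"4 KiB objects:  {small:,.0f} objects/s")
    print(f"ratio:          {small / large:,.0f}x")
    # 64 MiB objects: 51 objects/s
    # 4 KiB objects:  838,861 objects/s
    # ratio:          16,384x
```

The ratio is just 64 MiB divided by 4 KiB, which is why the same headline number can hide a request rate four orders of magnitude apart.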

And small objects aren’t a corner case. Some workloads are full of them: a training run reads millions of small files, each one an object, and most buckets fill up with small files anyway, logs, images, JSON, model shards. Other systems try to dodge small objects. They batch writes into bigger objects, or put a cache in front, or add a tier. But each one is a layer you have to build, run, and pay for. And none of them make the small operations go away. They just hide how slow the store would be if you hit it directly. So the real question is simple: how fast is the store at small operations? That tells you how many of those extra layers you really need.

That’s why counting bytes and counting requests give different answers. Most of the bytes sit in the biggest objects; most of the requests go to the smallest. So in a bucket with a mix of sizes, a few big objects set the GB/s number, while the small ones make up the work the store actually does. A good GB/s says the disks and network are healthy. It says...
