protocols for transactional usage of object storage
building correct applications w/o external coordination
almog gavra
May 18, 2026
This edition of Bits & Pages covers the design patterns that allow systems to use object storage correctly for online transactional (OLTP) use cases.

This post is motivated by SlateDB’s transactional write module, which implements some of these protocols for production use.

object storage primitives
The protocols outlined in the rest of this post depend on a small set of APIs that all common object storage systems provide.

write primitives
On the write side we’re given three primitives that we can use:
Unconditional Atomic Writes: A normal PUT writes data atomically. It does not prevent races (writer A and writer B can write to the same path with no guarantee as to whose write wins), but it does guarantee that the stored data comes entirely from one write, with no interleaving.

In the context of databases, this is a great way to model writing append-only files with unique IDs, which inherently avoid conflicts. It does not help when writes need to coordinate (though you do get read-your-writes, which helps guarantee correctness during fail-overs). For coordination, the next two primitives are used.

Conditional Writes: There are two types of conditional writes that are subtly different but serve the same purpose. PUT If-None-Match and PUT If-Match both use compare-and-set operations on the object store’s metadata service to resolve conflicts. The former allows a write only if no file already exists at that path. The latter allows a write only if the existing file’s ETag (a content hash, or a generation ID in some implementations) matches the one you pass in.

These are typically used when writing and updating metadata at published, well-known file paths, because they let you detect when writers have conflicted on metadata.

Empirically, there doesn’t seem to be much of a performance difference between a conditional write and an unconditional atomic write.
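To make the two conditional writes concrete, here is a minimal sketch against a toy in-memory store. Everything in it is an assumption made for illustration: `InMemoryObjectStore`, `PreconditionFailed`, and the `if_none_match` / `if_match` keyword arguments are hypothetical names, not any real client library's API, and the store uses a generation counter as its tag rather than a content hash.

```python
import threading


class PreconditionFailed(Exception):
    """Stands in for the HTTP 412 a real object store would return."""


class InMemoryObjectStore:
    """Toy stand-in for an object store (hypothetical, for illustration only)."""

    def __init__(self):
        self._objects = {}             # path -> (etag, body)
        self._generation = 0
        self._lock = threading.Lock()  # plays the role of the metadata service

    def put(self, path, body, if_none_match=None, if_match=None):
        with self._lock:
            current = self._objects.get(path)
            # PUT If-None-Match: * -> only succeed if nothing exists at path.
            if if_none_match == "*" and current is not None:
                raise PreconditionFailed(f"{path} already exists")
            # PUT If-Match: <etag> -> only succeed if the stored tag matches.
            if if_match is not None and (current is None or current[0] != if_match):
                raise PreconditionFailed(f"{path} changed since it was read")
            self._generation += 1
            etag = f"g{self._generation}"
            self._objects[path] = (etag, body)
            return etag

    def get(self, path):
        with self._lock:
            return self._objects[path]


store = InMemoryObjectStore()

# Create the metadata file only if nobody else has created it yet.
tag_v1 = store.put("manifest", b"v1", if_none_match="*")
try:
    store.put("manifest", b"other", if_none_match="*")
    create_conflict = False
except PreconditionFailed:
    create_conflict = True

# Update it only if it is still the version we read.
tag_v2 = store.put("manifest", b"v2", if_match=tag_v1)
try:
    store.put("manifest", b"v3-stale", if_match=tag_v1)  # stale tag
    update_conflict = False
except PreconditionFailed:
    update_conflict = True
```

In both failure cases the losing writer learns about the conflict from the write itself, which is what makes these calls usable for metadata at well-known paths.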
Conditional and unconditional writes are billed at the same per-request rate, so feel free to use the conditional versions whenever you need to.

read primitives
On the read side, we use three primitives as well:
Atomic Reads: A normal GET request from object storage returns a fully consistent snapshot of the file. If a racing write to the same file happens, we never see that write partially applied.

Conditional Reads: Object stores also allow you to use GET If-None-Match to fetch a file only if its ETag doesn’t match the one you already have. This lets the store serve the request with a metadata-only fast path, and lets you avoid re-fetching locally cached data that hasn’t changed.

Consistent Listing: Strongly consistent LIST requests mean that as soon as a PUT returns 200, the result of that write is visible to subsequent LIST requests. This is typically used for metadata discovery, and it prevents expensive conflicts from surfacing only later, when you attempt to read the files.

Reads are where it gets interesting from a performance angle. A GET If-None-Match quickly returns a 304 (Not Modified) if the data has not changed since the last fetch, which makes repeated fetches faster. This opens the door to protocols that read the same file frequently.
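The conditional read path can be sketched the same way. This is a toy model under stated assumptions: `ToyStore`, `Response`, and `CachingReader` are hypothetical names invented for the example, and the store hands out a generation-based tag on every PUT. The reader keeps the last tag and body it saw and re-downloads the body only when the store answers 200 rather than 304.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Response:
    status: int            # 200 (body returned) or 304 (not modified)
    etag: str
    body: Optional[bytes]  # None on a 304


class ToyStore:
    """Hypothetical in-memory store that supports GET If-None-Match."""

    def __init__(self):
        self._objects: Dict[str, Tuple[str, bytes]] = {}
        self._generation = 0

    def put(self, path: str, body: bytes) -> str:
        self._generation += 1
        etag = f"g{self._generation}"
        self._objects[path] = (etag, body)
        return etag

    def get(self, path: str, if_none_match: Optional[str] = None) -> Response:
        etag, body = self._objects[path]
        if if_none_match == etag:
            return Response(304, etag, None)  # metadata-only fast path
        return Response(200, etag, body)


class CachingReader:
    """Re-reads a path cheaply by sending the tag of its cached copy."""

    def __init__(self, store: ToyStore):
        self._store = store
        self._cache: Dict[str, Tuple[str, bytes]] = {}
        self.bodies_fetched = 0

    def read(self, path: str) -> bytes:
        cached = self._cache.get(path)
        known_tag = cached[0] if cached else None
        resp = self._store.get(path, if_none_match=known_tag)
        if resp.status == 304:
            return cached[1]  # unchanged: serve the local copy
        self.bodies_fetched += 1
        self._cache[path] = (resp.etag, resp.body)
        return resp.body


store = ToyStore()
store.put("manifest", b"v1")
reader = CachingReader(store)

first = reader.read("manifest")   # 200: body downloaded
second = reader.read("manifest")  # 304: served from the local cache
store.put("manifest", b"v2")
third = reader.read("manifest")   # 200: a new version is picked up
```

Three reads cost three requests but only two body transfers, which is the property that makes frequent polling of the same file affordable.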
The other consideration is that the different read operations have different costs. The S3 values are below; the interesting one is that a LIST costs 12.5x as much as a GET.

| Operation | $/1K requests | Body charge |
| --- | --- | --- |
| GET (unconditional) | $0.0004 (Class B) | $0.09/GB egress (internet); free same-region |
| GET If-None-Match | $0.0004 (Class B) | $0 on 304; $0.09/GB on 200 |
| LIST | $0.005 (Class A) | response payload billed as egress (small) |

protocols for correct transactional usage
The following section assumes a setup where you have applications interacting with databases that are backed by object storage alone. The database will implement correct transactional protocols under different conflict scenarios. We define “correct” as the ability to implement a serializable history of our data system.

When reasoning about these protocols, you should first evaluate whether or not they are safe. Crashes and failures should never result in inconsistent state. The second evaluation criterion is latency and throughput in steady state as well as under contention, when many writers are attempting to access the same files.

baseline protocol
We’ll start with a simple, but correct, database implementation using object storage. I call this the “baseline” protocol: every write goes directly to object storage using an atomic PUT.
This isn’t an interesting protocol, but it works. If at...