Building a serializable database in Rust, and measuring what it costs

YahyaEhsansub2 pts0 comments

Building crackeddb, Yahya Ehsan

Write skew<br>Two transactions read a shared snapshot and write disjoint keys, forming a read-write cycle. RocksDB commits both and violates the invariant; crackeddb detects the cycle and aborts one, holding the invariant.

FIG.01

SHARED SNAPSHOT · t0<br>alice = on<br>bob = on<br>INVARIANT — AT LEAST ONE ON

READ

T1<br>reads alice, bob<br>writes alice = off

T2<br>reads alice, bob<br>writes bob = off

rw<br>rw<br>RW CYCLE — SSI ABORTS ONE

COMMIT

ROCKSDB<br>VIOLATED<br>alice=off, bob=off

CRACKEDDB<br>HELD<br>alice=off, bob=on

DIAGRAM — WRITE SKEW. TWO TRANSACTIONS READ A SHARED SNAPSHOT AND WRITE DISJOINT KEYS.<br>THE RW CYCLE IS WHAT CRACKEDDB DETECTS AND ROCKSDB DOES NOT.

two doctors are on call. the rule is that at least one of them has to stay on duty. each doctor independently checks that the other is still on, then takes themselves off. run that concurrently against most databases and you can end up with nobody on call. here it is against two:

target/release/write_skew_demostdout

write skew demo (concurrent)<br>initial: alice=on, bob=on<br>T1 (thread A): read both, write alice=off<br>T2 (thread B): read both, write bob=off<br>invariant: at least one stays on

crackeddb T1 ssi_abort=false T2 ssi_abort=true<br>crackeddb alice=off bob=on held<br>rocksdb T1=ok T2=ok<br>rocksdb alice=off bob=off VIOLATED

same scenario, same two transactions, two storage engines. t1 reads both keys and writes alice=off. t2 reads both keys and writes bob=off. the writes touch different keys, the reads overlap. rocksdb commits both and lands on alice=off, bob=off, a state no serial order of the two transactions could produce. crackeddb aborts one and the invariant holds. which one aborts changes between runs because the two threads race. that the invariant holds does not. five of five. source is in crates/bench/src/bin/write_skew_demo.rs.

that rocksdb commits both is not a bug. write skew is a known anomaly under snapshot isolation, and rocksdb's transactiondb does not promise to stop it. its pessimistic locking takes write locks on writes, not on reads, so two transactions that read shared keys and write disjoint ones never conflict in its model. the fix on rocksdb's side is to call get_for_update and take the read locks by hand. that is a real and defensible choice. but the demo shows the part worth sitting with: the anomaly was detectable the whole time. the database declined to detect it and made that the caller's problem.

that decline is the most common move in software. the flaky test you rerun until it passes. the heisenbug closed as cannot reproduce. the shrug that distributed systems are just like that. very little of that is fundamental randomness. most of it is manufactured, information some lower layer had and chose not to surface. crackeddb is a small argument the other way. it detects the anomaly rather than leaving it to the caller, it makes its own bugs reproduce byte for byte from a seed rather than leaving you to chase them, and it pays for both in throughput you can measure. the rest of this post is how it does that, and what the bill comes to.

01 — what it isthe shape, and the one thing it adds

crackeddb is a single node embedded database, roughly the shape of sqlite: it links into your process, it stores key value pairs, there is no server and no network. what it adds to that shape is serializable transactions, the strongest isolation level, enforced by serializable snapshot isolation rather than by locking everything in sight. underneath it is a log structured merge tree, the same family as rocksdb and leveldb. it is written in rust. it is not finished and it is not for production: no replication, no sql, no client library past the rust crate. the interesting part is not what it does. it is the constraints it holds itself to while doing it.

here is the part that is solid, stated plainly so the rest of the post reads as drawing boundaries rather than apologizing. the engine detects serialization anomalies, the write skew above is one, and aborts a transaction to keep them out. that detection rests on a protocol a model checker explored across 24.6 million states without finding a violation. every bug the engine has hit reproduces byte for byte from a seed. the demo at the top is that engine running. everything after this section is either how that was built or where the edges are.

the engine itself is ordinary in its parts: a log structured merge tree for storage, learned indexes over the keys in place of b trees (piecewise linear models following the pgm index), serializable snapshot isolation for transactions. what is less ordinary is the discipline around it, and that is what the rest of the post is about: deterministic simulation so bugs are reproducible, formal specs so the protocols are checkable, and a written record of every decision next to the test that pins it. the throughput numbers come later and they are not flattering. they are not the point. the point is what that discipline costs and what it buys.

02 —...

write alice rocksdb crackeddb read transactions

Related Articles