Aurora DSQL: The Adjudicator
Aurora DSQL: The Adjudicator
This article is about an internal component of the Aurora DSQL architecture that<br>we call the Adjudicator. I’d recommend reading at least my article on the Circle<br>of Life or Marc Brooker’s blog on Transactions and Durability before continuing<br>with this one.
Unlike other distributed databases, DSQL fully decouples reads and writes. In<br>order to commit a transaction, we just need to durably record it in the<br>journal. If only it were so simple. The Adjudicator’s job is to abstract over<br>the many (many) challenges in getting this right.
Conflicts
The first design challenge is: am I even allowed to commit? Consider a<br>transaction to pay a debt:
SELECT balance FROM accounts WHERE id = 1;<br>-- returns 100
-- application checks: are there sufficient funds?<br>-- -> yes<br>UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- now the credit<br>UPDATE accounts SET balance = balance + 100 WHERE id = 2;
This transaction should commit on its own, but if another transaction<br>concurrently tried to deduct funds from the same account (id 1), then only one<br>should be allowed to commit.
DSQL does these checks optimistically at commit time. In a pessimistic system,<br>the first transaction to UPDATE id 1 would hold a lock until commit time, and<br>the second transaction would wait on that lock.
The Query Processor (QP) and Adjudicator work together to implement this. The QP<br>assembles a transaction that looks (conceptually) like this:
transaction:<br>start_time: xxx
writes:<br>accounts:<br>1: { balance: 0 }<br>2: { balance: 200 }
You’ll be pleased to know we do not, in fact, use YAML in DSQL.
The Adjudicator will check if any of proposed changes conflict with any other<br>recently written keys, based on the transaction’s start time. Because DSQL is<br>using accurate clocks and never serves stale reads, any transaction that<br>started after a key was last written is therefore guaranteed to have seen the<br>latest data.
If all checks pass, the Adjudicator picks a commit time, then atomically commits<br>the transaction to the journal (more on this later).
Pipelining
The journal’s API supports pipelining, which is to say that the Adjudicator<br>doesn’t need to wait for an acknowledgment before committing the new<br>transaction. This is important because replication across Availability Zones (or<br>AWS Regions) takes time, and we don’t want to be waiting.
In order to support this, writes are pre-conditioned in some way. The journal<br>has several features to support this. The simplest one to explain is “expected<br>sequence number”. This allows the Adjudicator to blast writes at the journal:
sequence=1, expected=0: [record some data]
sequence=2, expected=1: [record some data]
sequence=3, expected=2: [record some data]
Without pipelining, every one of these 3 writes would need to wait on<br>replication. For example, if replication took 3ms, then we’d be looking at 9ms<br>in total. With pipelining, the total time is reduced to ~3ms.
This is safe because failure cascades. If the write at sequence=2 is<br>rejected, then the write at sequence=3 is also rejected, because the journal<br>never reached sequence=2 to satisfy its expectation. The Adjudicator can’t<br>end up with partial or out-of-order commits.
Expected sequence numbers are one of the first such features the journal added,<br>but they’ve since added concepts such as “generations” that we make heavy use<br>of. So, why would your expectations not match reality? Let’s talk about<br>failure.
Side note: both the Adjudicator and journal are written in Rust, and are based<br>around async concepts.
Availability
If an Adjudicator explodes immediately after committing the transaction, the<br>transaction is still durable. Standby Adjudicators (in other Availability<br>Zones) are tailing the journal and learning about all the committed<br>transactions. In addition to committing transactions, the leader also writes<br>heartbeats into the journal. These heartbeats serve multiple purposes (see<br>The flow of time in the Circle of Life), including failure detection of the<br>leader.
If a standby detects the leader has become unavailable, it will try and become<br>the leader. As previously mentioned the journal’s API provides us tools to not<br>only elect a new leader, but to also fence off the old leader. Consider:
Adjudicator A is the leader, and temporarily loses network connectivity
B becomes the leader
A regains connectivity, attempts to record a transaction
The journal guarantees that A’s attempt to commit will be rejected.
Together, these properties mean that DSQL can recover from failure extremely<br>quickly (hundreds of milliseconds). Obviously, the Adjudicator itself is rock<br>solid and doesn’t crash, but infrastructure failures do happen from time to<br>time.
We’ve also optimized for graceful handover, so that we can deploy with even<br>less impact. This is really important to us, because...