Aurora DSQL: The Adjudicator

This article is about an internal component of the Aurora DSQL architecture that we call the Adjudicator. I’d recommend reading at least my article on the Circle of Life or Marc Brooker’s blog on Transactions and Durability before continuing with this one.

Unlike other distributed databases, DSQL fully decouples reads and writes. In order to commit a transaction, we just need to durably record it in the journal. If only it were so simple. The Adjudicator’s job is to abstract over the many (many) challenges in getting this right.

Conflicts

The first design challenge is: am I even allowed to commit? Consider a transaction to pay a debt:

SELECT balance FROM accounts WHERE id = 1; -- returns 100

-- application checks: are there sufficient funds?
-- -> yes
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

-- now the credit
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

This transaction should commit on its own, but if another transaction concurrently tried to deduct funds from the same account (id 1), then only one should be allowed to commit.

DSQL does these checks optimistically, at commit time. By contrast, in a pessimistic system, the first transaction to UPDATE id 1 would hold a lock until it commits, and the second transaction would wait on that lock.

The Query Processor (QP) and Adjudicator work together to implement this. The QP assembles a transaction that looks (conceptually) like this:

transaction:
  start_time: xxx

  writes:
    accounts:
      1: { balance: 0 }
      2: { balance: 200 }

You’ll be pleased to know we do not, in fact, use YAML in DSQL.

The Adjudicator checks whether any of the proposed changes conflict with recently written keys, based on the transaction’s start time. Because DSQL uses accurate clocks and never serves stale reads, any transaction that started after a key was last written is guaranteed to have seen the latest data. A conflict can therefore only arise when a key was written after the transaction started.

If all checks pass, the Adjudicator picks a commit time, then atomically commits the transaction to the journal (more on this later).
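To make the check concrete, here is a minimal sketch in Python. It is a toy model, not DSQL’s implementation: the names ToyAdjudicator, Transaction, last_written and try_commit are all made up for illustration, timestamps are plain integers, and the commit time is passed in as now rather than picked from a real clock. The journal write is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    start_time: int   # when the transaction started reading
    writes: dict      # key -> new value, e.g. ("accounts", 1) -> {"balance": 0}


@dataclass
class ToyAdjudicator:
    # Hypothetical in-memory state: commit time of the latest write to each key.
    last_written: dict = field(default_factory=dict)

    def try_commit(self, txn, now):
        """Return the commit time, or None if the transaction conflicts."""
        for key in txn.writes:
            written_at = self.last_written.get(key)
            # A key written at or after our start time may not have been visible to us.
            if written_at is not None and written_at >= txn.start_time:
                return None
        # All checks passed: remember when each key was last written.
        for key in txn.writes:
            self.last_written[key] = now
        return now


adjudicator = ToyAdjudicator()
pay_debt = Transaction(
    start_time=10,
    writes={("accounts", 1): {"balance": 0}, ("accounts", 2): {"balance": 200}},
)
rival = Transaction(start_time=10, writes={("accounts", 1): {"balance": 50}})

print(adjudicator.try_commit(pay_debt, now=20))  # 20
print(adjudicator.try_commit(rival, now=21))     # None: account 1 was written at 20
```

Both transactions started at time 10, so whichever arrives second finds that account 1 was written after it started and is rejected. A retry that starts after time 20 would pass the same check.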

Pipelining

The journal’s API supports pipelining, which is to say that the Adjudicator doesn’t need to wait for an acknowledgment of one write before sending the next. This is important because replication across Availability Zones (or AWS Regions) takes time, and we don’t want to be waiting.

In order to support this, each write carries a precondition that the journal checks before accepting it. The journal has several features for this. The simplest one to explain is “expected sequence number”. This allows the Adjudicator to blast writes at the journal:

sequence=1, expected=0: [record some data]

sequence=2, expected=1: [record some data]

sequence=3, expected=2: [record some data]

Without pipelining, every one of these 3 writes would need to wait on replication. For example, if replication took 3ms, then we’d be looking at 9ms in total. With pipelining, the total time is reduced to ~3ms.

This is safe because failure cascades. If the write at sequence=2 is rejected, then the write at sequence=3 is also rejected, because the journal never reached sequence=2 to satisfy its expectation. The Adjudicator can’t end up with partial or out-of-order commits.
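The cascade is easy to see in a toy model. The sketch below is an assumption-laden stand-in for the journal, not its real API: ToyJournal, append and the fail flag (which simulates a write being rejected for some unrelated reason) are invented for this example.

```python
class ToyJournal:
    """Hypothetical append-only log that enforces an expected sequence number."""

    def __init__(self):
        self.sequence = 0   # sequence number of the last accepted write
        self.records = []

    def append(self, sequence, expected, data, fail=False):
        # Reject if the write is made to fail, or if the journal is not at the
        # sequence number the writer expected.
        if fail or self.sequence != expected:
            return False
        self.records.append((sequence, data))
        self.sequence = sequence
        return True


journal = ToyJournal()
# The writer sends all three writes without waiting for any acknowledgment.
results = [
    journal.append(sequence=1, expected=0, data="txn-a"),
    journal.append(sequence=2, expected=1, data="txn-b", fail=True),
    journal.append(sequence=3, expected=2, data="txn-c"),
]
print(results)          # [True, False, False]
print(journal.records)  # [(1, 'txn-a')]
```

The third write was perfectly healthy, but it was rejected anyway because the journal was still at sequence 1. The log ends up as a clean prefix of what was sent, with no gap.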

Expected sequence numbers were one of the first such features the journal added, and it has since gained concepts such as “generations” that we make heavy use of. So, why would your expectations not match reality? Let’s talk about failure.

Side note: both the Adjudicator and journal are written in Rust, and are based around async concepts.

Availability

If an Adjudicator explodes immediately after committing the transaction, the transaction is still durable. Standby Adjudicators (in other Availability Zones) are tailing the journal and learning about all the committed transactions. In addition to committing transactions, the leader also writes heartbeats into the journal. These heartbeats serve multiple purposes (see The flow of time in the Circle of Life), including failure detection of the leader.
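As a rough illustration of heartbeat-based failure detection, here is a toy standby in Python. Everything in it is an assumption made for the example: the ToyStandby class, the record shape (a dict with kind and time_ms), and the 100 ms timeout are not taken from DSQL. Handling of commit records is omitted.

```python
class ToyStandby:
    """Hypothetical standby that tails the journal and watches for heartbeats."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_heartbeat_ms = None

    def observe(self, record):
        # Called for every record read from the journal.
        if record["kind"] == "heartbeat":
            self.last_heartbeat_ms = record["time_ms"]
        # A real standby would also apply commit records to its conflict state.

    def leader_suspected(self, now_ms):
        # No heartbeat seen yet: nothing to compare against, so don't suspect.
        if self.last_heartbeat_ms is None:
            return False
        return now_ms - self.last_heartbeat_ms > self.timeout_ms


standby = ToyStandby(timeout_ms=100)
standby.observe({"kind": "heartbeat", "time_ms": 1000})
print(standby.leader_suspected(now_ms=1050))  # False
print(standby.leader_suspected(now_ms=1101))  # True
```

Suspecting the leader is only the trigger. Whether the standby actually gets to take over is decided by the journal, which is what keeps two leaders from committing at once.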

If a standby detects that the leader has become unavailable, it will try to become the leader. As previously mentioned, the journal’s API gives us the tools not only to elect a new leader, but also to fence off the old one. Consider:

Adjudicator A is the leader, and temporarily loses network connectivity

B becomes the leader

A regains connectivity, attempts to record a transaction

The journal guarantees that A’s attempt to commit will be rejected.
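One common way to get this kind of fencing is to tag every write with a generation number and have the log reject writes from any generation but the current one. The sketch below shows that general technique in Python. It is not a description of how the journal’s “generations” actually work, and FencedJournal, take_leadership and append are hypothetical names.

```python
class FencedJournal:
    """Hypothetical log that only accepts writes from the current generation."""

    def __init__(self):
        self.generation = 0
        self.records = []

    def take_leadership(self, expected_generation):
        # Conditional bump: succeeds only if nobody else took over first.
        if expected_generation != self.generation:
            return None
        self.generation += 1
        return self.generation

    def append(self, generation, data):
        # Writes from a stale generation are rejected, which fences the old leader.
        if generation != self.generation:
            return False
        self.records.append((generation, data))
        return True


journal = FencedJournal()
gen_a = journal.take_leadership(expected_generation=0)      # A becomes leader
print(journal.append(gen_a, "txn-1"))                       # True

# A loses connectivity; B, which has been tailing the journal, takes over.
gen_b = journal.take_leadership(expected_generation=gen_a)

# A comes back and tries to commit with its old generation.
print(journal.append(gen_a, "txn-2"))                       # False
print(journal.append(gen_b, "txn-3"))                       # True
```

A does not need to know it has been replaced. Its stale generation is enough for the log to turn the write away.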

Together, these properties mean that DSQL can recover from failure extremely quickly (hundreds of milliseconds). Obviously, the Adjudicator itself is rock solid and doesn’t crash, but infrastructure failures do happen from time to time.

We’ve also optimized for graceful handover, so that we can deploy with even less impact. This is really important to us, because...
