Aurora DSQL: The Adjudicator

steepben1 pts0 comments

Aurora DSQL: The Adjudicator

Aurora DSQL: The Adjudicator

This article is about an internal component of the Aurora DSQL architecture that<br>we call the Adjudicator. I&rsquo;d recommend reading at least my article on the Circle<br>of Life or Marc Brooker&rsquo;s blog on Transactions and Durability before continuing<br>with this one.

Unlike other distributed databases, DSQL fully decouples reads and writes. In<br>order to commit a transaction, we just need to durably record it in the<br>journal. If only it were so simple. The Adjudicator&rsquo;s job is to abstract over<br>the many (many) challenges in getting this right.

Conflicts

The first design challenge is: am I even allowed to commit? Consider a<br>transaction to pay a debt:

SELECT balance FROM accounts WHERE id = 1;<br>-- returns 100

-- application checks: are there sufficient funds?<br>-- -> yes<br>UPDATE accounts SET balance = balance - 100 WHERE id = 1;

-- now the credit<br>UPDATE accounts SET balance = balance + 100 WHERE id = 2;

This transaction should commit on its own, but if another transaction<br>concurrently tried to deduct funds from the same account (id 1), then only one<br>should be allowed to commit.

DSQL does these checks optimistically at commit time. In a pessimistic system,<br>the first transaction to UPDATE id 1 would hold a lock until commit time, and<br>the second transaction would wait on that lock.

The Query Processor (QP) and Adjudicator work together to implement this. The QP<br>assembles a transaction that looks (conceptually) like this:

transaction:<br>start_time: xxx

writes:<br>accounts:<br>1: { balance: 0 }<br>2: { balance: 200 }

You&rsquo;ll be pleased to know we do not, in fact, use YAML in DSQL.

The Adjudicator will check if any of proposed changes conflict with any other<br>recently written keys, based on the transaction&rsquo;s start time. Because DSQL is<br>using accurate clocks and never serves stale reads, any transaction that<br>started after a key was last written is therefore guaranteed to have seen the<br>latest data.

If all checks pass, the Adjudicator picks a commit time, then atomically commits<br>the transaction to the journal (more on this later).

Pipelining

The journal&rsquo;s API supports pipelining, which is to say that the Adjudicator<br>doesn&rsquo;t need to wait for an acknowledgment before committing the new<br>transaction. This is important because replication across Availability Zones (or<br>AWS Regions) takes time, and we don&rsquo;t want to be waiting.

In order to support this, writes are pre-conditioned in some way. The journal<br>has several features to support this. The simplest one to explain is &ldquo;expected<br>sequence number&rdquo;. This allows the Adjudicator to blast writes at the journal:

sequence=1, expected=0: [record some data]

sequence=2, expected=1: [record some data]

sequence=3, expected=2: [record some data]

Without pipelining, every one of these 3 writes would need to wait on<br>replication. For example, if replication took 3ms, then we&rsquo;d be looking at 9ms<br>in total. With pipelining, the total time is reduced to ~3ms.

This is safe because failure cascades. If the write at sequence=2 is<br>rejected, then the write at sequence=3 is also rejected, because the journal<br>never reached sequence=2 to satisfy its expectation. The Adjudicator can&rsquo;t<br>end up with partial or out-of-order commits.

Expected sequence numbers are one of the first such features the journal added,<br>but they&rsquo;ve since added concepts such as &ldquo;generations&rdquo; that we make heavy use<br>of. So, why would your expectations not match reality? Let&rsquo;s talk about<br>failure.

Side note: both the Adjudicator and journal are written in Rust, and are based<br>around async concepts.

Availability

If an Adjudicator explodes immediately after committing the transaction, the<br>transaction is still durable. Standby Adjudicators (in other Availability<br>Zones) are tailing the journal and learning about all the committed<br>transactions. In addition to committing transactions, the leader also writes<br>heartbeats into the journal. These heartbeats serve multiple purposes (see<br>The flow of time in the Circle of Life), including failure detection of the<br>leader.

If a standby detects the leader has become unavailable, it will try and become<br>the leader. As previously mentioned the journal&rsquo;s API provides us tools to not<br>only elect a new leader, but to also fence off the old leader. Consider:

Adjudicator A is the leader, and temporarily loses network connectivity

B becomes the leader

A regains connectivity, attempts to record a transaction

The journal guarantees that A&rsquo;s attempt to commit will be rejected.

Together, these properties mean that DSQL can recover from failure extremely<br>quickly (hundreds of milliseconds). Obviously, the Adjudicator itself is rock<br>solid and doesn&rsquo;t crash, but infrastructure failures do happen from time to<br>time.

We&rsquo;ve also optimized for graceful handover, so that we can deploy with even<br>less impact. This is really important to us, because...

rsquo adjudicator transaction journal dsql time

Related Articles