Aurora DSQL and the Circle of Life
Aurora DSQL and the Circle of Life
In this article we’re going to take a deeper look at the Circle of Life in<br>Aurora DSQL. Understanding the flow of data will really help you wrap your head<br>around the DSQL architecture, and how best to build applications against DSQL.
My intention in sharing this article is help you understand the flow. There<br>are many other things to understand: availability, scalability, durability,<br>security, and so on. I won’t be discussing those in detail, because each of<br>those topics is deep and complex, and deserves its own focus.
The flow of data
Aurora DSQL is based on PostgreSQL. The Query Processor (QP) component is a<br>running postgres process, although we’ve significantly modified it to work with<br>our architecture.
If you connect to DSQL with a Postgres client (such as psql), you’re connected<br>to one of these postgres processes. You’re connected to a QP, and you can start<br>to interact with it as you would with any other Postgres server.
If you run a local Postgres operation, such as SELECT 1, then that query is<br>processed entirely locally by the QP. But what happens when you query a table:
select * from test;<br>id | value<br>----+-------<br>1 | 10<br>(1 row)
Usually, you’d expect Postgres to read from storage locally, which might mean<br>reading from the buffer cache, or doing disk I/O. When running on Aurora<br>Postgres (APG), cache misses would result in a load from remote storage.
Like APG, reads in DSQL also go to remote storage. In our above query, which<br>is a scan of the entire test table, the QP is going to turn around and scan<br>storage, and storage is going to return all the rows in the table.
But how did storage get the rows in the first place?
insert into test values (1, 10); -- autocommit
Vanilla Postgres would process that transaction locally, inserting into the<br>Write-Ahead Log (WAL), updating the buffer cache, and using fsync() to persist<br>the changes to disk. In APG, the buffer cache is also updated, but the<br>durability of the transaction is ensured by fsync() to the remote storage in<br>multiple Availability Zones (AZ).
Commits in DSQL have the same basic ingredients, but they’re expressed quite<br>differently. In DSQL, data is durably persisted when it’s written to the<br>journal1. Storage follows the journal, and keeps itself up to date.
When I first started working on DSQL (many years ago!), I didn’t really get<br>this flow. I’d been told “writes go to the journal, reads go to storage”. I<br>nodded, but I didn’t deeply, truly, understand that simple explanation. I’d<br>spent too much time with traditional architectures, and my mind kept falling<br>back on the familiar.
What helped me get it was the picture at the top of this post. Imagine somebody<br>drawing this on the whiteboard. They draw the three boxes: QP, journal and<br>storage. Then, they draw the Circle of Life:
There’s something about this presentation, vs. the one at the top of the<br>article, that helped it go click for me. Removing the service interactions<br>certainly helps. Notice how in the first picture there’s an arrow from the QP<br>to storage, while in the Circle, the arrow is the other way round?
“Writes go to the journal, reads go to storage”, never quite did it for me.<br>“Writes never go to storage, reads never go to the journal” also didn’t quite<br>the message across. The Circle did for me, and I hope it does for you.
The flow of time
Now you may be thinking: How do we know that storage is up to date? We’ve just<br>inserted our (1, 100) tuple and got a successful commit. Then, we run our<br>table scan. What if storage isn’t up to date? What if there’s some kind of<br>delay on the network, preventing storage from learning about the new row?
The change is trying to reach storage, but it’s stuck in traffic:
The answer to this question is quite beautiful, and it’s one of the things I’m<br>most excited about with DSQL. Because the answer is absolutely not “eventual<br>consistency”.
You see, it’s not just data that’s flowing around the Circle, it’s time too.<br>Every transaction has a start time Tstart. This time comes from EC2 Time<br>Sync, which provides us with microsecond-accurate time. When the QP queries<br>storage, it doesn’t just ask “give me all the rows in the test table”.<br>Instead, it adds “.. as of Tstart”. When the QP writes data to the journal,<br>it computes the commit time and then says “store this data at Tcommit”.
The journal provides an ordered stream of both time and data, which means<br>storage can know precisely when it has all the data to answer the query.
As somebody who’s spent an awful amount of time debugging and trying work around<br>bugs caused by eventual consistency, I really cannot overstate how delighted I<br>am with this design property. In DSQL, you never have to...