A distributed systems reliability glossary

Introduction

This glossary is an overview of the concepts that you’ll need to think about distributed systems reliability. We’re writing chiefly for industry practitioners – software developers who are learning about distributed systems testing at any stage of their careers.

It’s meant as a handy guide, bringing together information that was previously scattered all over the internet — because the concepts here originate in many different disciplines (and naturally everyone’s too shy to talk to people outside their field, us included). To the best of our knowledge, it’s the first resource to do so. At the same time, we hope that simply putting all these ideas together in one place starts to show how they all fit together.

But! And we cannot stress this enough — this is a reference, not required reading!

We’re not saying you need to understand every one of these concepts in order to test a distributed system. Every time you write an integration test, you’re testing a distributed system already! This glossary is here to encourage you to get deeper into a topic that’s increasingly important for every developer committing production code today.

So our goal is to provide intuitive explanations, with pointers to more formal definitions should you need them. We present clear, univalent definitions of terms that are actually messy and contested, like “process,” “repeatable read” or “eventual consistency.” In such cases, we attempt to nod at the diversity of definitions and usages that exists, but our priority is to give a reader something that’s directionally correct and actually useful for a learner.

We’ve also included essential concepts for which there are no formally defined or widely accepted terms in existing literature, like “garbage reads” and “g-nonadjacent.” Maybe the names will stick?

We know it’s incomplete, and if you care about software reliability or distributed systems, we’d love your help!

This glossary is organized as follows:

Preliminaries: concepts used in defining phenomena and consistency models.

Consistency models: which define what systems are allowed to do.

Availability models: which describe different ways systems can be available.

Phenomena: things a system does which someone, somewhere, thought were a bad idea.

Faults: fault models to which you might want your system to be resilient.

Testing techniques: ways to test whether your system actually obeys these models, or experiences these failure modes.

Further reading: a reading list of key reading lists.

Regardless of whether you’re working on your first distributed system or your fiftieth, we hope this will help you make it more reliable.

Yours, Jepsen & Antithesis

Preliminaries

These concepts are used often in defining phenomena and consistency models.

Dependency

In consistency models, a dependency is a relationship between two operations (e.g. transactions). For example, a single process could execute one operation before another: a process dependency. One operation could read data that was written by another: a write-read dependency.
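
As a rough illustration of the two dependency types above, here is a minimal Python sketch that derives them from a tiny recorded history. The `Op` type, the function names, and the sample history are hypothetical and invented for this example; the sketch also assumes that every written value is unique and that the history lists each process's operations in execution order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operation (e.g. a transaction) in a recorded history (hypothetical type)."""
    id: int
    process: str
    reads: tuple = ()   # (key, value) pairs this operation observed
    writes: tuple = ()  # (key, value) pairs this operation wrote


def process_dependencies(history):
    """Edges (a, b) where one process executed a directly before b."""
    last_by_process = {}
    edges = []
    for op in history:  # assumed to be in per-process execution order
        prev = last_by_process.get(op.process)
        if prev is not None:
            edges.append((prev.id, op.id))
        last_by_process[op.process] = op
    return edges


def write_read_dependencies(history):
    """Edges (w, r) where r read a value written by w (assumes unique writes)."""
    writer_of = {}
    for op in history:
        for key_value in op.writes:
            writer_of[key_value] = op.id
    edges = []
    for op in history:
        for key_value in op.reads:
            writer = writer_of.get(key_value)
            if writer is not None and writer != op.id:
                edges.append((writer, op.id))
    return edges


history = [
    Op(1, "p1", writes=(("x", 1),)),
    Op(2, "p2", reads=(("x", 1),)),
    Op(3, "p1", writes=(("x", 2),)),
]

print(process_dependencies(history))     # p1 ran op 1 before op 3
print(write_read_dependencies(history))  # op 2 read what op 1 wrote
```

Here operation 3 depends on operation 1 through a process dependency, and operation 2 depends on operation 1 through a write-read dependency. Real checkers infer such edges from far messier histories; this only shows the shape of the idea.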

Definite error

A definite error is returned by an operation which definitely did not happen. For instance, a transaction abort error is usually a definite error: the state of the system should be as if the transaction never happened at all. By contrast, an indefinite error may mean that the requested operation did or did not happen, or might happen later.

Indefinite error

An indefinite error is returned by an operation which may or may not have happened, or might happen later. For instance, a timeout is an indefinite error: the operation may not have been received at all, or it may have taken place without an acknowledgement, or it may be in-flight and execute five minutes later. By contrast, a definite error means the operation is known not to have executed.

Distinguishing between definite and indefinite errors is a key challenge in distributed systems design and testing. If one writes a unique value x = 3, receives a definite error, and later reads x = 3, that very likely signals an invariant violation. If the write receives an indefinite error, it is legal to read x = 3 now, at some later time, or never at all. Checkers must account for all possible outcomes.
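
The distinction above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `TransactionAborted` is a hypothetical exception standing in for whatever abort error a given client raises, and `Outcome`, `classify`, and `violations` are invented names, not part of any real testing library. The sketch also assumes the written value is unique, as in the `x = 3` example.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"                  # acknowledged: the write happened
    DEFINITE = "definite"      # the write definitely did not happen
    INDEFINITE = "indefinite"  # the write may or may not have happened


class TransactionAborted(Exception):
    """Hypothetical stand-in for a client's transaction abort error."""


def classify(exc):
    """Map the result of a write attempt to an outcome."""
    if exc is None:
        return Outcome.OK
    if isinstance(exc, TransactionAborted):
        return Outcome.DEFINITE
    # Timeouts and anything unrecognized are treated as indefinite:
    # we cannot prove the operation did not take effect.
    return Outcome.INDEFINITE


def violations(write_outcome, written_value, observed_values):
    """Return the reads that contradict the write's outcome.

    Only a definite error rules out ever observing the written value.
    """
    if write_outcome is not Outcome.DEFINITE:
        return []
    return [v for v in observed_values if v == written_value]


# A write of the unique value 3 fails definitely, yet a later read sees 3.
print(violations(classify(TransactionAborted()), 3, [0, 3]))
# The same reads after a timeout are all legal.
print(violations(classify(TimeoutError()), 3, [0, 3]))
```

Treating unrecognized errors as indefinite is the conservative choice in this sketch: it can hide a real bug, but it never reports a violation that the evidence does not support. Whether an acknowledged write must be visible to a later read depends on the consistency model, so this sketch deliberately does not check that.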

Object

In consistency models, a database usually contains a...
