Chapter 3: Time and Order in Distributed Systems | codedump notes
June 25, 2026
In distributed systems, multiple nodes work together. Client requests are sent to different nodes for processing, and these requests become events on individual nodes. As we will see, the system’s state is formed by executing these events one by one in a specific order. Therefore, the order of events is particularly important. Different nodes may see different sequences of events, which can lead to different states on these nodes.
Because the order in which events are processed affects the state of each node, how to determine the order of events becomes a core problem in distributed systems. A natural idea is to sort events by their physical time. In the following sections, we will learn how physical clocks work and, unfortunately, discover that comparing physical timestamps across nodes is imprecise and can sometimes order events incorrectly. Once events are applied in the wrong order, nodes can diverge and the state of the cluster can become inconsistent.
If physical time cannot be relied on, what other method can determine the order of events in a distributed system? The answer is logical clocks.
Surprisingly, although we have repeatedly emphasized that the order of events is particularly important, in distributed systems there may still be cases where the order of occurrence of two events cannot be determined. Such events are called “concurrent events.” To explain event ordering, we also need to understand two mathematical definitions: partial order and total order. In fact, these two mathematical definitions should be deeply rooted in the mind of every distributed systems engineer, and we will encounter their applications in distributed systems again and again in later chapters.
In this chapter, we will discuss the following topics:
How the order in which events occur affects the state of the system;
Why physical time cannot be used to measure the order of events in distributed systems;
The theoretical foundation and computation rules of logical clocks;
Intuitive explanations and definitions of partial order and total order;
Vector clocks, an extension based on logical clocks.
State, Events, and Snapshots
In some crime dramas, we often see scenes like this: police are tracking a suspect who suddenly changes their behavior one day, so the police pull up records to review where the suspect went, who they met, and what they did over the past day, in order to understand what caused the suspect’s change.
Here, “places visited, people met, and things done” are individual events. If we view the suspect as a system, the suspect from yesterday and the suspect after these events represent the system’s states at different points in time. (This raises a philosophical question: after going through a series of events, the “me” of the past becomes the “me” of the present. Are these two the same “me”?)
Let us continue explaining state and events with a simple storage service. As shown below, a storage service initially has state $\{x = 1\}$. After sequentially executing the commands $\text{set}\ x = 2$ and $\text{set}\ x = 3$, the new state becomes $\{x = 3\}$. Here, the execution order of these two commands is particularly important; if the order of these two events is reversed, the resulting state becomes $\{x = 2\}$.
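The storage example can be reproduced in a few lines of Python. This is only a minimal sketch: the `Command` tuple encoding and the `apply_commands` helper are hypothetical names introduced for illustration, and the store is modeled as a plain dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of "set key = value" as a (key, value) pair.
Command = Tuple[str, int]


def apply_commands(initial: Dict[str, int], commands: List[Command]) -> Dict[str, int]:
    """Apply set commands one by one, in list order, to a copy of the state."""
    state = dict(initial)  # copy so the initial state is left untouched
    for key, value in commands:
        state[key] = value
    return state


initial = {"x": 1}
set_2 = ("x", 2)
set_3 = ("x", 3)

in_order = apply_commands(initial, [set_2, set_3])
reversed_order = apply_commands(initial, [set_3, set_2])

print(in_order)        # {'x': 3}
print(reversed_order)  # {'x': 2}
```

Both runs start from the same state and execute the same two commands; only the order differs, and that alone decides the final value of `x`.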
Note that the examples above never mention the physical time at which an event executes; they focus only on the order of execution. We care about the relative order of events rather than physical time because a distributed system has no globally unified physical time. If we view the system as a large state machine, events are the operations that change its state. As long as the state machine is deterministic and events are executed in the same order every time, the system always reaches the same state.
Executing the same events in the same order produces the same result—this is the core idea of state machine replication [1]:
If two different processes start from the same initial state and process input data in the same order, they will produce the same output.
Note: After discussing logical clocks, we will see that the execution order of events is, in a sense, logical time.
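The replication principle stated above can be illustrated with a small sketch. The `Replica` class, the `Event` tuple format, and the `set`/`add` operations are assumptions made for this example, not part of any particular system. Each replica is a deterministic state machine that starts from the same initial state $\{x = 1\}$ and replays a log of events.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical event format: (operation, key, operand).
Event = Tuple[str, str, int]


@dataclass
class Replica:
    """A deterministic state machine; every replica starts from {x: 1}."""
    state: Dict[str, int] = field(default_factory=lambda: {"x": 1})

    def apply(self, event: Event) -> None:
        op, key, operand = event
        if op == "set":
            self.state[key] = operand
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + operand
        else:
            raise ValueError(f"unknown operation: {op}")

    def replay(self, log: List[Event]) -> None:
        # Events are applied strictly in log order.
        for event in log:
            self.apply(event)


log = [("set", "x", 2), ("add", "x", 5)]

replica_a, replica_b, replica_c = Replica(), Replica(), Replica()
replica_a.replay(log)                  # same log, same order
replica_b.replay(log)                  # same log, same order
replica_c.replay(list(reversed(log)))  # same events, different order

print(replica_a.state)                     # {'x': 7}
print(replica_a.state == replica_b.state)  # True
print(replica_c.state)                     # {'x': 2}
```

Replicas A and B agree because they process identical input in identical order. Replica C sees the same events but in reverse, so it ends in a different state, which is exactly the divergence that a consistent event order is meant to prevent.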
On the other hand, the instantaneous state of this storage service at a given point in time is called a snapshot at that point in time. For example, taking a photograph on a busy street captures a snapshot of that street at that exact moment. In...