Amdahl’s Law and the Limits to Growth | Guido Steinacker
A search service processes 350 requests per second. The server has 16 cores and the CPU utilisation is at 40%. As traffic is growing, the team believes there must be room for improvement.
They double the thread pool to 400 so that more requests can be processed simultaneously. Throughput rises from 350 to 410. With 800 threads: 425. With 1,600, it rises to 430. More threads mean almost no gain. It’s enough to make you tear your hair out!
Why is the curve flattening out? A thread dump or a look at the wall-clock profile in the profiler provides the answer: hundreds of threads are blocked. They are waiting for connections, locks and GC safe points.
Not all processes can be fully parallelised when handling requests. Each thread requires a connection from the pool, and the database uses locks to serialise concurrent accesses at the other end. The in-memory cache synchronises via a mutex. More threads mean more object allocation and thus more frequent ‘stop-the-world’ garbage collection pauses, in which all threads are halted regardless of how many cores are available. Even modern collectors such as ZGC or Shenandoah cannot eliminate these pauses completely; they merely shorten them.
In our scenario, the search service is still a monolith that communicates with a database. In distributed systems, synchronous calls to downstream services are added – we will come to this later in the series.
All of these are serial fractions – parts of the system that every thread must pass through individually. And there is a formula that describes exactly how severely these sections limit scalability.
The upper limit
A painter is painting a flat. He spends two hours masking off the edges and sockets. He has to do this alone because each edge depends on the one before it. In theory, at least. Then he spends eight hours painting the walls. He can get help with this: two painters can do it in four hours and four in two. But the two hours of masking remain. Even with a hundred painters, the job would still take at least two hours – the serial nature of the process sets the lower limit.
Gene Amdahl formulated this in 1967. He considered a program that could be executed partly sequentially and partly in parallel and asked: ‘If I use $n$ processors, how much faster will the program run?’
The answer can be summarised in a single sentence. The maximum speedup is the reciprocal of the serial portion. For example, if a tenth of the program is serial, the maximum speedup is a factor of ten. If a quarter of the program is serial, the maximum speedup is a factor of four. This remains true no matter how many processors are used. Formally, it looks like this:
\[S(n) = \frac{1}{s + \frac{1-s}{n}}\]
$s$ is the proportion of the program that must be executed serially – i.e. cannot be parallelised.
$n$ is the number of parallel units (cores, threads, instances, renderers, etc.).
$S(n)$ is the resulting speedup.
What happens as $n$ approaches infinity? The term $\frac{1-s}{n}$ approaches zero, leaving:
\[S(\infty) = \frac{1}{s}\]
The figures are as follows:
Serial proportion $s$<br>Maximum speedup
50 %<br>2×
25 %<br>4×
10 %<br>10×
5 %<br>20×
1 %<br>100×
No matter how many cores or threads you add to a 25% serial system, it will never run more than four times faster. This is not a question of implementation, but a hard limit. If we aren’t aware of this, we’re just throwing money at cloud providers. Pointlessly.
A common pitfall here is reading $s$ from the profiler of a parallelised application instead of measuring the original execution time on a single core. If the profiler shows ‘5% serial time’ with 16 cores, this does not mean that $s = 0{,}05$ – after all, the other 95% is already distributed across the 16 cores. The actual serial proportion is significantly higher.
Amdahl’s law therefore describes an upper limit for the speedup, i.e. the best-case scenario. In practice, however, things often turn out worse because the coordination overhead between parallel units can actively reduce throughput, not just cause it to stagnate. The team working on the search service was fortunate in that the throughput merely stagnated. In other systems, such as when many threads compete for the same database rows or retry storms flood the network, throughput can plummet dramatically.
Originally, the law describes the acceleration of a fixed workload. However, the same mechanism also applies when many requests per second pass through a system. Each request is a fixed job that must pass through the same serialisation point, which is why Amdahl’s law applies to both the latency of the individual request and, via Little’s law, to the overall throughput. Little’s law links parallelism, throughput and response time. This is precisely what the search service experienced at 430 req/s, even though there was still free capacity on the cores.
Digression: The Universal Scalability Law – Amdahl is a hopeless...