When Impressive Performance Gains Do Not Matter
Of anything I’ve worked on in my career, performance work has been the most rewarding. I enjoy making systems more efficient, especially when it opens up brand new possibilities for customers. I also find that developing an empirical understanding of systems is one of the best ways to learn how they work from first principles, especially how complex systems interact at scale or under load. But one of the greatest benefits of performance work is the creativity that comes from working intimately with systems. Through performance work, I find people develop a wealth of ideas for how products and services can be improved, most of which are not even related to performance optimization.
While improving performance always feels good, impressive claims like “10 times faster” or “an order-of-magnitude more efficient” or “fifty percent fewer resources” may not have the impact you anticipate due to constraints that are not always obvious or intuitive. This is an essay about three of those constraints.
Attention Threshold
Recently, I worked on improving the query performance of a new database that returns data to a user interface for graphing and interactive analysis. We were developing the new database with the goal of improving response time by an order of magnitude over the existing database, which had been in use for many years. The most expensive queries against the old database took between 5 and 10 minutes. After months of difficult engineering, we got the same queries to complete in 30 seconds to 1 minute, an order-of-magnitude improvement.[1] A presentation to management highlighting these gains would look very impressive: queries that used to take 10 minutes now return in 1 minute. However, I insisted it wouldn’t have the impact we wanted unless we squeezed out an additional order of magnitude.
Human-factors research identifies 10 seconds as the limit for keeping someone’s attention.[2] For delays longer than this, people will perform other tasks while they wait.[3] Therefore, even though a query that used to take 5 minutes now took 30 seconds, both were well above the 10-second threshold of attention. In both cases, people will context-switch—check their messages, go for coffee, start a conversation, start another task. When they finally return their attention a few minutes or hours later, the user interface will have loaded, but the time it actually took is immaterial.
Ultimately, if we could not complete queries in under 10 seconds, our performance improvements would not change the way people work. In complex systems, improving performance by an order of magnitude is often an incredibly difficult feat.[4] Sadly, we needed another order-of-magnitude improvement to bring queries under the 10-second threshold and hold users’ attention.[5]
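The threshold argument can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names are hypothetical, the timings are the round numbers from this section, and the 10-second limit is treated as a hard cutoff, which real attention is not.

```python
ATTENTION_LIMIT_S = 10.0  # attention limit discussed above, in seconds


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the new latency is than the old one."""
    return before_s / after_s


def crosses_attention_threshold(
    before_s: float, after_s: float, limit_s: float = ATTENTION_LIMIT_S
) -> bool:
    """True only if the change moves a query from at or above the limit to below it."""
    return before_s >= limit_s and after_s < limit_s


# (before, after) latencies in seconds; the last pair is a hypothetical
# second order-of-magnitude improvement.
cases = [(600, 60), (300, 30), (60, 6)]

for before, after in cases:
    factor = speedup(before, after)
    changed = crosses_attention_threshold(before, after)
    print(f"{before}s -> {after}s: {factor:.0f}x faster, crosses threshold: {changed}")
```

All three cases are the same tenfold speedup, but only the last one moves the query from above the limit to below it.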
Going From One to Two
Years ago, I worked on a project where we made incredible gains in efficiency by automating manual tasks, removing unnecessary steps, parallelizing parts of the process, and deferring steps that could be completed later, asynchronously. It improved the overall process from a few hours to reliably under an hour, somewhere between a 25 and 50 percent improvement. We were understandably excited about this impact.
As it turned out, this improvement in software performance didn’t impact the overall process because it was constrained by logistics. To demonstrate, consider a plumber, an electrician, or a carpenter. They each need to schedule work at a location, travel to that location, and then complete the work. For the sake of argument, if they work 8 hours in a day, and it takes 8 hours to complete the work at a location, then it doesn’t really matter if a process improvement just saved 2 or 3 hours, because there still isn’t enough time to travel to a new location and complete a new job. If you can’t get each job below 4 hours, including travel time, then you can’t complete two in a day. Breaching thresholds like this can be incredibly difficult and the efficiency gains along the way don’t pay off until you do. Going from one to two can be incredibly hard.[6]
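The step-like payoff can be made concrete with a small sketch. It assumes an idealized model in which a job, including travel, must fit entirely inside an 8-hour day; the function name and the sample durations are hypothetical.

```python
import math


def jobs_per_day(hours_per_job: float, workday_hours: float = 8.0) -> int:
    """Whole jobs that fit in one workday; a partially finished job counts for nothing."""
    return math.floor(workday_hours / hours_per_job)


# Hours per job, including travel, as the process gets steadily more efficient.
for hours in (8.0, 6.0, 5.0, 4.5, 4.0):
    print(f"{hours} hours per job -> {jobs_per_day(hours)} job(s) per day")
```

In this model, cutting a job from 8 hours to 4.5 hours saves 3.5 hours yet still yields one job per day. The count only moves when the duration reaches the 4-hour mark, and in practice you would need to get under it to leave any slack.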
Backpressure in Pipelines
The software infrastructure for many businesses includes data pipelines where events are produced from many different sources (vehicles, factory equipment, mobile phones, financial transactions) and then processed reliably to drive many other services and applications. The events are usually persisted to a durable log from which downstream services consume and process them. To achieve high throughput at scale, the log must be partitioned, and the downstream services must use techniques such as batching, pipelining, parallelism, efficient memory allocation, and dynamic scaling.
Performance bottlenecks in data pipelines can be hard to find because the system dynamics are correlated. A slow stage in the pipeline will, by design, exert backpressure on the upstream stages.[7] If there are multiple...