The Shape of the Whole
Browse<br>ForewordManifestoGlossaryGuidesPosts<br>The companion<br>PreambleIDraw the graphIIFind the loop that will not let goIIIName the invariant no service ownsIVCompose the control loops on purposeVProbe the wholeThe through-lineAppendix<br>The Shape of the Whole<br>Model over memory: Engineering the failures no component owns
Companion to The Shape of the System. Read that first. Roman numerals in brackets, like (XIX), point to the parent's tenets; the five laws here are named.<br>There are two ways to ask whether a system will hold. One of them starts from the part and works up: you make each piece correct on its own, you get the contract right at every boundary, and then you trust that well-built parts and honest hand-offs add up to a sound whole. That is the manifesto, and it is local by design. The other one starts from the whole and works down. You begin with what the entire system must never do and you decompose from there, asking how parts that are each behaving exactly as built can still combine into a failure. That is reliability engineering, and it assumes the opposite thing. Not that good parts will hold, but that something always gives, and the work is bounding what it takes with it when it does.<br>The two ways meet in one place, and they get there from opposite ends. It is the failures that live in the composition and in no single part. From the bottom up that is the far edge of the map, and you are walking away from it, so calling it out of scope is not negligence, it is just the honest limit of a local view, and the manifesto stops there on purpose. From the top down it is right there in the middle of the page. It is the first question you ask.<br>The objective does not change to meet these failures. It is the parent's, word for word: minimise what a tired engineer must hold in their head to make a correct change, subject to keeping the blast radius bounded for anything an attacker or an unlucky caller controls. The move does not change either. You push correctness out of vigilance and into structure. What changes is only the type of structure. A type makes an illegal value unrepresentable. Here the structure is a model of the composition - a dependency graph, a capacity model, a consistency contract - and it makes a bad arrangement impossible to build where you can rule it out, impossible to miss where you cannot, and bounded when it shows up anyway. Some of these failures you design out and the rest you design to contain. The graph is a type whose values are topologies. The capacity model is a type whose values are loads. The consistency contract is a type whose values are histories. They are structure in the same sense a schema is. They hold whether or not anyone remembers them, and they keep the shape of the whole somewhere a reader can go and look instead of having to reconstruct it. The structure is just a different type, and that is all of it. Externalise, don't memorise.<br>None of this is free, and here the cure is most often the disease. Every model is itself state, and state drifts, and a stale map of the system is worse than no map at all, because it lies to you in the one moment you reach for it. And every mechanism that enforces a property across components - a coordinator, a consensus group, a saga - manufactures the coupling, the latency, and the fresh cascades the parent spends twenty-five tenets trying to contain. So each law names four things: the failure no component owns, the model that makes it tractable, the cost the cure itself imposes, and one question to ask before you ship. Reach for the machinery of the whole only when the property is real and violating it costs more than coordinating it. None of it is free.
IDraw the graph: independence is a claim, not a default<br>The components you call independent fail together through the substrate you forgot they share; map it, or the outage maps it for you.<br>Two services on separate hosts, written in different languages, owned by different teams, will feel independent right up until both of them resolve the same DNS, or read the same config service, or trust the same certificate, or sit in the same availability zone, and then the one thing underneath takes them both down in the same instant. Independence is not the resting state of a distributed system. It is a property you engineer and then verify , and common-mode failure is the bill for assuming it came for free. The shared substrate has a worst form and a funniest one. The worst is the DNS that withdraws its own routes when it cannot reach home, and then cannot be reached to undo it. The funniest is the status page served from the very thing whose outage it exists to announce. The same graph hides the other shape too, the cycle: A needs B needs C needs A, invisible from inside any one service, and a deadlock the moment they all boot cold or contend for one pool under load. The graph, not the service, is the unit of reasoning here, and the parent's bulkheads and cells...