The software industry: annealing, but wrong - apenwarr
You'll like it
We promise
Everything here is my opinion. I do not speak for your employer.
← March 2026
2026-05-31 "
The software industry: annealing, but wrong
In recent months I've heard of several teams with an interesting policy:<br>each pull request should be no more than a few files, and no more than a<br>certain number of lines (say 500). And do just one thing and do it well. And<br>be easy for a human to review. And be fully tested by the test suite.
All those are good requirements, right? Surely this is quality software<br>engineering.
And often, the results are good. Sure, splitting a single 6000-line feature<br>or fix into twelve 500-line PRs is more work, but each of those PRs is<br>surely easier to review. And you can git bisect them when there's a bug! And<br>maybe revert the individual change that broke something.
...and also cause 12x as many context switches for your<br>reviewers as they<br>review each one sequentially.1 But that's just the cost of<br>software quality! Right?
Mostly, yes. My analogy here is simulated<br>annealing. In that<br>process, you start your problem solving with a high energy -- making big<br>changes to move quickly through the problem space -- and then slowly reduce<br>the energy level so that your "hops" get smaller and smaller. In real<br>physical annealing (used eg. for metallurgy), the result is stronger, more<br>stable, more crystalline structures. In simulated annealing, you use it to<br>find solutions that aren't obvious, by rapidly exploring the solution space<br>and then zooming into the areas that look most promising.
In software the analogy is clear: sure, you might start with big jumps, but<br>once your system is more mature, you should make smaller jumps. Big<br>jumps break the crystalline structure. They cause bugs.
Fear of breaking the crystalline structure sounds cooler than fear of<br>change
The main problem with annealing-driven intuition happens when things do<br>need to change quickly. It's not made for that. You usually don't build a<br>hammer and then decide one day you want it to be a different shape. But<br>every day, there are compelling-sounding reasons to make your software a<br>different shape. Annealing is the enemy of change.
Modern AI-driven coding (ironically, with LLMs trained using a process quite<br>similar to annealing) does not care about your annealing and your risk<br>management and your fear of change. It produces changes as big and<br>interconnected as you want, jumping all over the solution space as quickly<br>as you can prompt. And it has all the outcomes the math would predict:<br>the output is less strong, less coherent, more likely to fail. LLMs have no<br>fear of change because the LLM instance will be long gone before the<br>consequences materialize.
But, it's a new and special feeling to suddenly be able to take a large,<br>mature code base and suddenly explore any kind of large change you want.<br>Most of those changes turn out to be bad ideas... and it's nice to be able<br>to discard bad ideas quickly. But some turn out to be good ideas. Then what?
Well, follow your development processes. Break the big changes into 500-line<br>patches. Review them one by one. You already did the research! You know it's<br>worth it.
Not every big step is made of small steps
But it's not about being worth it -- some changes simply don't lend<br>themselves to small steps.
In the early development of Aperture, I<br>wanted to implement dollar-based spend quotas: across all your LLM backends,<br>let a given team or person or node spend up to $x per unit time. But to do<br>that, we first had to add pricing information (it's mysterious how LLM<br>vendors don't to tell you how much your queries cost), which meant assigning<br>prices to provider definitions, and then we had to assign quotas to<br>particular identity+model+session combinations. And quotas are one of the<br>first key value propositions of Aperture. We had to have them, but we had to<br>have all that stuff.
So, I made a giant change that included three major areas: first, the Grant<br>syntax for applying attributes to sessions; second, a query cost<br>approximator that combined multiple sources and a messy heuristic; third,<br>the actual quota enforcement system. Each of these parts was imperfect, but<br>we needed all three parts in order to make anything work at all, before we<br>could refine them. That's the high-energy big-jump part. It came out to<br>something like 12000 lines of code.
Now, I'm not a monster. After I made it all work, I split it into three<br>parts: the grants, the pricing, the quotas.2 Otherwise it really<br>would have been an unreviewable mess. But also, I could not have developed<br>the quotas feature in real life in that artificial order. The grants<br>structure evolved as my understanding of pricing and quota enforcement<br>evolved. The original quota semantics sucked, so I rewound back to the data<br>structures, which affected how the pricing got imported, which changed how<br>the quotas were stored. The code reviewers didn't have to worry about...