Draw the rest of the ___ owl (software factory edition)

June 12, 2026

My guess is that most software development teams in the world are at the point where they're drawing the two circles. (Don't quote me on that; I don't have data to back it up.) Those are all the teams where developers use LLMs daily.

Some may only use them as assistants (e.g. asking them questions). But others no longer open their editors much, if at all. They mostly interact with the LLMs through a CLI tool like OpenCode, Claude Code, or Codex, and have the LLM write all the code, even if that means just fixing a typo.

Still, I consider all of that to be on the “draw some circles” side.

But then you hear about teams that might be getting closer to building software factories, like the team described in Simon Willison’s blog post titled “How StrongDM’s AI team build serious software without even looking at the code”.

And that’s the “just draw the rest of the ___ owl” side.

There’s a monumental gap there. And that’s the gap I’m trying to understand.

How can we trust the code without looking at it? (And does that change as models get better?)

How can we verify that the code is doing what we want it to do?

How can we verify that the code is not doing what we don’t want it to do? (e.g. secretly phoning home)

How can we make sure we’re building software that will be able to change over time? (We don’t want software that calcifies past the first few versions, where a change breaks a seemingly unrelated piece of code elsewhere)

How can we make sure the LLMs aren’t writing code that is slowly and silently decaying until a final day of reckoning? (We don’t want to be like the turkeys that are happy and well fed for 364 days of the year, only to be butchered on Thanksgiving Day)

How do we draw the rest of the owl?

If we’re going to treat the code the LLMs write as a black box that we never read, then all we can know about it is the inputs we provide, the outputs it creates, and the signals we get from what Birgitta Böckeler’s blog post calls “sensors”.

This is how I envision one factory line (in our larger software factory):

Context is not king

The inputs are the prompts and any context we provide the LLMs. Those matter. But I’m not convinced more context is what gets us to software factories.

Some people argue that if we can just provide a full specification, the LLMs will build it right. I disagree. I’m not convinced a 30-page tome that we create in “plan” mode is really how we get software factories. A better plan doesn’t help us know if the LLM has built the right thing, especially as more features are added over time.

No, if we’re going to build software factories, I think it all comes down to verification.

We need to make sure that the software we’re building behaves like we want it to, while also making sure the codebase can change over time without decaying.

Sadly, I see some people reverting to doing a lot of manual testing (e.g. they test their changes via some UI and say, “looks good”). Manual testing is fine, but it suffers from the same problems it always has: it’s slow and expensive (and therefore cannot cover all possible scenarios), and it’s not easily repeatable (so it’s easy to miss regressions). That’s why automated testing is so much better: it’s fast, cheap to run, deterministic, and repeatable.

So, if we cannot verify the behavior by just clicking around the app, what do we do?

As someone who’s been practicing TDD/BDD for over a decade, my first inclination was to treat tests as a verification tool. After all, automated tests should verify the behavior of our application.

But if the LLMs are the ones writing the tests inside the black box, how can we verify the verifying mechanism?

We need more sensors

This is where I think we need to make the most progress. If it is indeed possible to create good software factories, we need more information about what’s happening inside the black box.

Remember, we want signals that help us answer questions like these:

Is the application behavior correct?

Are the automated tests the LLMs write actually covering the code they created?

Is the code architecture evolving well?

What’s the health of the code? Is it decaying?

Is the code performing unwanted side effects?

So far, my preferred flow includes writing Gherkin scenarios that act as an invariant the LLMs cannot break. I couple that with code coverage to make sure the tests are exercising new and existing code paths. And I’ve started using mutation testing to ensure those LLM-generated tests aren’t fake.
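To make the mutation testing idea concrete, here is a minimal hand-rolled sketch in Python. It is not how any particular mutation testing tool works internally, and the function under test (can_checkout) and both test suites are made up for illustration. The idea is only this: introduce a small bug on purpose and check whether the tests notice.

```python
import ast

# Hypothetical code under test, kept as source so we can mutate it.
SOURCE = """
def can_checkout(cart_total, balance):
    return balance >= cart_total
"""


class FlipComparison(ast.NodeTransformer):
    """Mutate `>=` into `>` to simulate an off-by-one bug."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the can_checkout function it defines."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["can_checkout"]


def weak_suite(fn):
    # Passes on the real code, but never checks balance == cart_total.
    return fn(10, 100) is True and fn(100, 10) is False


def strong_suite(fn):
    # Same checks plus the boundary case.
    return weak_suite(fn) and fn(50, 50) is True


def mutant_killed(suite):
    """Return True if the suite fails on the mutated code (the mutant is 'killed')."""
    original = load(ast.parse(SOURCE))
    mutated_tree = ast.fix_missing_locations(FlipComparison().visit(ast.parse(SOURCE)))
    mutant = load(mutated_tree)
    assert suite(original), "suite must pass on the unmutated code"
    return not suite(mutant)
```

Both suites can execute every line of can_checkout, so coverage alone cannot tell them apart. Only mutant_killed(strong_suite) comes back True; the weak suite lets the mutant survive, which is the signal that a test is "fake" in the sense above.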

For high-level architecture, I’ve been relying on DDD-style bounded contexts, checking module boundaries with libraries like Boundary for Elixir. That architecture should not change without human approval.
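As a rough illustration of what a boundary check does, here is a small Python analogue. This is not Boundary's API (Boundary works on Elixir modules at compile time); the context names and the ALLOWED map are hypothetical. The sketch assumes each bounded context is a top-level package and that the allowed dependencies live in a human-owned map.

```python
import ast

# Hypothetical bounded contexts and the contexts each one may depend on.
# In a factory setup, this map is the part that changes only with human approval.
ALLOWED = {
    "accounts": set(),
    "billing": {"accounts"},
    "shipping": {"billing"},
}


def imported_contexts(source):
    """Return the top-level package names imported by a module's source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def violations(context, source):
    """List other known contexts that `context` imports without being allowed to."""
    other_contexts = ALLOWED.keys() - {context}
    used = imported_contexts(source) & other_contexts
    return sorted(used - ALLOWED[context])
```

For example, violations("accounts", "from billing.invoices import Invoice") reports billing, because accounts is not allowed to reach into billing. Run as a CI step over every file, a check like this becomes a sensor: the LLM can rewrite the code inside a context freely, but a new cross-context dependency fails the build until a human edits the map.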

I’m also interested in exploring things like TLA+ to see if we can have a more formal specification of the runtime temporal logic. But I haven’t...
