Visibility into the Black Box

Visibility Into the Black Box | matgreten.dev


I ship production code through an AI pipeline. Not to a toy project — to a ten-year-old Rails monolith that serves real users, reviewed by teammates who genuinely care about craft. They love OOP. They spot code smells three abstractions deep. They'll tell you when you're building something today that makes tomorrow harder.

One thing I hear often: these tools are really good at adding code, but not very good at deleting it. The value isn't just in what we ship — it's in what we choose not to ship. That's a hard standard to codify for an AI agent, but it's the standard the code has to meet.

80% of the time the pipeline gets me 90-95% of the way there. I make a few tweaks, respond to feedback, and it ships. I've seen a literal 12x increase in my PR throughput moving from using coding agents as assistance to running this pipeline. The other 20% is where I'm blind. When a run fails, when a tweak doesn't help, when I change a model and the patch rate shifts — I'm guessing. The pipeline works in a high-ownership, high-craftsmanship context, and I need visibility that matches that context.

That's what drove the migration to swamp. Make every decision the pipeline makes observable, so I can iterate on it with the same care my team expects from the code it produces.

What ADW is and Why it Exists

ADW — Agentic Development Workflow — is a pipeline I built for my own workflow. It's not a team tool (yet). It's one developer's attempt to automate the mechanical parts of shipping code against a production codebase that serves real users.

This is all inspired by IndyDevDan's AI Developer Workflow, which he uses to teach agentic engineering principles in his course Tactical Agentic Engineering.

The pipeline takes a plan and runs it through a series of phases:

Ideation — I describe what I want to build (often a messy brain dump), and the system explores the codebase, scores its own confidence, generates a structured contract, and runs an adversarial challenge against the plan before any code is written

Convert — breaks the plan into atomic user stories with acceptance criteria

Worktree — creates a git worktree, sets up Docker containers, seeds the environment

Build — spawns AI agents to implement each story, commits the work, and manages parallel task groups. Think Ralph loops, but far more fine-grained, with acceptance criteria (AC) validated by a separate agent.

Test — runs the relevant spec suite, identifies failures, loops back for fixes

Review — a separate agent reviews for correctness, OOP violations, query performance. Blockers trigger patch-and-re-review cycles.

Ship — submits PRs via Graphite, adds descriptions, handles submission failures

Artifact — a walkthrough review artifact is created to help me get back "in-the-loop" after the agents are done cooking.
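To make the shape of that flow concrete, here is a minimal sketch of phases run in order, with a review step that loops through patch-and-re-review. It is not ADW's actual code: `Phase`, `Pipeline`, `review_with_patches`, the `:ok`/`:failed` statuses, and the cap on patch cycles are all names and assumptions invented for illustration.

```ruby
# Hypothetical sketch. Phase names come from the list above; every class,
# method, and status symbol here is invented for illustration.
Phase = Struct.new(:name, :action)

class Pipeline
  attr_reader :log

  def initialize(phases)
    @phases = phases
    @log = []
  end

  # Runs each phase in order, passing a shared context hash along.
  # Records every phase result and stops at the first :failed status.
  def run(context = {})
    @phases.each do |phase|
      status = phase.action.call(context)
      @log << { phase: phase.name, status: status }
      return :failed if status == :failed
    end
    :shipped
  end
end

# Builds a review action: blockers trigger a patch, then a fresh review,
# until the review is clean or the (assumed) patch budget is spent.
def review_with_patches(reviewer:, patcher:, max_patch_cycles:)
  lambda do |context|
    cycles = 0
    loop do
      blockers = reviewer.call(context)
      return :ok if blockers.empty?
      return :failed if cycles >= max_patch_cycles

      patcher.call(context, blockers)
      cycles += 1
    end
  end
end

# Usage with stand-in phases (real phases would spawn agents, run specs, etc.).
# The first review finds one blocker, the second comes back clean.
pending_reviews = [["N+1 query in report export"], []]
reviewer = ->(_ctx) { pending_reviews.shift || [] }
patcher  = ->(ctx, blockers) { (ctx[:patches] ||= []).concat(blockers) }

pipeline = Pipeline.new([
  Phase.new(:build,  ->(ctx) { ctx[:commits] = 2; :ok }),
  Phase.new(:test,   ->(_ctx) { :ok }),
  Phase.new(:review, review_with_patches(reviewer: reviewer, patcher: patcher, max_patch_cycles: 3)),
  Phase.new(:ship,   ->(_ctx) { :ok })
])

context = {}
result = pipeline.run(context)
puts "#{result}: #{pipeline.log.map { |entry| entry[:phase] }.join(' -> ')}"
```

Keeping a per-phase log next to the result is a deliberate choice in this sketch: it is the kind of trace that makes a failed or patched run inspectable after the fact.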

The whole thing is about 10,000 lines of Ruby.

But it's not even close to done. I'm actively dogfooding every part of it, and the feedback loop is constant. Every PR teaches me something — a standard that isn't codified, a pattern the agents miss, a review heuristic that catches real issues. Part of building ADW has been codifying our team's standards so that agents can follow them more consistently. The agents surface gaps I didn't even know existed in our documented conventions.

One big tension I see between frontier agentic software development and the kind of codebase I work on is our standard of atomic PRs: changes stay bite-sized so they're easy for humans to review. In the short term, this project can't just let agents ship gigantic PRs, so we need a way to break the work into logical chunks for review. That's where Graphite comes in: stacking PRs makes this much easier than it was before I started using it.

All that to say: work that would have taken me months I can now produce, at similar or better quality, in a morning ADW run. The code mostly adheres to our standards and awaits human review.

The Ideation Phase is Still the Most Critical Part

As more of the implementation becomes automated, the quality of the input matters more, not less. A bad plan implemented perfectly is still a bad plan. The ideation phase — where I describe what I want and the system explores the codebase, scores confidence, and challenges the approach before any code is written — is the highest-leverage part of the entire pipeline.

Right now, ideation happens in the Claude context and leaves no structured trace. I can see the contract it generates. I can read the adversarial challenge. But I can't query across 50 ideation sessions to ask "how often does the confidence score predict actual build success?" or "which types of problems consistently score low confidence and then fail?" That's a candidate for swamp — and maybe the most valuable one left.
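As a rough illustration of what those two questions could look like once ideation leaves a structured trace, here is a small sketch over in-memory records. Everything in it is an assumption: the `IdeationSession` shape, a 0-100 integer confidence scale, the `category` labels, and the sample data are hypothetical, and nothing here reflects how swamp actually stores or queries data.

```ruby
# Hypothetical record shape. ADW does not persist ideation sessions like this today.
IdeationSession = Struct.new(:category, :confidence, :build_succeeded, keyword_init: true)

# "How often does the confidence score predict actual build success?"
# Groups sessions into confidence buckets (assumed 0-100 scale) and
# returns the build success rate per bucket, lowest bucket first.
def success_rate_by_confidence(sessions, bucket_size: 25)
  sessions
    .group_by { |s| (s.confidence / bucket_size) * bucket_size }
    .sort_by { |bucket, _group| bucket }
    .to_h { |bucket, group| [bucket, group.count(&:build_succeeded).fdiv(group.size).round(2)] }
end

# "Which types of problems consistently score low confidence and then fail?"
# Counts failed builds below the threshold, grouped by category, most frequent first.
def low_confidence_failures(sessions, threshold: 50)
  sessions
    .select { |s| s.confidence < threshold && !s.build_succeeded }
    .map(&:category)
    .tally
    .sort_by { |_category, count| -count }
    .to_h
end

# Made-up sample data, purely to exercise the two queries.
sessions = [
  IdeationSession.new(category: "migration", confidence: 30, build_succeeded: false),
  IdeationSession.new(category: "migration", confidence: 40, build_succeeded: false),
  IdeationSession.new(category: "ui tweak",  confidence: 90, build_succeeded: true),
  IdeationSession.new(category: "ui tweak",  confidence: 85, build_succeeded: true),
  IdeationSession.new(category: "refactor",  confidence: 80, build_succeeded: false),
  IdeationSession.new(category: "refactor",  confidence: 45, build_succeeded: true)
]

p success_rate_by_confidence(sessions)
p low_confidence_failures(sessions)
```

The point of the sketch is only that both questions reduce to a group-and-count once each session is a record with a score and an outcome attached.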

The pattern holds: as agents handle more of the...
