Models Have Blind Spots: Debugging Unfamiliar Code with a Multi-LLM Loop – Barrett Sonntag
Thanks ChatGPT for the Graphic
Pasting a hard bug into one AI prompt feels productive until it isn’t. Single-model inference hits a ceiling fast; if the model misses the root cause on the first pass, it will cheerfully validate its own wrong answer forever. One way out is to act as human middleware between multiple, architecturally different LLMs: generate parallel hypotheses, swap their outputs for cross-review, and force them to argue until the overlapping signal emerges. It’s more labor than a single chat window, but it’s the difference between a confident hallucination and a fix that actually ships.
There was a ticket sitting in the queue for a throughput chart that kept breaking. The customer’s description was vague enough that our support person couldn’t fully parse what was happening at first: something about the graph going haywire every thirty seconds, metrics scrambling, then correcting itself. A coworker eventually spotted intermittent NaN errors in the console, which gave us a thread to pull. But I didn’t know this codebase, and nobody else had picked it up.
I’d been doing some prompt engineering work on the side and wanted to test how far I could push AI-assisted debugging on a real problem. So I grabbed the ticket. A single-model prompt gave me a reasonable hypothesis, but I wasn’t confident enough in it to start cutting code in a codebase I’d never touched. The bug didn’t really crack until I started running a loop between multiple LLMs; that’s when the hypothesis sharpened into something I could trust. That’s the process worth sharing.
Why a single model isn’t enough
The instinct is reasonable: paste the error, paste some code, ask the model what’s wrong. And for a lot of bugs, that works fine. But intermittent failures in unfamiliar code are a different species. The model doesn’t know the codebase any better than you do, and if it latches onto the wrong causal theory in its first pass, every follow-up response reinforces the mistake. Researchers call this self-anchoring, where the model’s confidence drifts upward even as its accuracy stays flat. It’s an echo chamber of one.
Homogeneous scaling doesn’t help either. Running three instances of the same model and asking them to debate produces redundant reasoning and artificial consensus. The errors are correlated because the training data and architecture are the same. You get three copies of the same blind spot.
The way past the plateau is architectural diversity. Models trained on different data, with different reinforcement pipelines, tend to make less correlated errors. A hallucination or logical miss by one model is more likely to be caught by another that doesn’t share the same blind spot. But the models can’t coordinate on their own; you have to do that part.
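The intuition can be put in rough numbers. This is a minimal sketch of the two idealized extremes, and the 30% miss rate is an illustrative assumption, not a measurement. Real models sit somewhere between fully correlated and fully independent, so treat the independent figure as a best case.

```python
def joint_miss_probability(miss_rate: float, n_models: int, correlated: bool) -> float:
    """Probability that every model in the group misses the root cause.

    Two idealized extremes:
    - correlated=True: all models fail on the same inputs, so extra copies add nothing.
    - correlated=False: misses are independent, so the probabilities multiply.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    if correlated:
        return miss_rate
    return miss_rate ** n_models


ASSUMED_MISS_RATE = 0.3  # hypothetical value, chosen only for illustration

same_model = joint_miss_probability(ASSUMED_MISS_RATE, 3, correlated=True)
diverse_models = joint_miss_probability(ASSUMED_MISS_RATE, 3, correlated=False)

print(f"three copies of one model all miss: {same_model:.3f}")
print(f"three independent models all miss:  {diverse_models:.3f}")
```

Under these assumptions, three copies of one model are no better than one, while three independent models would all miss far less often. The gap is the whole argument for mixing architectures.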
Step 1: Gather your clues before you prompt anything
Resist the urge to open a chat window immediately. You need a baseline: something observable, reproducible, and specific enough that any model can reason about it without guessing.
For my throughput bug, the baseline was:
A temporal pattern: the chart broke on a strict 30-second cycle. It fractured around second 17 and restored around second 47.
A console artifact: intermittent NaN values appearing in the data pipeline.
A visual symptom: SVG path elements rendering with corrupted coordinates.
That’s three concrete anchors. Without them, you’re asking the model to theorize in the dark, and it will happily oblige with something plausible and wrong.
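One way to keep the baseline honest is to write it down as data before writing any prompt, so every model receives identical facts. This is a sketch, not the tooling I actually used; the `BugBaseline` class and the wording of the closing instruction are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BugBaseline:
    """Observed facts about a bug, kept separate from any theory about its cause."""

    title: str
    observations: dict[str, str]  # label -> what was actually observed

    def to_prompt(self) -> str:
        """Render the baseline as a prompt that can be sent unchanged to every model."""
        lines = [f"Bug: {self.title}", "Observed facts (not theories):"]
        for label, detail in self.observations.items():
            lines.append(f"- {label}: {detail}")
        # Hypothetical instruction wording; adjust to taste.
        lines.append(
            "List root-cause hypotheses and state which observation supports each."
        )
        return "\n".join(lines)


baseline = BugBaseline(
    title="Throughput chart breaks intermittently",
    observations={
        "Temporal pattern": "strict 30-second cycle; fractures around second 17, "
        "restores around second 47",
        "Console artifact": "intermittent NaN values in the data pipeline",
        "Visual symptom": "SVG path elements render with corrupted coordinates",
    },
)

prompt = baseline.to_prompt()
print(prompt)
```

Because the prompt is generated from one structure, a disagreement between models later on can be attributed to the models rather than to differences in what each was told.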
Step 2: Generate parallel hypotheses with different models
Feed the same clue set (symptoms, console output, relevant code files) to two or more architecturally distinct models. I used Claude and Codex. The point is not to get the same answer twice; it’s to get different answers.
Both models identified the NaN propagation as the core failure, but they diverged hard on where to assign blame. One favored aggressive upstream data sanitization: fix the numbers before they ever reach the chart. The other favored strict encapsulation at the rendering boundary: let the chart defend itself against bad data. Each produced a different TDD plan.
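To make the two positions concrete, here is a rough sketch of each. It is not the code from the actual fix, and it uses Python purely for illustration; the function names, the sample data, and the choice to break the line at a bad point are all assumptions.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def sanitize_upstream(samples: Iterable[Point]) -> List[Point]:
    """Strategy 1: drop non-finite samples before they ever reach the chart."""
    return [(x, y) for x, y in samples if math.isfinite(x) and math.isfinite(y)]


def build_svg_path(points: Iterable[Point]) -> str:
    """Strategy 2: the renderer defends itself.

    Non-finite points are skipped, and the next valid point starts a new
    sub-path with "M" so no NaN is ever written into the path data.
    """
    commands = []
    pen_down = False
    for x, y in points:
        if not (math.isfinite(x) and math.isfinite(y)):
            pen_down = False  # lift the pen: leave a visible gap
            continue
        commands.append(f"{'L' if pen_down else 'M'}{x:g},{y:g}")
        pen_down = True
    return " ".join(commands)


# Hypothetical samples with one NaN in the middle.
raw = [(0.0, 10.0), (1.0, float("nan")), (2.0, 12.0), (3.0, 11.0)]

guarded = build_svg_path(raw)                        # boundary guard only
cleaned = build_svg_path(sanitize_upstream(raw))     # sanitized first

print(guarded)  # M0,10 M2,12 L3,11
print(cleaned)  # M0,10 L2,12 L3,11
```

The two outputs differ in a way that matters: the boundary guard leaves a gap where data was missing, while upstream sanitization silently joins the neighbors. Which behavior is correct is a product decision, which is part of why the models could reasonably disagree.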
This divergence is the signal, not the noise. If both models agree immediately, you might have a straightforward bug. If they disagree on root cause or fix strategy, you’re dealing with something layered, and the disagreement itself maps the territory.
Step 3: Cross-pollinate and force critique
This is the core of the process, and it’s where the labor lives. You become the middleware.
Take Model A’s analysis and TDD plan. Hand it to Model B, not as a prompt to build on, but as an artifact to critique. "Here is another model’s analysis of the same bug. Review it for logical gaps, missed failure modes, and risks." Then do the reverse: take Model B’s output and give it to Model A.
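The routing is mechanical enough to sketch. The helper below only builds the critique prompts; sending them is still a copy-and-paste job, or a call into whatever client you use. The function name, the separator line, and the placeholder analyses are hypothetical.

```python
from typing import Dict

CRITIQUE_INSTRUCTION = (
    "Here is another model's analysis of the same bug. "
    "Review it for logical gaps, missed failure modes, and risks."
)


def build_critique_prompts(analyses: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Build cross-review prompts as {reviewer: {author: prompt}}.

    Every model reviews every other model's analysis and never its own.
    """
    if len(analyses) < 2:
        raise ValueError("cross-review needs at least two analyses")
    prompts: Dict[str, Dict[str, str]] = {}
    for reviewer in analyses:
        prompts[reviewer] = {
            author: f"{CRITIQUE_INSTRUCTION}\n\n--- Analysis under review ---\n{text}"
            for author, text in analyses.items()
            if author != reviewer
        }
    return prompts


# Placeholders stand in for the real outputs pasted from each model.
analyses = {
    "model_a": "<Model A's analysis and TDD plan>",
    "model_b": "<Model B's analysis and TDD plan>",
}

prompts = build_critique_prompts(analyses)
print(prompts["model_b"]["model_a"])  # what Model B is asked about Model A's work
```

Keeping the author's name out of the prompt body is deliberate in this sketch: the reviewer is asked to judge the analysis, not the reputation of the model that wrote it.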
I brought Gemini and ChatGPT into the loop as diagnostic critics. I gave...