I had Codex iterate on its own AGENTS.md 8 times and measured each version against real PRs. The best one still regressed on a clean holdout. — Stet
I had Codex iterate on its own AGENTS.md 8 times and measured each version against real PRs. The best one still regressed on a clean holdout.<br>May 27, 2026
I have a confession: I vibe-coded my AGENTS.md, and I'm pretty sure it's slop.
I needed to make it better. Naturally, I asked Codex to do it.
The difference: this time, Codex used a benchmark on my repo to measure each change, and optimized AGENTS.md against the data, instead of on pure vibes.
Why We Should Take AGENTS.md Seriously
Saying "AGENTS.md is important" is, at this point, a cliche. At risk of beating a dead horse, I'll say it again.
Someone adds a rule that sounds smart, senior, and reasonable, commits it, and hopes the agent behaves better. But AGENTS.md, CLAUDE.md, and shared skills are not normal docs. They are part of the runtime behavior of your coding system.
The shift is to start treating AGENTS.md like a tunable part of the harness: holding everything else the same, how does agent behavior differ when I change AGENTS.md? That's what I measured.
The Results
After eight candidate runs, one version looked useful on a five-task training slice. It fixed the task the baseline missed, improved footprint risk, and moved several craft scores up.
Then I ran it on a clean ten-task holdout. The candidate regressed. Not catastrophically, but enough that blindly shipping would have been wrong. Footprint widened, tokens climbed, tool calls climbed, and code-review correctness fell, all while tests held even.
Caveat: one repo (mine), n=10 on the holdout. This is directional, not statistically significant.
For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.
The pattern is the agent doing more work for mixed outcomes - better on local craft (clearer names, coherent implementations), worse on boundary judgment (scope, minimality, robustness). Tokens and tool calls confirm it: the candidate was spending more to get there, not less. "Better instructions make the agent cheaper" did not hold on the holdout.
Methodology
The setup was Codex with gpt-5.5, medium reasoning, on real historical Stet tasks (dogfooding). Stet scored tests, strict publishability, equivalence, code review, footprint, total input/output tokens, duration, and craft/discipline rubrics like simplicity, coherence, robustness, instruction adherence, scope discipline, and diff minimality. The grader was gpt-5.4.
8 iterations on an n=5 sample set, and a n=10 task holdout.
I know sample size is small - the goal of this was to get directional analysis, and prove the methodology
Codex was set with a simple /goal: iterate AGENTS.md to improve performance on the benchmark.
decision deltascandidate minus baseline<br>signaliteration 7n=5 training sliceclean holdoutn=10 final checktests
+20pp
flat<br>equivalence
0 wins, 1 loss
1 win, 0 losses<br>review overall
+0.10
+0.05<br>correctness
flat
-0.20<br>scope discipline
-0.28
-0.49<br>footprint risk
-0.10
+0.039<br>token use
in -10.4M, out -24.6K
in +17.8M, out +23.6K<br>tool calls (captured)
partial -33.6 avg
+12.8 avg
Process
The first round of iteration showed something I wish more people internalized: plausible instructions are not necessarily good interventions.
Codex first tried a broad router rule: identify the work type, state a hypothesis before editing, read the right docs, and treat scope as part of correctness. It sounded good but exposed a failure mode: the agent could interpret "small scope" as permission to miss named obligations.
The next candidate added an "obligation ledger". Before editing, the agent had to identify the named behavior, compatibility constraints, docs, tests, and non-goals. Before reporting back, it had to mark each as met, missed, or not checked.
Here is the actual diff shape.
AGENTS.md diffsexpand for exact text<br>first-loop candidate1 removed, 6 added+<br>iteration 61 removed, 1 added+<br>iteration 71 removed, 2 added+<br>iteration 82 removed, 2 added+
That obligation-ledger candidate was the first useful signal. Code review improved by +0.75, correctness by +0.60, maintainability by +1.00, simplicity by +0.64, coherence by +0.60, and scope discipline by +0.36. Tests stayed flat at 5/5. But footprint risk got slightly worse, and the evidence was still a small same-sample read.
If I were editing by vibes, I might have shipped it. The eval said: useful direction, not a clean win, keep iterating.
Codex then tested the kind of rule that intuitively makes sense: prefer existing helpers, schemas, reporting paths, and public contracts before adding new machinery. It sounded correct - and the eval hated it. Tests still passed, which is...