Don't make a prompt cut you can't measure

edonadei1 pts0 comments

Cutting a skill in half without making it worse — edonadei Skip to content I was scrolling X when a tweet from Matt Pocock caught me. [2] His point: skills are littered with no-ops, lines like “be thorough” or “write a detailed commit message”. This is useless because models are RL to the bone already with this, so adding it to the skill is slop.

Usually modifying a skill means adding stuff. This one tried to make it smaller, and what was tricky was proving that smaller did not mean worse.

Skills are always-loaded context. Every line in a SKILL.md is paid for on every invocation, whether or not that line ends up mattering. Matt Pocock has written a writing-great-skills skill. [1] One of the things he advocates for is to prune to the shortest skill that still behaves predictably. The hard part is the second clause. It is easy to delete lines. It is hard to know whether you deleted a load-bearing one.

I took two skills, grill-skill [3] and evaluate-skill [4], and cut them by roughly 45% each. The claim I wanted to defend was not “these are better now.” It was narrower and more honest:

The shortened skill is no worse than the original, and still clearly beats the raw agent with no skill at all.

That is a non-inferiority test, and it changes how you evaluate.

Ablation is all you need

The classic move when you modify a skill is to add precise instructions, modify slop that your model wrote. It’s valid. But a skill trajectory should mostly to be trimmed with time. Models are getting better with every day passing, so your skill is becoming more irrelevant with time. But also models need SPACE! They need a clear context window to avoid being dumb. So let it breathe!

So here the exercise was to run ablation on the skill. I held each skill’s eval fixed and varied only the skill text, comparing pass@k across three variants:

control = the current full skill<br>candidate = a shortened variant<br>baseline = the raw agent, no skill at all<br>The baseline matters more than it looks. Without it, a shortened skill that quietly stopped working could still “pass” against a weak control, and you would never know the skill had stopped contributing anything. The baseline is the floor. If the candidate sinks toward it, the cut removed something real.

The decision rule:

Non-inferior at k=3. The candidate shows no task-level regression against the control.

Still earns its keep. The candidate still clearly beats the no-skill baseline.

Equalling or beating the control is a bonus, not a requirement. The claim is “no worse,” not “better.”

Debugging ran at k=1 to shake out spec and prompt bugs cheaply. Only the accept/reject decision ran at k=3. Both skills used the claude-code backend so the numbers were comparable.

Harden the eval before you trust it

While doing that exercise, I understood that the pre-existing eval was just bad. It was not exercising the prose I intended to cut. They tested YAML serialization and structure, the mechanical output, but never the interview discipline or the judgment guidance that made up the bulk of the text. An eval that cannot feel the damage makes every cut look safe.

So I rewrote the evals first. This feels backwards, since you are strengthening the thing that might block your change, but it is the only way the “no worse” claim means anything. A green light from an eval that can’t see the wound is not evidence.

For grill-skill, the old eval pre-supplied the whole task structure, so it only ever tested serialization. The new one has three tasks:

Interview discipline (first turn). An open prompt, no structure supplied. The judge checks that the agent interviews the user and stops, rather than fabricating answers and charging ahead. It is backed by a deterministic assert that no spec was written during the turn.

Gap-fill detect-and-ask. An existing eval is present. The judge checks that the agent reports the existing tasks and asks what is missing before touching anything, with an assert that the existing spec is left unmodified.

Serialize a valid spec. A fully-directive prompt, graded by a deterministic structure assert.

The first task taught me something worth writing down: the prompt has to signal that the user is present and will answer. Without that signal, an agent that just does the task instead of interviewing is being rational, not disobedient. The eval was measuring the wrong thing until the prompt made the human’s presence explicit.

grill-skill: 143 → 55 lines

The grill-skill cut was the clean one.

BeforeAfterReductionSKILL.md143 lines / ~1240 tok55 lines / ~703 tok~43%<br>What came out:

The inline caliper command blocks in the interview phases that duplicated REFERENCE.md. Those became a pointer to REFERENCE.md: a single source of truth instead of two copies that drift.

The verbose “quoted script” instructions, long blocks of exact wording to say to the user, collapsed into concise directives.

The intro paragraph that restated the description.

What stayed, verbatim in...

skill eval prompt worse skills lines

Related Articles