Skills Are Where Your Judgment Goes
In my last post I argued you should give agents jobs, not processes: tell an agent what it’s responsible for, and split the work so tools enforce the rules while the skill’s instructions carry the goals and the judgment. So now let’s talk about how to build a good one.
Suppose you want to teach an agent to handle a real job: responding to escalated customer support tickets. Here’s a skill you’ve seen before: pull the ticket history, check the help center, draft a reply in the brand voice, tag and close. It looks responsible. It’s almost worthless as a skill, because every line is either generic model work or tool work. Now compare: a credit above $200 needs approval. A customer who mentions a deadline or a migration is churn-risk, not venting. Stop recommending the CSV workaround, it corrupted a customer’s data last month. We’d rather eat a $50 credit than spend an hour of an engineer’s time. The first skill writes down the steps. The second writes down the judgment.
The first one is the trap, and it’s worth being clear about why so many skills fall into it. Your current process feels like the work, but it isn’t. It’s the work bent around the tools you had and the handoffs between people, and the steps that were easy to write down were the mechanical ones. Hand those to the agent and you’ve automated the version of the job you already do, a little faster. That’s iteration. It’s not nothing, but it’s not what you’re here for. The agent can do far more than run your old playbook quicker, and getting there means letting go of the playbook and asking what the job actually needs. The parts that matter are the ones only you can answer, and they’re rarely the parts you wrote down.
It comes down to four questions. What can the model already do? Where does experience change the answer? What’s true here that isn’t true everywhere? What must the tool enforce? The answers decide what belongs in the skill, what belongs in the tool, and what shouldn’t be written down at all. A skill is for the part the model doesn’t get from general competence: your judgment, your local knowledge, and your rules for what good means here.
What’s mechanical?
Start by splitting the job into two piles.
The first pile is mechanical work. It’s the stuff where there’s a known way to do it, where you can follow the steps and get the result. Sometimes that’s a fixed procedure, sometimes it’s a bounded task you can hand to a tool, but either way, doing it well doesn’t take an expert. It takes correct execution, and a capable model already has that. Looking up an account, running a query, formatting a reply, fetching the relevant docs: mechanical. You might still point the agent at it, but that’s a cheap, shallow kind of teaching. Hand it a tool and, if the tool is well built, it explains itself, what it’s for, what it expects, what it returns, so a single line in the skill does the whole job: this is the customer CRM, use it to look up customer info. The agent takes it from there. The most a skill should say about a mechanical step is which tool to reach for and when, never how the tool works.
The second pile is judgment, and that word needs a plain definition because people throw it around loosely. Judgment is the part of the work where two things are true at once. There’s no procedure that settles it, because if a checklist could settle it, it would just be mechanical work. And yet experience reliably gets you better answers. Put a researcher who knows the method from the textbook next to one who has watched it fail in production, and the second one will catch that the sample can’t support the claim, that the method is trendy but shaky, that the real question isn’t the one being asked. No formula got them there. They’ve just seen what goes wrong. That’s judgment: no rule to follow, but a right and a wrong you can still be held to.
It shows up in every field. A new-grad engineer reaches for the clever abstraction that demos beautifully and becomes a maintenance sinkhole two years later; the staff engineer knows which corners are safe to cut on a throwaway prototype and which will haunt a system that lives for a decade. A junior analyst trusts the dashboard; the senior one knows which metric quietly lies. The mechanical pile is the same for everyone. The judgment pile is where experience changes the answer, and that’s the pile a skill is actually for.
Where does experience change the call?
Finding that judgment is the step people skip, because it’s rarely written down. The runbook recorded the mechanical steps and left the judgment in someone’s head, for the same reason it was easy to write down: nobody records what feels like second nature. And you won’t get it by...