SkillSpec - Verifiable Agent Skills
Agent skill verifier and executor
Make agent skills follow the plan.
Today, most skills are instructions the model may or may not follow.<br>SkillSpec checks a SKILL.md, turns the important parts<br>into a contract, and helps the agent run the right skill with the<br>right steps, tools, checks, and proof.
Assess a skill<br>Install SkillSpec
Open-source CLI and Open-standard contract format for verifiable<br>SKILL.md execution.
Platform shift
Where we are: the boom, and its hidden tax
Skills won. They are now a shared packaging format across major agent<br>surfaces, and the public ecosystem jumped from a small catalog into a<br>real infrastructure layer. The cost is that every new skill also adds<br>routing pressure, context pressure, and trust pressure.
20-day surge<br>18.5x<br>Reported growth from 2,179 skills on Jan 16 to 40,285 by early February.
Published skills<br>40,285<br>Verified public corpus in the Agent Skills data-driven analysis.
Peak day<br>8,857<br>Skills reportedly added on Jan 25 alone at the burst peak.
Name collisions<br>46%<br>Listed skills share a name with at least one other skill in the analyzed catalog.
01<br>The context tax compounds
Every installed skill advertises itself before work starts. At<br>roughly 100 tokens per skill, small libraries feel cheap; large<br>libraries become a standing bill. Codex caps the initial skill<br>list at 2% of context, or 8,000 characters when the window is<br>unknown, then shortens or omits skills. Once selected, the full<br>body is still read; the 40k-skill study reports a median body<br>around 1,414 tokens with a heavy tail.
Failure mode<br>Expensive skills quietly become less discoverable.
02<br>Discovery is already breaking
In a catalog this large, near-duplicates are normal. About 46% of<br>listed skills share a name with at least one other skill, so the<br>agent is asked to pick the right capability from a pile of similar<br>names and compressed descriptions.
Failure mode<br>Wrong-skill selection becomes routine, not exceptional.
03<br>The trust gap becomes a security gap
A skill is prose, and prose can be skipped, reframed, or forgotten.<br>Description-only framing selected adversarial variants in 77.6% of<br>paired trials, and context compaction can erase safety rules and<br>push prohibited tool actions from 0% to as high as 59%.
Failure mode<br>The final answer is not proof that the plan was followed.
Sources
Ling, Zhong & Huang, Agent Skills: A Data-Driven Analysis
verifies the 40,285-skill corpus; day-by-day growth and peak figures<br>are derived from reported analyses of that dataset.
Microsoft Agent Skills documentation
describes the advertise, load, resource, and script stages, including<br>the roughly 100-token advertise tier.
OpenAI Codex Agent Skills documentation
describes initial skill-list budgeting, description shortening or<br>omission, and full SKILL.md loading after selection.
Under the Hood of SKILL.md
studies semantic supply-chain attacks against skill discovery,<br>selection, and governance.
Governance Decay
measures context-compaction failures that erase safety constraints.
Omission Constraints Decay While Commission Constraints Persist
analyzes why prohibition-style constraints decay under context pressure.
The real failure
The problem isn't weak skills, it's load-bearing prose.
The most important behavior in a good skill is buried in paragraphs:<br>use this route, never substitute that tool, get approval before the<br>destructive step, prove the test ran.
Drift<br>The model skips, reorders, or substitutes instructions it reinterprets from scratch.
Waste<br>The same guidance is reread and repaid on every run, even when only one step matters.
Unprovability<br>A polished final answer is not evidence that the required path, checks, or tools were used.
You do not fix that by writing better paragraphs. Keep prose for<br>judgment, and move the must-follow parts into a small, checkable<br>contract beside it.
SkillSpec now
SkillSpec: import → execute → align
SkillSpec adds one small file, skill.spec.yml, next to<br>your existing SKILL.md. The skill still works anywhere;<br>the contract just makes the critical parts machine-checkable.
01<br>Assess
Doctor measures a skill before you trust it: token load, buried<br>instructions, name collisions, missing proof, and public URL risk.<br>It gives you a number, not a vibe.
02<br>Import
Convert the load-bearing prose once into routes, rules, forbids,<br>deny-by-default tool boundaries, and regression tests that can be<br>reviewed and versioned.
03<br>Execute
At run time, ask the CLI for the current route, phase, and tool<br>boundary. The manual stays on disk; the agent only holds the slice<br>it needs now.
04<br>Align
Replay the decision trace and compare the run to the resolved<br>contract. The verdict is honest: aligned, partial, or unproven.
First run explores. Every later run recalls.
The trace records which rule caused which route, fingerprinted by<br>the resolved spec and input hash, so drift is visible when the skill<br>changes.
What changes
Current...