SkillSpec – verify that agent skills run the way SKILL.md says

SkillSpec - Verifiable Agent Skills

Agent skill verifier and executor

Make agent skills follow the plan.

Today, most skills are instructions the model may or may not follow.<br>SkillSpec checks a SKILL.md, turns the important parts<br>into a contract, and helps the agent run the right skill with the<br>right steps, tools, checks, and proof.

Open-source CLI and open-standard contract format for verifiable<br>SKILL.md execution.

Platform shift

Where we are: the boom, and its hidden tax

Skills won. They are now a shared packaging format across major agent<br>surfaces, and the public ecosystem jumped from a small catalog into a<br>real infrastructure layer. The cost is that every new skill also adds<br>routing pressure, context pressure, and trust pressure.

20-day surge<br>18.5x<br>Reported growth from 2,179 skills on Jan 16 to 40,285 by early February.

Published skills<br>40,285<br>Public corpus size verified in Agent Skills: A Data-Driven Analysis.

Peak day<br>8,857<br>Skills reportedly added on Jan 25 alone at the burst peak.

Name collisions<br>46%<br>Share of listed skills whose name matches at least one other skill in the analyzed catalog.

01<br>The context tax compounds

Every installed skill advertises itself before work starts. At<br>roughly 100 tokens per skill, small libraries feel cheap; large<br>libraries become a standing bill. Codex caps the initial skill<br>list at 2% of context, or 8,000 characters when the window is<br>unknown, then shortens or omits skills. Once selected, the full<br>body is still read; the 40k-skill study reports a median body<br>around 1,414 tokens with a heavy tail.

Failure mode<br>Expensive skills quietly become less discoverable.
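The arithmetic behind the context tax can be sketched in a few lines. This is an illustration only: the 100-token and 2% figures come from the text above, while the 200,000-token window, the function names, and the choice to treat the 2% cap as a token budget are assumptions made for the example.

```python
# Figures quoted in the text: roughly 100 tokens advertised per skill, and an
# initial skill list capped at 2% of the context window.
TOKENS_PER_SKILL = 100
LIST_BUDGET_PERCENT = 2


def advertise_cost(num_skills: int) -> int:
    """Approximate tokens spent advertising every installed skill."""
    return num_skills * TOKENS_PER_SKILL


def skills_within_budget(context_tokens: int) -> int:
    """How many ~100-token entries fit inside 2% of the window."""
    budget = context_tokens * LIST_BUDGET_PERCENT // 100
    return budget // TOKENS_PER_SKILL


if __name__ == "__main__":
    window = 200_000  # hypothetical context window, not a figure from the text
    limit = skills_within_budget(window)
    for installed in (10, 40, 200):
        fits = installed <= limit
        print(installed, advertise_cost(installed), "fits" if fits else "over budget")
```

Under these assumptions, a 200,000-token window leaves room for about 40 full-length entries; anything beyond that has to be shortened or dropped.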

02<br>Discovery is already breaking

In a catalog this large, near-duplicates are normal. About 46% of<br>listed skills share a name with at least one other skill, so the<br>agent is asked to pick the right capability from a pile of similar<br>names and compressed descriptions.

Failure mode<br>Wrong-skill selection becomes routine, not exceptional.
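A collision share like the 46% figure can be measured on any local skill list with a short script. This is a minimal sketch, not the study's method: the case-insensitive matching and the sample names are assumptions, and the analysis cited above may normalize names differently.

```python
from collections import Counter


def collision_share(names: list[str]) -> float:
    """Fraction of skills whose name matches at least one other skill."""
    # Assumption: names are compared case-insensitively after trimming.
    normalized = [name.strip().lower() for name in names]
    if not normalized:
        return 0.0
    counts = Counter(normalized)
    colliding = sum(1 for name in normalized if counts[name] > 1)
    return colliding / len(normalized)


if __name__ == "__main__":
    # Hypothetical catalog entries, for illustration only.
    catalog = ["pdf-tools", "PDF-Tools", "git-helper", "deploy"]
    print(collision_share(catalog))
```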

03<br>The trust gap becomes a security gap

A skill is prose, and prose can be skipped, reframed, or forgotten.<br>With description-only framing, agents selected adversarial variants in 77.6% of<br>paired trials, and context compaction can erase safety rules and<br>push the rate of prohibited tool actions from 0% to as high as 59%.

Failure mode<br>The final answer is not proof that the plan was followed.

Sources

Ling, Zhong & Huang, Agent Skills: A Data-Driven Analysis<br>Verifies the 40,285-skill corpus; day-by-day growth and peak figures<br>are derived from reported analyses of that dataset.

Microsoft Agent Skills documentation<br>Describes the advertise, load, resource, and script stages, including<br>the roughly 100-token advertise tier.

OpenAI Codex Agent Skills documentation<br>Describes initial skill-list budgeting, description shortening or<br>omission, and full SKILL.md loading after selection.

Under the Hood of SKILL.md<br>Studies semantic supply-chain attacks against skill discovery,<br>selection, and governance.

Governance Decay<br>Measures context-compaction failures that erase safety constraints.

Omission Constraints Decay While Commission Constraints Persist<br>Analyzes why prohibition-style constraints decay under context pressure.

The real failure

The problem isn't weak skills; it's load-bearing prose.

The most important behavior in a good skill is buried in paragraphs:<br>use this route, never substitute that tool, get approval before the<br>destructive step, prove the test ran.

Drift<br>The model reinterprets the instructions from scratch on each run, so it can skip, reorder, or substitute steps.

Waste<br>The same guidance is reread, and paid for again in tokens, on every run, even when only one step matters.

Unprovability<br>A polished final answer is not evidence that the required path, checks, or tools were used.

You do not fix that by writing better paragraphs. Keep prose for<br>judgment, and move the must-follow parts into a small, checkable<br>contract beside it.

SkillSpec now

SkillSpec: import → execute → align

SkillSpec adds one small file, skill.spec.yml, next to<br>your existing SKILL.md. The skill still works anywhere;<br>the contract just makes the critical parts machine-checkable.

01<br>Assess

Doctor measures a skill before you trust it: token load, buried<br>instructions, name collisions, missing proof, and public URL risk.<br>It gives you a number, not a vibe.

02<br>Import

Convert the load-bearing prose once into routes, rules, forbids,<br>deny-by-default tool boundaries, and regression tests that can be<br>reviewed and versioned.
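To make the idea of a reviewable contract concrete, here is a minimal sketch of what the imported structure might hold and how it could be checked before anyone trusts it. The class names, field names, and the `lint` helper are hypothetical; they are not the skill.spec.yml schema, which is not shown in this text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    when: str               # human-readable trigger for choosing this route
    steps: tuple[str, ...]  # ordered phases the run must go through


@dataclass(frozen=True)
class Contract:
    routes: tuple[Route, ...]
    allowed_tools: frozenset[str]  # deny-by-default: unlisted tools are refused
    forbids: frozenset[str]        # tools that must never be used


def lint(contract: Contract) -> list[str]:
    """Return reviewable problems found in a contract (empty list means clean)."""
    problems = []
    for tool in sorted(contract.allowed_tools & contract.forbids):
        problems.append(f"tool is both allowed and forbidden: {tool}")
    name_counts = Counter(route.name for route in contract.routes)
    for name, count in sorted(name_counts.items()):
        if count > 1:
            problems.append(f"duplicate route name: {name}")
    for route in contract.routes:
        if not route.steps:
            problems.append(f"route has no steps: {route.name}")
    return problems


if __name__ == "__main__":
    # Hypothetical contract, for illustration only.
    contract = Contract(
        routes=(Route("prod-migration", "change touches production",
                      ("backup", "approve", "migrate", "verify")),),
        allowed_tools=frozenset({"git", "psql"}),
        forbids=frozenset({"psql"}),
    )
    print(lint(contract))
```

Because the contract is plain data, a check like this can run in review or CI whenever the file changes.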

03<br>Execute

At run time, ask the CLI for the current route, phase, and tool<br>boundary. The manual stays on disk; the agent only holds the slice<br>it needs now.
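The "slice it needs now" idea can be illustrated without the CLI. The sketch below is not SkillSpec's interface: the phase layout, the `current_slice` and `is_tool_allowed` helpers, and the tool names are all assumptions chosen to show a deny-by-default boundary per phase.

```python
def current_slice(phases: list[dict], index: int) -> dict:
    """Return only the active phase's instruction and tool boundary."""
    phase = phases[index]
    return {
        "phase": phase["name"],
        "instruction": phase["instruction"],
        # A phase that lists no tools gets an empty boundary (deny-by-default).
        "allowed_tools": frozenset(phase.get("allowed_tools", ())),
    }


def is_tool_allowed(active: dict, tool: str) -> bool:
    """A tool is usable only if the active phase explicitly lists it."""
    return tool in active["allowed_tools"]


# Hypothetical phases, for illustration only.
PHASES = [
    {"name": "plan", "instruction": "List the files to change.",
     "allowed_tools": ["read_file"]},
    {"name": "apply", "instruction": "Make the edits.",
     "allowed_tools": ["read_file", "write_file"]},
    {"name": "report", "instruction": "Summarize what changed."},
]

if __name__ == "__main__":
    active = current_slice(PHASES, 0)
    print(active["phase"], is_tool_allowed(active, "write_file"))
```

The agent only ever receives the dictionary for the active phase, so earlier and later instructions never occupy context.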

04<br>Align

Replay the decision trace and compare the run to the resolved<br>contract. The verdict is honest: aligned, partial, or unproven.
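One way to picture the three verdicts is a comparison between the steps a contract requires and the steps a trace can prove. This is a sketch under stated assumptions: the trace shape, the `evidence` field, and the exact meaning assigned to aligned, partial, and unproven are guesses for illustration, not SkillSpec's definitions.

```python
def verdict(required_steps: list[str], trace: list[dict]) -> str:
    """Classify a run by how many required steps the trace can prove."""
    # A step counts as proven only if some trace entry names it AND carries
    # non-empty evidence (for example a test log or command output).
    proven = [
        step for step in required_steps
        if any(entry["step"] == step and entry.get("evidence") for entry in trace)
    ]
    if len(proven) == len(required_steps):
        return "aligned"
    if proven:
        return "partial"
    return "unproven"


if __name__ == "__main__":
    # Hypothetical contract steps and trace, for illustration only.
    required = ["backup", "approve", "migrate"]
    trace = [
        {"step": "backup", "evidence": "backup.tar written"},
        {"step": "approve", "evidence": None},  # claimed, but no proof attached
        {"step": "migrate", "evidence": "migration log"},
    ]
    print(verdict(required, trace))
```

A step that is claimed without evidence does not count, which is why a polished final answer alone would land on "unproven" here.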

First run explores. Every later run recalls.

The trace records which rule caused which route, fingerprinted by<br>the resolved spec and input hash, so drift is visible when the skill<br>changes.
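Fingerprinting by resolved spec and input hash could look roughly like the following. The hashing scheme here (SHA-256 over canonical JSON, truncated to 16 hex characters) and the function name are assumptions for illustration; the text does not say how SkillSpec computes its fingerprints.

```python
import hashlib
import json


def fingerprint(resolved_spec: dict, run_input: dict) -> str:
    """Combine a hash of the resolved spec with a hash of the run input."""

    def digest(obj: dict) -> str:
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    return f"{digest(resolved_spec)}:{digest(run_input)}"


if __name__ == "__main__":
    # Hypothetical spec and input, for illustration only.
    spec_v1 = {"route": "prod-migration", "steps": ["backup", "migrate"]}
    spec_v2 = {"route": "prod-migration", "steps": ["migrate"]}
    task = {"target": "orders-db"}
    print(fingerprint(spec_v1, task))
    print(fingerprint(spec_v2, task))  # spec half differs, input half matches
```

Keeping the two halves separate makes it easy to tell whether a mismatch came from a changed skill or from a different input.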

What changes

Current...
