SkillSpec – verify that agent skills run the way SKILL.md says

SkillSpec - Verifiable Agent Skills

Agent skill verifier and executor

Make agent skills follow the plan.

Today, most skills are instructions the model may or may not follow. SkillSpec checks a SKILL.md, turns the important parts into a contract, and helps the agent run the right skill with the right steps, tools, checks, and proof.


Open-source CLI and open-standard contract format for verifiable SKILL.md execution.

Platform shift

Where we are: the boom, and its hidden tax

Skills won. They are now a shared packaging format across major agent surfaces, and the public ecosystem jumped from a small catalog into a real infrastructure layer. The cost is that every new skill also adds routing pressure, context pressure, and trust pressure.

20-day surge: 18.5x, the reported growth from 2,179 skills on Jan 16 to 40,285 by early February.

Published skills: 40,285, the verified public corpus in the Agent Skills data-driven analysis.

Peak day: 8,857 skills reportedly added on Jan 25 alone, at the burst peak.

Name collisions: 46% of listed skills share a name with at least one other skill in the analyzed catalog.

01 The context tax compounds

Every installed skill advertises itself before work starts. At roughly 100 tokens per skill, small libraries feel cheap; large libraries become a standing bill. Codex caps the initial skill list at 2% of context, or 8,000 characters when the window is unknown, then shortens descriptions or omits skills that do not fit. Once a skill is selected, its full body is still read; the 40k-skill study reports a median body of around 1,414 tokens with a heavy tail.

Failure mode: Expensive skills quietly become less discoverable.
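The arithmetic behind the context tax can be sketched in a few lines. This is a back-of-envelope estimate, not Codex's actual budgeting logic: the 100-token advertise cost, the 2% cap, and the 8,000-character fallback come from the text above, while the 4-characters-per-token conversion and all function names are assumptions made for illustration.

```python
from typing import Optional

TOKENS_PER_SKILL = 100        # rough advertise cost per skill (from the text)
LIST_CAP_PERCENT = 2          # initial skill list cap, percent of context (from the text)
FALLBACK_CHAR_CAP = 8_000     # cap when the window is unknown (from the text)
CHARS_PER_TOKEN = 4           # assumption: rule-of-thumb conversion, not from the source


def advertise_cost(num_skills: int) -> int:
    """Tokens spent just listing installed skills before any work starts."""
    return num_skills * TOKENS_PER_SKILL


def list_budget_tokens(context_window: Optional[int]) -> int:
    """Approximate token budget for the initial skill list."""
    if context_window is None:
        return FALLBACK_CHAR_CAP // CHARS_PER_TOKEN
    return context_window * LIST_CAP_PERCENT // 100


def skills_that_fit(context_window: Optional[int]) -> int:
    """How many full-length advertisements fit before shortening or omission."""
    return list_budget_tokens(context_window) // TOKENS_PER_SKILL


if __name__ == "__main__":
    for installed in (20, 200, 2_000):
        print(installed, "skills ->", advertise_cost(installed), "tokens to advertise")
    print("fit in a 200k window:", skills_that_fit(200_000))
```

Under these assumptions, a 200,000-token window has room for about 40 full-length advertisements, so a library of a few hundred skills is already past the point where descriptions get shortened or dropped.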

02 Discovery is already breaking

In a catalog this large, near-duplicates are normal. About 46% of listed skills share a name with at least one other skill, so the agent is asked to pick the right capability from a pile of similar names and compressed descriptions.

Failure mode: Wrong-skill selection becomes routine, not exceptional.
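A collision figure like the one above can be reproduced on any local catalog with a simple count. The sketch below assumes that names are compared case-insensitively after trimming whitespace; the source analysis may normalize names differently, and the function name is hypothetical.

```python
from collections import Counter


def collision_rate(names: list) -> float:
    """Share of listed skills whose name is used by at least one other skill."""
    if not names:
        return 0.0
    # Assumption: names collide when they match after trimming and lowercasing.
    counts = Counter(name.strip().lower() for name in names)
    colliding = sum(n for n in counts.values() if n > 1)
    return colliding / len(names)


if __name__ == "__main__":
    catalog = ["pdf-tools", "pdf-tools", "git-helper", "PDF-Tools", "deploy"]
    print(f"{collision_rate(catalog):.0%} of listed skills share a name")
```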

03 The trust gap becomes a security gap

A skill is prose, and prose can be skipped, reframed, or forgotten. In paired trials, description-only framing got adversarial variants selected 77.6% of the time. Context compaction can also erase safety rules and push prohibited tool actions from 0% to as high as 59%.

Failure mode: The final answer is not proof that the plan was followed.

Sources

Ling, Zhong & Huang, Agent Skills: A Data-Driven Analysis. Verifies the 40,285-skill corpus; day-by-day growth and peak figures are derived from reported analyses of that dataset.

Microsoft Agent Skills documentation. Describes the advertise, load, resource, and script stages, including the roughly 100-token advertise tier.

OpenAI Codex Agent Skills documentation. Describes initial skill-list budgeting, description shortening or omission, and full SKILL.md loading after selection.

Under the Hood of SKILL.md. Studies semantic supply-chain attacks against skill discovery, selection, and governance.

Governance Decay. Measures context-compaction failures that erase safety constraints.

Omission Constraints Decay While Commission Constraints Persist. Analyzes why prohibition-style constraints decay under context pressure.

The real failure

The problem isn't weak skills; it's load-bearing prose.

The most important behavior in a good skill is buried in paragraphs: use this route, never substitute that tool, get approval before the destructive step, prove the test ran.

Drift: The model reinterprets the instructions from scratch on each run, so it skips, reorders, or substitutes steps.

Waste: The same guidance is reread and paid for again on every run, even when only one step matters.

Unprovability: A polished final answer is not evidence that the required path, checks, or tools were used.

You do not fix that by writing better paragraphs. Keep prose for judgment, and move the must-follow parts into a small, checkable contract beside it.

SkillSpec now

SkillSpec: import → execute → align

SkillSpec adds one small file, skill.spec.yml, next to your existing SKILL.md. The skill still works anywhere; the contract just makes the critical parts machine-checkable.

01 Assess

Doctor measures a skill before you trust it: token load, buried instructions, name collisions, missing proof, and public URL risk. It gives you a number, not a vibe.
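To make the idea of measuring a skill concrete, here is a minimal sketch of three of the checks named above: token load, buried instructions, and public URLs. It is not SkillSpec's Doctor implementation. The function name, the keyword list, the 20-line threshold, and the 4-characters-per-token estimate are all assumptions chosen for illustration.

```python
import re

URL_RE = re.compile(r"https?://[^\s)>\]]+")
# Assumption: these keywords are a rough proxy for must-follow instructions.
MUST_RE = re.compile(r"\b(must|never|always|do not)\b", re.IGNORECASE)


def assess(skill_md: str, top_lines: int = 20) -> dict:
    """Rough heuristics over a SKILL.md body; thresholds are illustrative."""
    lines = skill_md.splitlines()
    # A must-follow line counts as "buried" when it appears after the first top_lines lines.
    buried = [
        number
        for number, line in enumerate(lines, start=1)
        if number > top_lines and MUST_RE.search(line)
    ]
    return {
        "approx_tokens": len(skill_md) // 4,  # assumption: ~4 characters per token
        "public_urls": URL_RE.findall(skill_md),
        "buried_instruction_lines": buried,
    }


if __name__ == "__main__":
    body = (
        "# Deploy\n"
        + "filler\n" * 25
        + "Never push to main without approval.\n"
        + "See https://example.com/runbook for details.\n"
    )
    print(assess(body))
```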

02 Import

Convert the load-bearing prose once into routes, rules, forbids, deny-by-default tool boundaries, and regression tests that can be reviewed and versioned.
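A deny-by-default tool boundary is simple to express in code. The sketch below is a hypothetical in-memory model, not the skill.spec.yml schema, which the text above does not show; the class and field names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolBoundary:
    """Hypothetical model of a tool boundary; not the real contract schema."""
    allowed: frozenset = field(default_factory=frozenset)
    forbidden: frozenset = field(default_factory=frozenset)

    def permits(self, tool: str) -> bool:
        # An explicit forbid always wins, even if the tool is also allowed.
        if tool in self.forbidden:
            return False
        # Deny by default: anything not explicitly allowed is rejected.
        return tool in self.allowed


if __name__ == "__main__":
    boundary = ToolBoundary(
        allowed=frozenset({"read_file", "run_tests"}),
        forbidden=frozenset({"force_push"}),
    )
    for tool in ("run_tests", "curl", "force_push"):
        print(tool, "->", "allow" if boundary.permits(tool) else "deny")
```

The point of the design is that an unlisted tool such as `curl` is denied without anyone having to anticipate it in a forbid list.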

03 Execute

At run time, ask the CLI for the current route, phase, and tool boundary. The manual stays on disk; the agent only holds the slice it needs now.
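The "only the slice it needs now" idea can be illustrated with a small lookup that returns just the next unfinished phase. This is a sketch of the concept under assumed names (`Phase`, `current_slice`); it does not reproduce the SkillSpec CLI or its output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    """Hypothetical phase record: a name, its instructions, and its allowed tools."""
    name: str
    instructions: str
    allowed_tools: tuple


def current_slice(phases: list, completed: set) -> Optional[Phase]:
    """Return the first phase that is not yet completed, or None when all are done."""
    for phase in phases:
        if phase.name not in completed:
            return phase
    return None


if __name__ == "__main__":
    plan = [
        Phase("plan", "Summarize the change and list affected files.", ("read_file",)),
        Phase("test", "Run the test suite and record the exit code.", ("run_tests",)),
        Phase("release", "Request approval, then tag the release.", ("create_tag",)),
    ]
    print(current_slice(plan, completed={"plan"}))
```

Only the returned phase needs to enter the agent's context; the other phases stay on disk until they become current.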

04 Align

Replay the decision trace and compare the run to the resolved contract. The verdict is honest: aligned, partial, or unproven.
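One way to picture the three verdicts is as a comparison between the steps a contract requires and the steps a trace can prove. The mapping below is an assumption made for illustration: "aligned" when every required step has evidence, "partial" when only some do, and "unproven" otherwise. The trace shape and function name are hypothetical, and SkillSpec's real criteria may differ.

```python
def verdict(required_steps: list, trace: list) -> str:
    """Compare required steps with trace entries that carry evidence.

    Each trace entry is assumed to look like {"step": "...", "evidence": "..."}.
    """
    # A step only counts as proven when its trace entry has non-empty evidence.
    proven = {entry["step"] for entry in trace if entry.get("evidence")}
    hits = [step for step in required_steps if step in proven]
    if required_steps and len(hits) == len(required_steps):
        return "aligned"
    if hits:
        return "partial"
    # Nothing required was proven (or nothing was required): no claim can be made.
    return "unproven"


if __name__ == "__main__":
    required = ["run_tests", "get_approval"]
    trace = [
        {"step": "run_tests", "evidence": "exit code 0"},
        {"step": "get_approval", "evidence": ""},  # claimed, but no proof attached
    ]
    print(verdict(required, trace))
```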

First run explores. Every later run recalls.

The trace records which rule caused which route, fingerprinted by the resolved spec and input hash, so drift is visible when the skill changes.
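Fingerprinting a run by its resolved spec and input can be sketched with standard hashing. The example assumes the spec and input are JSON-serializable dictionaries and uses SHA-256 over a canonical JSON encoding; the actual hash, encoding, and fingerprint layout used by SkillSpec are not stated in the text, so treat every detail here as illustrative.

```python
import hashlib
import json


def _digest(obj: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so key order does not change the hash.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fingerprint(resolved_spec: dict, run_input: dict) -> str:
    """Hypothetical fingerprint: truncated spec hash, a colon, truncated input hash."""
    return f"{_digest(resolved_spec)[:16]}:{_digest(run_input)[:16]}"


if __name__ == "__main__":
    spec_v1 = {"routes": ["deploy"], "forbids": ["force_push"]}
    spec_v2 = {"routes": ["deploy"], "forbids": ["force_push", "drop_table"]}
    task = {"request": "ship release 1.4"}
    print(fingerprint(spec_v1, task))
    print(fingerprint(spec_v2, task))  # the spec half differs, so drift is visible
```

Because the two halves are hashed separately, a changed fingerprint shows whether the skill's contract or the input was what changed between runs.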

What changes

Current...
