SkillOpt – Executive Strategy for Self-Evolving Agent Skills



SkillOpt: text-space optimization for frozen agents

Executive Strategy for Self-Evolving Agent Skills. SkillOpt treats a compact natural-language skill document as the trainable state of a frozen language agent, then learns that document through rollouts, reflection, bounded edits, and held-out validation gates.


Related project: SkillLens studies model-generated agent skills. A companion project page from Microsoft Research.


Main result

52/52

Best or tied-best in every model × benchmark and harness × benchmark setting.

Target models

Benchmarks

Harnesses: Codex + Claude Code

Project Video

SkillOpt in motion.

A short visual overview of how SkillOpt treats natural-language skills as trainable artifacts: roll out, reflect, edit, validate, and export.

Promotional video for the SkillOpt project page. The static paper teaser is shown below for high-resolution inspection.

Paper Teaser

The core loop at a glance.

The teaser summarizes the SkillOpt training loop: rollout evidence, optimizer-side reflection, bounded skill edits, validation gating, and the exported reusable skill.

Figure from the SkillOpt paper. On small screens, the figure area scrolls horizontally to preserve the original details.

01 / Core Idea

Train the procedure, not the weights.

SkillOpt makes the skill document itself the optimization target. The target model, backend, and harness stay fixed; the procedure that guides evidence gathering, tool use, verification, and output formatting evolves.

A skill is external state for an agent.

Instead of fine-tuning a model or hand-maintaining prompts, SkillOpt runs the frozen agent on scored batches, asks a separate optimizer model to propose structured edits, and accepts a candidate only when validation performance improves.

Frozen target model, optimizer model, add / delete / replace edits, held-out gate

Rollout: The target model executes tasks with the current skill and records scored trajectories.

Reflect: The optimizer analyzes success and failure minibatches to find reusable procedures.

Edit: Candidate add, delete, and replace operations are merged and ranked under a budget.

Gate: The candidate skill is kept only if it improves held-out selection performance.
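The four stages above can be sketched as a small gated loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the skill is modeled as a list of rule strings, and `gated_skill_loop`, `toy_rollout`, and `toy_propose` are hypothetical names standing in for the frozen target model and the optimizer model. The toy validation set reuses the same task kinds as training purely to keep the example short.

```python
import random
from typing import Callable, List, Tuple

Skill = List[str]                      # assumed representation: one rule per line
Scored = List[Tuple[str, float]]       # (task, score) pairs from a rollout


def gated_skill_loop(
    skill: Skill,
    rollout: Callable[[Skill, List[str]], Scored],
    propose: Callable[[Skill, Scored], Skill],
    train_tasks: List[str],
    val_tasks: List[str],
    steps: int = 5,
    batch_size: int = 4,
    seed: int = 0,
) -> Tuple[Skill, float]:
    """Keep a candidate skill only if it strictly improves the validation score."""
    rng = random.Random(seed)

    def mean_score(s: Skill, tasks: List[str]) -> float:
        scored = rollout(s, tasks)
        return sum(score for _, score in scored) / len(scored)

    best = mean_score(skill, val_tasks)
    for _ in range(steps):
        batch = rng.sample(train_tasks, min(batch_size, len(train_tasks)))
        evidence = rollout(skill, batch)               # Rollout
        candidate = propose(skill, evidence)           # Reflect + Edit
        cand_score = mean_score(candidate, val_tasks)  # Gate
        if cand_score > best:
            skill, best = candidate, cand_score
    return skill, best


# Toy stand-ins (hypothetical): a task succeeds only if the skill has a rule for it.
def toy_rollout(skill: Skill, tasks: List[str]) -> Scored:
    return [(t, 1.0 if f"handle {t}" in skill else 0.0) for t in tasks]


def toy_propose(skill: Skill, evidence: Scored) -> Skill:
    failures = [t for t, s in evidence if s == 0.0]
    if not failures:
        return list(skill)             # nothing to fix; the gate will reject a no-op
    return skill + [f"handle {failures[0]}"]


tasks = ["search", "table", "math"]
final_skill, final_score = gated_skill_loop([], toy_rollout, toy_propose, tasks, tasks)
print(sorted(final_skill), final_score)
```

Because the gate uses a strict comparison, the two no-op candidates proposed after all three rules are learned are rejected, so the skill stops growing once validation stops improving.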

02 / Method

A training loop for natural-language skills.

The loop deliberately mirrors a learning algorithm: rollout evidence acts like a forward pass, reflection acts like a language-level backward pass, and the textual learning rate bounds how far the skill can move.

Evidence

Rollout batches capture messages, tool calls, verifier feedback, task metadata, and final scores.

Minibatches

Failures and successes are reflected separately so edits correct recurring errors while preserving working behavior.
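One way to picture the separate reflection streams is a helper that partitions scored rollouts before chunking them. This is a sketch only: the 0.5 success threshold, the minibatch size, and the name `split_minibatches` are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (task id, score in [0, 1]); an assumed record shape


def split_minibatches(
    scored: List[Scored], threshold: float = 0.5, size: int = 2
) -> Dict[str, List[List[Scored]]]:
    """Partition rollouts into failures and successes, then chunk each group."""
    failures = [r for r in scored if r[1] < threshold]
    successes = [r for r in scored if r[1] >= threshold]

    def chunk(rows: List[Scored]) -> List[List[Scored]]:
        return [rows[i:i + size] for i in range(0, len(rows), size)]

    return {"failure": chunk(failures), "success": chunk(successes)}


rollouts = [("t1", 0.0), ("t2", 1.0), ("t3", 0.2), ("t4", 0.9), ("t5", 0.1)]
batches = split_minibatches(rollouts)
print(batches["failure"])  # minibatches to mine for recurring errors
print(batches["success"])  # minibatches to mine for behavior worth preserving
```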

Bounded Edits

An edit budget functions as a textual learning rate, preventing useful rules from being overwritten by broad rewrites.
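A minimal sketch of a budgeted edit step follows, assuming the skill is a list of rule strings and that each proposed edit carries a ranking score. The `Edit` fields, the `priority` score, and `apply_bounded` are hypothetical; the source says only that add, delete, and replace operations are merged and ranked under a budget.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    op: str                       # "add", "delete", or "replace"
    text: str = ""                # new rule text for add / replace
    target: Optional[int] = None  # index of the rule for delete / replace
    priority: float = 0.0         # hypothetical ranking score from the optimizer


def apply_bounded(skill: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply at most `budget` edits, highest priority first."""
    chosen = sorted(edits, key=lambda e: e.priority, reverse=True)[:budget]
    result = list(skill)

    # Replacements first, while indices still refer to the original rules.
    for e in chosen:
        if e.op == "replace" and e.target is not None and 0 <= e.target < len(result):
            result[e.target] = e.text

    # Deletions from the highest index down so earlier indices stay valid.
    delete_targets = {e.target for e in chosen if e.op == "delete" and e.target is not None}
    for idx in sorted(delete_targets, reverse=True):
        if 0 <= idx < len(result):
            del result[idx]

    # Additions are appended at the end.
    result.extend(e.text for e in chosen if e.op == "add")
    return result


skill = ["cite sources", "answer briefly", "check units"]
proposed = [
    Edit("replace", "answer in one sentence", 1, 0.9),
    Edit("delete", target=2, priority=0.2),
    Edit("add", "verify totals before answering", priority=0.7),
    Edit("delete", target=0, priority=0.1),
]
small_step = apply_bounded(skill, proposed, budget=2)
large_step = apply_bounded(skill, proposed, budget=4)
print(small_step)
print(large_step)
```

With a budget of 2 only the two top-ranked edits land and all original rules survive in some form; with a budget of 4 the same proposal set removes two of the three original rules, which is the kind of broad rewrite the budget is meant to limit.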

Memory

Rejected edits, a slow update, and an optimizer-side meta skill provide longer-horizon feedback without bloating deployment.
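As an illustration of the rejected-edit idea, the sketch below keeps a small fixed-size history of edits that failed the gate and renders it as text that could be shown to the optimizer. The class name `RejectedBuffer`, its capacity, and the feedback format are assumptions; the source does not describe how the buffer is stored or presented.

```python
from collections import deque
from typing import Deque, Tuple


class RejectedBuffer:
    """Fixed-size memory of edits that failed the validation gate (hypothetical design)."""

    def __init__(self, capacity: int = 3) -> None:
        self._items: Deque[Tuple[str, float]] = deque(maxlen=capacity)

    def record(self, edit_text: str, val_delta: float) -> None:
        # The oldest entry is dropped automatically once capacity is reached.
        self._items.append((edit_text, val_delta))

    def seen(self, edit_text: str) -> bool:
        return any(text == edit_text for text, _ in self._items)

    def as_feedback(self) -> str:
        return "\n".join(f"- rejected ({d:+.1f}): {t}" for t, d in self._items)


buf = RejectedBuffer(capacity=2)
buf.record("always answer in bullet points", -1.5)
buf.record("skip verification for short tasks", -4.0)
buf.record("quote the full document", 0.0)
print(buf.as_feedback())
```

The buffer lives on the optimizer side only, so the deployed skill document does not grow as rejections accumulate.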

SkillOpt pipeline from the paper. The frozen target model executes with the current skill; the optimizer model proposes bounded edits; held-out validation decides whether the candidate becomes the new current skill.

03 / Main Results

SkillOpt improves GPT and Qwen target models.

The table reports main-result gains across target models and execution harnesses, comparing no-skill execution with the final SkillOpt skill on held-out test splits.

| Target model | Harness | SearchQA | Sheet | Office | DocVQA | LiveMath | ALFWorld | Avg gain |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | Direct chat | +9.6 | +38.9 | +39.0 | +12.4 | +29.3 | +11.9 | +23.5 |
| GPT-5.4 | Direct chat | +6.2 | +21.1 | +12.8 | +13.6 | +7.2 | +15.6 | +12.8 |
| GPT-5.4-mini | Direct chat | +4.3 | +11.4 | +26.7 | +16.5 | +4.8 | +12.7 | +12.7 |
| GPT-5.4-nano | Direct chat | +19.0 | +8.2 | +33.7 | +49.4 | +4.0 | +35.1 | +24.9 |
| GPT-5.2 | Direct chat | +11.2 | +18.9 | +21.5 | +16.5 | +15.2 | +16.4 | +16.6 |
| Qwen3.5-4B | Direct chat | +3.1 | +14.6 | +15.2 | +2.1 | +29.6 | +50.7 | +19.2 |
| Qwen3.6-35B-A3B | Direct chat | +7.6 | +9.3 | +1.2 | +3.8 | +10.4 | +22.4 | +9.1 |
| GPT-5.5 | Codex | +5.5 | +57.5 | +12.8 | +5.0 | +28.0 | N/A | +21.8 |
| GPT-5.5 | Claude Code | +4.0 | +58.3 | +13.9 | +3.5 | +13.3 | N/A | +18.6 |
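The page does not say how the "Avg gain" column is computed, but an unweighted mean over the benchmarks that have a value (skipping N/A) reproduces the listed figures. The sketch below checks that reading; the helper name `avg_gain` and the half-up rounding mode are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional, Sequence


def avg_gain(gains: Sequence[Optional[float]]) -> Decimal:
    """Mean of the available per-benchmark gains, rounded to one decimal place."""
    # Decimal(str(x)) keeps the printed table values exact, avoiding float ties.
    vals = [Decimal(str(g)) for g in gains if g is not None]
    mean = sum(vals, Decimal(0)) / len(vals)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Rows copied from the table above; None marks an N/A cell.
gpt54_direct = [6.2, 21.1, 12.8, 13.6, 7.2, 15.6]
gpt55_codex = [5.5, 57.5, 12.8, 5.0, 28.0, None]

print(avg_gain(gpt54_direct))  # compare with the +12.8 listed for GPT-5.4
print(avg_gain(gpt55_codex))   # compare with the +21.8 listed for GPT-5.5 on Codex
```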

Method comparison: SkillOpt clears the strongest baseline on every benchmark.

04 / Ablations

The controls are doing real work.

The paper isolates the optimizer components that keep skill learning stable: enough evidence, bounded textual updates, rejected-edit feedback, slow update, and optimizer-side memory.

| Component | Setting | SearchQA | Spreadsheet | LiveMath |
|---|---|---|---|---|
| Learning rate | lr=4 default | 87.1 | 77.5 | 61.3 |
| Learning rate | without lr | 84.6 | 75.7 | 57.3 |
| Rejected buffer | with buffer | 87.1 | 77.5 | 61.3 |
| Rejected buffer | without buffer | 85.5 | 72.9 | 58.9 |
| Update memory | meta skill + slow update | 87.1 | 77.5 | 61.3 |
| Update memory | without both | 86.3 | 55.0 | 59.7 |
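To read the ablation rows as drops relative to the default configuration, the short sketch below subtracts each ablated row from the default row. The numbers are copied from the table; the dictionary keys and the `drops` helper are illustrative names only.

```python
from typing import Dict, Tuple

BENCHMARKS = ("SearchQA", "Spreadsheet", "LiveMath")
DEFAULT = (87.1, 77.5, 61.3)  # default configuration row
ABLATED: Dict[str, Tuple[float, float, float]] = {
    "learning rate": (84.6, 75.7, 57.3),    # without lr
    "rejected buffer": (85.5, 72.9, 58.9),  # without buffer
    "update memory": (86.3, 55.0, 59.7),    # without meta skill and slow update
}


def drops(
    default: Tuple[float, ...], ablated: Dict[str, Tuple[float, ...]]
) -> Dict[str, Tuple[float, ...]]:
    """Score lost on each benchmark when a component is removed."""
    return {
        name: tuple(round(d - a, 1) for d, a in zip(default, row))
        for name, row in ablated.items()
    }


DROPS = drops(DEFAULT, ABLATED)
worst = max(
    ((name, bench, drop) for name, row in DROPS.items() for bench, drop in zip(BENCHMARKS, row)),
    key=lambda item: item[2],
)
for name, row in DROPS.items():
    print(name, dict(zip(BENCHMARKS, row)))
print("largest single drop:", worst)
```

Every component costs something on every listed benchmark when removed, and the largest single drop is on Spreadsheet when both update-memory mechanisms are disabled.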

What the ablations say

Bounded: Textual learning rates prevent destructive rewrites while keeping enough plasticity to learn new...
