Improving agents from trajectories, in token space, with no weight updates

Specifying the agent: closing the gap to frontier with a context engine and meta-distillation
Kinesthetic

TL;DR

Frontier models (and, following them, open-source models) will keep dissolving most general failure modes in agents. What’s left is specification failure: the agent does the wrong thing because it was never told the right thing, or was told something contradictory. Enterprise teams face growing contexts and prompts as agents take on more complex work, and making that material usable for agents is hard. We believe that the specification from which agent context is derived must be a first-class artifact: a spec you can author, correct, and retrieve from is a durable asset that compounds. We study search and learning algorithms that attack this directly on the τ³-bench banking agent suite, the FinanceBench benchmark, and the Harvey Legal Agent Benchmark, across two open / open-weight backbones.

The Context Engine is a retrieval sidecar that scales test-time compute on retrieval instead of handing the model raw tools. By investing offline/test-time compute to survey the data, build its own structures and artifacts, and tailor search to the task, it returns fewer, better tokens to the model.

It lifts retrieval F1 on both backbones, and with it the action-check pass rate. On Mistral Large 3, recall and precision rise together; on GPT-OSS 120B, the stronger retriever, it trades some recall for ~11× precision and cuts the work wasted on agentic search tool calls.
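To make the recall-for-precision trade concrete, here is a minimal sketch of how set-based retrieval precision, recall, and F1 are computed for a single query. The document IDs and the two retrieval results are invented for illustration and are not taken from the benchmarks; the function name `retrieval_scores` is also hypothetical.

```python
def retrieval_scores(retrieved, relevant):
    """Return (precision, recall, F1) for one query over sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical query with 4 relevant documents (IDs invented for illustration).
relevant = {"d1", "d2", "d3", "d4"}

# Broad search: finds every relevant document but drags in 36 distractors.
broad = relevant | {f"x{i}" for i in range(36)}

# Narrow search: misses one relevant document and returns a single distractor.
narrow = {"d1", "d2", "d3", "x0"}

for name, retrieved in [("broad", broad), ("narrow", narrow)]:
    p, r, f1 = retrieval_scores(retrieved, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy case the narrow result gives up a quarter of the recall (1.00 to 0.75) yet raises F1 from about 0.18 to 0.75, because precision climbs from 0.10 to 0.75. It also hands the model 4 documents instead of 40, which is the "fewer, better tokens" effect in miniature.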

Meta-distillation builds on the Context Engine by distilling reusable procedural guidance from solved trajectories, in token space. This lets feedback and reward signals propagate and improve both the context itself and the search mechanism used to retrieve it.

On both models, the action-check pass rate improves significantly, exceeding the gains from the Context Engine alone.
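The general shape of learning in token space can be sketched as follows: solved trajectories are turned into short guidance notes, stored in a buffer, and the most relevant notes are prepended to the prompt for a new task. This is a toy illustration under stated assumptions, not the method described here. Every name (`Trajectory`, `distill`, `build_buffer`, `select_guidance`, `build_prompt`) is hypothetical, the distiller is a placeholder for what would presumably be a model call, and word overlap stands in for whatever retrieval the real system uses.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str          # natural-language task description
    steps: list[str]   # actions the agent took, in order
    solved: bool       # whether the episode passed its check


def distill(traj: Trajectory) -> str:
    """Placeholder distiller (assumption): a real system would likely ask a
    model to summarize the trajectory; here the steps are simply joined."""
    return f"For tasks like '{traj.task}': " + " -> ".join(traj.steps)


def build_buffer(trajectories):
    """Keep one (task, guidance) note per solved trajectory."""
    return [(t.task, distill(t)) for t in trajectories if t.solved]


def _words(text: str) -> set[str]:
    return set(text.lower().split())


def select_guidance(buffer, task: str, k: int = 1) -> list[str]:
    """Rank notes by word overlap with the new task (a crude stand-in for
    real retrieval) and return the top k that overlap at all."""
    scored = [(len(_words(src) & _words(task)), note) for src, note in buffer]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [note for _, note in scored[:k]]


def build_prompt(base_prompt: str, task: str, buffer) -> str:
    """Inject guidance as plain tokens; the model's weights never change."""
    notes = select_guidance(buffer, task)
    parts = [base_prompt]
    if notes:
        parts.append("Guidance:\n" + "\n".join(f"- {n}" for n in notes))
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


# Invented example trajectories, for illustration only.
trajs = [
    Trajectory("dispute a card charge",
               ["verify identity", "look up transaction", "file dispute"], True),
    Trajectory("close a savings account", ["close account"], False),
]
buffer = build_buffer(trajs)
prompt = build_prompt("You are a banking agent.",
                      "dispute a duplicate card charge", buffer)
print(prompt)
```

Only the solved trajectory contributes a note, and the improvement lives entirely in the prompt text, so the buffer can be inspected, edited, or regenerated without retraining anything.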

Figure (τ³-bench · banking): Both levers convert spec into correct action. Action-check pass rate (%), axis 0% to 50%. Each bar starts at its backbone's all-tools baseline (muted) and stacks the gain the method adds on top.

Legend: all-tools baseline; + Context Engine; + Meta-distillation; + CE & MD (Mistral).

Mistral Large 3 (non-reasoning), base 28.2%: 41.5, 45.7, 46.2.

GPT-OSS 120B (reasoning), base 6.0%: 15.0, 16.8.

Action-check pass rate (Action r+w). Mistral: baseline & CE pooled over hold-out data (n=64); meta-distillation and the combined CE+MD arm on the val split (n=32). GPT-OSS: val hold-out (n=31–32); MD is the backbone-native buffer distilled from GPT-OSS's own trajectories.

FinanceBench + Harvey LAB: the same two levers, two more domains

Beyond banking, each method carries its headline result into a domain that isolates it: retrieval quality on financial filings, procedural learning on legal work.

FinanceBench · Context Engine. Page-level retrieval F1, vanilla dense search vs. the Context Engine: 0.29 → 0.55 (best vanilla → CE), ~4–5× more precise at equal recall. The planner + entity graph + shell does the work; bespoke chunking didn't beat whole pages.

Harvey LAB · Meta-distillation. Rubric pass rate, weak Mistral student with no guidance vs. distilled how-tos: 0.35 → 0.45 (+0.10 over baseline). Rescue the weak, hold the strong: tasks the bare student couldn't do jump +0.37 (0.01 → 0.38); tasks it already handled stay flat.

Both results come from offline-mined teacher work, with no weight updates. Full breakdowns are in Stage 2 (Context Engine) and Stage 3 (meta-distillation) below.

The problem: specification failure

As base models improve, the failures that survive are increasingly not about raw capability. The model can read, plan, and call tools; what it lacks is the organization’s specific, often unwritten, specification of correct behavior: the rules, edge cases, company style, tool contracts, and worked examples that determine whether an action is right here and now. That knowledge lives in docs, prompts, people’s heads, and scattered traces. Nobody can audit it, and corrections to it don’t durably stick or scale.

We call this specification failure, and we think it is the dominant remaining failure mode for enterprise agents. The thesis behind Kinesthetic is that the response is not a bigger model or a cleverer prompt, but treating context (the spec) as a first-class artifact with intentionally designed interfaces for its human and agent users: something you index, retrieve from, measure, and correct, with a clear authority gradient from human-authored ground truth down through derived, regenerable machinery.
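One way to picture an authority gradient is as a store where each spec entry carries an authority level and conflicts resolve toward the human-authored entry. The sketch below is an assumption made for illustration, not a description of Kinesthetic's design: the two-level `Authority` enum, the `SpecEntry` type, the `resolve` function, and the example rules are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Authority(IntEnum):
    DERIVED = 1         # regenerable machinery, e.g. indexes or summaries
    HUMAN_AUTHORED = 2  # ground truth written or corrected by a person


@dataclass(frozen=True)
class SpecEntry:
    key: str
    text: str
    authority: Authority


def resolve(entries):
    """For each key, keep the entry with the highest authority."""
    best = {}
    for e in entries:
        if e.key not in best or e.authority > best[e.key].authority:
            best[e.key] = e
    return best


# Invented entries, for illustration only.
entries = [
    SpecEntry("refund_limit", "Refunds over $500 need approval.", Authority.DERIVED),
    SpecEntry("refund_limit", "Refunds over $200 need approval.", Authority.HUMAN_AUTHORED),
    SpecEntry("greeting_style", "Greet customers by first name.", Authority.DERIVED),
]
spec = resolve(entries)
print(spec["refund_limit"].text)  # Refunds over $200 need approval.
```

Because derived entries never override human-authored ones, they can be dropped and regenerated at any time, while a human correction sticks once it is written.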

Vanilla agentic search and context engineering fail to capture the nuance of the data they work with. Giving an agent generic search tools with no understanding of what the data contains, how it’s structured, or how it should be interpreted for a given scenario is like giving an explorer a compass but no map: the agent can wander forever, bloating the context (and exhausting the budget) with redundant search calls. Additionally, we believe that past trace data (or any historical data of...
