Better Models: Worse Tools | Armin Ronacher's Thoughts and Writings
Better Models: Worse Tools
written on July 04, 2026
A very strange Pi issue<br>sent me down a rabbit hole over the last two days. The short version is that<br>newer Claude models sometimes call Pi’s edit tool with extra, invented fields in<br>the nested edits[] array. And not Haiku or some small model: Opus 4.8. The<br>edit itself is usually correct, but because the model adds made-up keys the<br>arguments no longer match the schema, so Pi rejects the tool call and asks the<br>model to try again.
That alone is not too surprising as models emit malformed tool calls sometimes.<br>Particularly small ones. What surprised me is that this is getting worse with<br>newer Anthropic models: both Opus 4.8 and Sonnet 5 show it, but none of the<br>older models do. In other words, the SOTA models of the family are worse at this<br>specific tool schema than their older siblings.
In case you are curious about Fable: I intentionally did not test it because I<br>was not sure if the classifiers they are running might downgrade me to Opus<br>silently.
Tool Calls Are Text
If you have not spent too much time looking at LLM tool calling internals, the<br>important thing to understand is that tool calls are not magic and use some<br>rather crude in-band signalling. The model receives a transcript, a system<br>prompt and a list of available tools. The server munches that into a large<br>prompt with special marker tokens. Because the model was trained and<br>reinforced on examples of that format, at some point during generation it emits<br>something that the API or client interprets as "call this tool with these<br>arguments".
For a file edit tool, the intended invocation payload might say something like<br>this:
{<br>"path": "some/file.py",<br>"edits": [<br>{<br>"oldText": "text to replace",<br>"newText": "replacement text"<br>}<br>]<br>}
A harness then validates the arguments, performs the edit, and feeds the result<br>back into the model. If validation fails, the model sees an error and usually<br>tries again.
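The validation step can be sketched in a few lines. This is a minimal illustration, not Pi's actual implementation: the function name `validate_edit_args` and the error wording are made up for this example, and only the nested edit objects are checked for unknown keys.

```python
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def validate_edit_args(args):
    """Return a list of error strings; an empty list means the call is valid."""
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    errors = []
    if not isinstance(args.get("path"), str):
        errors.append("path: expected a string")
    edits = args.get("edits")
    if not isinstance(edits, list) or not edits:
        errors.append("edits: expected a non-empty array")
        return errors
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}]: expected an object")
            continue
        for key in ("oldText", "newText"):
            if not isinstance(edit.get(key), str):
                errors.append(f"edits[{i}].{key}: expected a string")
        # Strict mode: any key outside the schema makes the whole call invalid.
        for key in sorted(set(edit) - ALLOWED_EDIT_KEYS):
            errors.append(f"edits[{i}]: unexpected property {key!r}")
    return errors


good = {"path": "some/file.py", "edits": [{"oldText": "a", "newText": "b"}]}
bad = {
    "path": "some/file.py",
    "edits": [{"oldText": "a", "newText": "b", "requireUnique": True}],
}

good_errors = validate_edit_args(good)
bad_errors = validate_edit_args(bad)
print(good_errors)
print(bad_errors)
```

A harness built like this would send the error list back to the model as the tool result, which is what triggers the retry.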
How exactly that formatting happens is not known for the Anthropic models, but<br>some people have managed to extract "ANTML" markers, and these also leak into<br>public communications at times. To the best of my knowledge, the call above would come<br>out serialized like this from the model:
name="edit"><br>name="path">some/file.py<br>name="edits"><br>"oldText": "text to replace",<br>"newText": "replacement text"
An important thing to note here is that this thing, while looking like XML, is<br>not really XML. It’s just a thing they found convenient to tokenize and train<br>on. The other thing to note is that a basic top-level string parameter appears<br>in-line whereas an array of objects is implemented via JSON serialization.<br>While I’m not entirely sure that this is how it works, there are some<br>indications that this is not too far off. This will become relevant later.
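The inline-versus-JSON split can be made concrete with a small sketch. It encodes only the assumption stated above (plain strings inline, everything else JSON-serialized); the function name `render_parameter_value` is hypothetical, and the surrounding marker syntax is deliberately left out because it is not publicly documented.

```python
import json


def render_parameter_value(value):
    """Render one tool argument the way the text above guesses it is rendered."""
    # Assumption: top-level strings appear verbatim, with no quoting or escaping.
    if isinstance(value, str):
        return value
    # Assumption: arrays and objects are embedded as a JSON document.
    return json.dumps(value)


args = {
    "path": "some/file.py",
    "edits": [{"oldText": "text to replace", "newText": "replacement text"}],
}

rendered = {name: render_parameter_value(value) for name, value in args.items()}
for name, text in rendered.items():
    print(name, "->", text)
```

If this guess is right, the model writes the nested edits as free-form JSON text inside a parameter body, so nothing in the format itself stops it from appending another key.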
There are two very different ways to make the model produce a structure like<br>this:
1. You can ask the model to produce valid JSON matching a schema and then<br>validate it afterwards.
2. You can constrain the sampler so that invalid JSON, or even invalid schema<br>shapes, cannot be sampled in the first place.
The second approach is what people usually refer to as grammar-aware or<br>constrained decoding. The sampler masks out tokens that would violate the<br>grammar. If the model is currently inside a JSON object and the schema says<br>only oldText and newText are allowed, the sampler can prevent it from<br>emitting "in_file" or "type". Grammar-aware decoding can be used both to<br>constrain something to be syntactically valid JSON and also to enforce specific<br>enum values or keys.
Without any form of constraints the model is merely following a learned<br>convention.
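To illustrate the difference, here is a toy version of the masking step. It is a heavily simplified sketch: real samplers work on sub-word tokens and a full grammar state, whereas this example treats whole keys and the closing brace as candidates, and the scores are invented for the demonstration.

```python
import math

SCHEMA_KEYS = {'"oldText"', '"newText"'}


def allowed_next(emitted):
    """Candidates the schema permits after the keys emitted so far."""
    allowed = SCHEMA_KEYS - emitted
    # The object may only be closed once every schema key has been written.
    if SCHEMA_KEYS <= emitted:
        allowed = allowed | {"}"}
    return allowed


def mask_scores(scores, allowed):
    """Push every disallowed candidate to -inf so it can never be picked."""
    return {tok: (s if tok in allowed else -math.inf) for tok, s in scores.items()}


def pick(scores):
    """Greedy choice: the highest-scoring candidate."""
    return max(scores, key=scores.get)


# Hypothetical model preferences after both real keys have been emitted.
scores = {'"requireUnique"': 2.1, "}": 1.9, '"type"': 0.3}
emitted = {'"oldText"', '"newText"'}

unconstrained = pick(scores)
constrained = pick(mask_scores(scores, allowed_next(emitted)))
print(unconstrained)
print(constrained)
```

In the unconstrained case the invented key wins because the model happens to prefer it; with the mask applied, closing the object is the only candidate left.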
The Failure
Pi’s edit tool supports multiple exact string replacements in one call. That is<br>why the arguments contain an edits array. In the failing cases the model<br>produces entries like this:
{<br>"oldText": "...",<br>"newText": "...",<br>"requireUnique": true<br>}
or this:
{<br>"oldText": "...",<br>"newText": "...",<br>"oldText2": "",<br>"newText2": ""<br>}
Across repeated trials I saw a whole zoo of invented trailing keys: type,<br>id, kind, unique, requireUnique, matchCase, in_file,<br>forceMatchCount, children, notes, cost, oldText2, newText2,<br>oldText_2, newText_2, and even an event.0.additionalProperties key inside<br>the edit object itself.
The most annoying part is that the actual oldText and newText payloads were<br>byte-correct in the invalid calls I inspected. The model had in fact produced<br>the right invocation but then added nonsense at the end of the object.
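One way to see this in a captured call is to split each edit object into the part the schema knows and the part the model made up. The helper names below are hypothetical and this is only an analysis aid, not something Pi does.

```python
SCHEMA_KEYS = ("oldText", "newText")


def split_edit(edit):
    """Separate the schema-defined fields from any invented ones."""
    core = {k: edit[k] for k in SCHEMA_KEYS if k in edit}
    extras = {k: v for k, v in edit.items() if k not in SCHEMA_KEYS}
    return core, extras


def is_correct_but_rejected(edit):
    """True if the edit is complete on its own and fails only due to extra keys."""
    core, extras = split_edit(edit)
    complete = all(isinstance(core.get(k), str) for k in SCHEMA_KEYS)
    return complete and bool(extras)


failing = {"oldText": "...", "newText": "...", "oldText2": "", "newText2": ""}
core, extras = split_edit(failing)
print(core)
print(extras)
print(is_correct_but_rejected(failing))
```

For the invalid calls described here, the core part is a complete, usable edit and the extras are pure noise.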
The failure is also heavily context-dependent. A fresh single-turn prompt like<br>"edit this file" did not reproduce it at all for me. An agentic history where the<br>model had read files, diagnosed a problem and then composed a multi-line edit<br>could reproduce it. And more annoyingly, not all transcripts will show that behavior.<br>In fact, I needed Petr Baudis’s transcripts...