Agentic RL: Token-In, Token-Out Done Right




Never re-encode what you decoded: the whole trick to correct multi-turn RL.


You’re training an LLM with RL. Single-turn looks great: clean curves, sane rewards, things converge. But modern models work with tools, and that’s what you actually want to train: an agent.

So you upgrade your training loop to allow the model to call a tool mid-rollout. You start with an easy task, and the curves get weird. Loss occasionally spikes for no obvious reason. And eventually it fails with a shape mismatch error.

What’s almost certainly going on: your rollout loop is violating the Token-In, Token-Out (TITO) invariant. You parsed the model’s response to detect tool calls, then re-tokenized the updated conversation for the next turn. Usually that round-trip gives back the same tokens. Sometimes it doesn’t, and the gradient ends up on a sequence the model never sampled. Most of the time nothing crashes, but the update is computed on tokens the policy did not produce, so the gradient signal can no longer be trusted.

Two ways to fix it.

The first is to abstract the chat template behind a per-model interface. For every family you train on, you hand-code a renderer that knows how to format messages, parse completions, and bridge between turns without re-rendering. It’s tricky to get right. The renderers library does this. It works, and it covers the major open-weights families today. The cost is structural: every new model needs a new hand-coded renderer, and changes to any template propagate as ongoing maintenance.

The second is to design the training around one rule: never re-encode tokens you’ve decoded. Follow it, and the tricky edge cases vanish. You’re left with a single property to check on the chat template: it must be prefix-preserving for tool messages (we’ll explain). Turns out the vast majority of templates in the wild already satisfy it. This is Token-In, Token-Out done right, and that’s what this post is about.

Train on the model’s own tokens

tl;dr RL updates the model on the exact tokens it sampled, and nothing else. Simple now, load-bearing later.

Reinforcement learning, in one breath: you sample a prompt, the model generates a completion, you score the completion, and you backprop the resulting loss through the model, on the generated tokens only.

Single-turn RL loop.

1. sample prompt: [{"role": "user", "content": "What's 2+2?"}]
2. tokenize prompt: token IDs for "What's 2+2?"
3. generate completion: token IDs for "4."
4. compute reward: +1
5. backprop on assistant tokens: ∇ on the token IDs of "4."

One detail matters more than it looks. The gradient is computed on the tokens the model generated. That sounds obvious. What else would you train on? It is obvious. Remember it anyway, because you’re going to break it sooner than you think.
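To make that concrete, here is a minimal sketch of a single-turn loss that only touches completion positions. It assumes a plain REINFORCE-style objective, which the post does not name, and the per-token log-probs are made-up numbers; `reinforce_loss` and `completion_logprob` are hypothetical helpers, not part of any library.

```python
import math

def completion_logprob(token_logprobs, prompt_len):
    """Sum log-probs over completion positions only; prompt positions are skipped."""
    return sum(token_logprobs[prompt_len:])

def reinforce_loss(token_logprobs, prompt_len, reward):
    """REINFORCE-style surrogate loss: -reward * log p(completion | prompt)."""
    return -reward * completion_logprob(token_logprobs, prompt_len)

# Hypothetical log-probs for a 4-token prompt followed by a 2-token completion.
token_logprobs = [math.log(0.5)] * 4 + [math.log(0.25), math.log(0.5)]

# Only the last two entries (the sampled completion) contribute to the loss.
loss = reinforce_loss(token_logprobs, prompt_len=4, reward=1.0)
print(loss)
```

The important part is the slice: whatever objective you actually use, the positions it sums over have to be the tokens the policy sampled.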

Multi-turn doesn’t change much. The model is allowed to call a tool mid-rollout: it emits a tool call, something on the outside runs the tool, the result is appended back into the conversation, and the model picks up from there. The rollout is just longer now: a few model turns, a few tool turns, a final answer.

Multi-turn RL loop, with a tool call.

1. sample prompt: [{"role": "user", "content": "What's 2+2?"}]
2. tokenize prompt: token IDs for "What's 2+2?"
3. generate completion: token IDs for "calc(2+2)"
4. execute tool and append result: token IDs for "4"
5. generate completion: token IDs for "4."
6. compute reward: +1
7. backprop on assistant tokens: ∇ on the token IDs of "calc(2+2)" and "4.", not on the tool result

The rule carries over: backprop on the tokens the model produced, and not on the tool’s response tokens, which didn’t come from the policy.
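One way to keep that rule honest is to carry a loss mask next to the token IDs and only ever append to both. The sketch below is an illustration under assumptions, not the post's implementation: `rollout`, `generate`, `parse_tool_call`, and `run_tool` are hypothetical names, the policy is scripted, and the token IDs are invented. How a tool result becomes tokens in a real setup depends on the chat template, which this toy ignores.

```python
from typing import Callable, List, Optional, Tuple

def rollout(
    prompt_ids: List[int],
    generate: Callable[[List[int]], List[int]],
    parse_tool_call: Callable[[List[int]], Optional[str]],
    run_tool: Callable[[str], List[int]],
    max_turns: int = 4,
) -> Tuple[List[int], List[int]]:
    """Append-only rollout: sampled IDs are stored as-is, never re-encoded."""
    ids = list(prompt_ids)
    mask = [0] * len(ids)                 # prompt tokens: no gradient
    for _ in range(max_turns):
        completion = generate(ids)
        ids += completion
        mask += [1] * len(completion)     # policy tokens: gradient
        call = parse_tool_call(completion)  # parsing only reads the completion
        if call is None:
            break
        tool_ids = run_tool(call)
        ids += tool_ids
        mask += [0] * len(tool_ids)       # tool tokens: no gradient
    return ids, mask

# Scripted stand-ins with invented token IDs.
CALL, RESULT, FINAL = [20, 21, 99], [30, 99], [40, 99]

def generate(ids):
    # Call the tool first, then answer once the tool result ends the context.
    return FINAL if ids[-len(RESULT):] == RESULT else CALL

def parse_tool_call(completion):
    return "calc(2+2)" if completion == CALL else None

def run_tool(call):
    return RESULT

ids, mask = rollout([10, 11, 12], generate, parse_tool_call, run_tool)
trained = [t for t, m in zip(ids, mask) if m]
print(ids, mask, trained)
```

The design choice worth noticing is that the parser's output is used to decide what to do next and nothing else. It never flows back into `ids`.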

The takeaway is small and very specific: in RL, you optimize on the exact tokens the model produced. Right now it reads like a definition. Later in the post, it’s the thing that breaks.

Decoding doesn’t undo encoding

tl;dr Tokenization isn’t reversible: decode a sequence, re-encode the text, and you can land on different tokens.
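A toy example shows why. The vocabulary and the greedy longest-match encoder below are invented for illustration (real tokenizers use learned merge rules, not this scheme), but the failure has the same shape: the model is free to sample any token sequence, while the encoder always returns one canonical segmentation of the text.

```python
# Toy vocabulary: "ab" exists as a single token, and also as "a" + "b".
VOCAB = {0: "a", 1: "b", 2: "ab"}
TEXT_TO_ID = {text: i for i, text in VOCAB.items()}

def decode(ids):
    """Concatenate the text of each token."""
    return "".join(VOCAB[i] for i in ids)

def encode(text):
    """Greedy longest-match encoding over the toy vocabulary."""
    ids, pos = [], 0
    pieces = sorted(TEXT_TO_ID, key=len, reverse=True)  # try longest pieces first
    while pos < len(text):
        piece = next(p for p in pieces if text.startswith(p, pos))
        ids.append(TEXT_TO_ID[piece])
        pos += len(piece)
    return ids

sampled = [0, 1]           # the model happened to sample "a" then "b"
text = decode(sampled)     # same text as the single token "ab"
reencoded = encode(text)   # the encoder prefers the single token

print(sampled, repr(text), reencoded)
```

The text survives the round-trip; the token IDs do not. Training on `reencoded` would mean training on a sequence the model never sampled.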

Going from messages to tokens is mechanical: a chat template renders the messages into a string, then the tokenizer chops that string into integer IDs.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
>>> messages = [
...     {"role": "user", "content": "What's 2+2?"},
...     {"role": "assistant", "content": "4."}
... ]
>>> tokenizer.apply_chat_template(messages, return_dict=False)
[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 594, 220, 17, 10, 17, 30, 151645, 198, 151644, 77091, 198, 19, 13, 151645, 198]

Most of the time you don’t think about it. You feed messages, you get tokens, the model does its thing.

Multi-turn is where it starts to matter. When the assistant emits tokens, you don’t know whether it’s about to call a tool until you look. So you decode the generated IDs back into text, parse out the structure, dispatch the call. The pipeline runs backwards,...

