No Token Left Behind: Demystifying Token-In-Token-Out in Miles
Miles Team: Jiajun Li, Yanbin Jiang, Mao Cheng, Shi Dong, Yusheng Su, Yueming Yuan, Zhichen Zeng, Banghua Zhu
June 5, 2026
In agentic RL, a rollout is not a single generation. It is a chain of model calls, tool outputs, harness messages, and resumed generations. Token-In-Token-Out (TITO) is a design principle that addresses one critical source of training–inference mismatch in this process: whether the trainer evaluates the exact same token sequence that the inference engine consumed and produced during rollout. In this blog post, we clarify how we define the TITO principle, why it is important in RL training, and how this principle is instantiated in the Miles framework.
Definition of TITO
In an agentic rollout, the model repeatedly interacts with an external environment. In a simplified setting, the model first receives a task description and generates tokens, which may include reasoning and a tool call. The agent runtime parses the tool call, sends it to the corresponding environment or tool backend, and returns the result as a new observation. The model then continues from that observation and may issue another tool call. This loop repeats until the task is complete.
Note that the process involves multiple separate calls to the inference engine, colloquially called turns. In each turn, the engine is prompted with a token sequence and generates another token sequence. We say that the TITO principle is fulfilled if, for all $n$, the total token sequence in turn $n-1$ (prompt + response) is a bit-perfect prefix of the prompt token sequence in turn $n$. The idea is illustrated in the following diagram.
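The prefix condition above can be checked mechanically. The following is a minimal sketch, assuming each turn is recorded as a pair of token-ID lists; the `Turn` container and `first_tito_violation` helper are hypothetical names for illustration, not part of Miles.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One call to the inference engine (hypothetical container, not a Miles type)."""
    prompt_ids: List[int]    # tokens the engine consumed
    response_ids: List[int]  # tokens the engine produced


def first_tito_violation(turns: List[Turn]) -> Optional[int]:
    """Return the index n of the first turn whose prompt does not start with
    turn n-1's full sequence (prompt + response), or None if TITO holds."""
    for n in range(1, len(turns)):
        previous = turns[n - 1].prompt_ids + turns[n - 1].response_ids
        if turns[n].prompt_ids[:len(previous)] != previous:
            return n
    return None


# Turn 1 extends turn 0's full sequence with observation tokens [7, 8].
ok = [Turn([1, 2], [3, 4]), Turn([1, 2, 3, 4, 7, 8], [9])]
# Here token 4 was replaced by 5 when the prompt for turn 1 was rebuilt.
broken = [Turn([1, 2], [3, 4]), Turn([1, 2, 3, 5, 7, 8], [9])]

print(first_tito_violation(ok))      # None
print(first_tito_violation(broken))  # 1
```

The check compares token IDs rather than decoded strings, since two different ID sequences can decode to the same text.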
Why TITO matters?
Training Efficiency: One Sample Per Task
In agentic RL, where a single task can have dozens of turns, we essentially have two options to package data for the RL trainer:
1. One Sample Per Turn: Each turn is treated as an independent training sample.
2. One Sample Per Task: All turns are "glued" together into a single, contiguous sequence.
Let us compare both options. In option 1, the trainer receives as many samples as there are turns in a trajectory, and because each turn's prompt already contains all earlier turns, the shared prefix is processed again in every sample. In option 2, the trainer always receives one sample per task instance, regardless of the number of turns. A typical SWE-Bench-like trajectory consists of 30-50 turns, so to ingest the same information, option 2 spends roughly an order of magnitude less compute than option 1. This reduction makes option 2 especially appealing for scaling up agentic RL training.
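A rough back-of-the-envelope comparison of the two packaging options is sketched below. It counts only the token positions the trainer runs a forward pass over, ignores attention's superlinear cost, and assumes the trainer does no prefix caching. The turn count and per-turn token numbers are made up for illustration.

```python
from typing import List, Tuple


def tokens_processed(turns: List[Tuple[int, int]]) -> Tuple[int, int]:
    """turns[i] = (new_context_tokens, response_tokens) added in turn i.

    Returns (per_turn_total, per_task_total): the number of token positions
    the trainer processes under each packaging option.
    """
    per_turn_total = 0
    context_len = 0
    for new_context, response in turns:
        context_len += new_context + response  # full sequence at the end of this turn
        per_turn_total += context_len          # option 1: one sample per turn, prefix included
    per_task_total = context_len               # option 2: the final sequence, processed once
    return per_turn_total, per_task_total


# Assumed numbers: 40 turns, each adding 200 observation tokens
# and 300 generated tokens.
per_turn, per_task = tokens_processed([(200, 300)] * 40)
print(per_turn, per_task, per_turn / per_task)  # 410000 20000 20.5
```

Under these assumptions, per-turn packaging processes about 20 times as many token positions, consistent with the order-of-magnitude gap described above.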
Mathematical Correctness: Maintaining On-Policyness
For a training sample to be on-policy, every sampled token should be evaluated by the trainer under the same conditional distribution that produced it during rollout. In transformers, that conditional distribution depends entirely on the token's preceding context. If TITO is violated, there could be a token $x_t$ such that
In the trainer, the model evaluates $x_t$ based on the preceding sequence $\mathbf{x}$.
In the inference engine, the model samples $x_t$ based on a slightly different preceding sequence $\tilde{\mathbf{x}}$.
Even if the trainer and the inference engine share identical weights, the conditional probability $\pi(x_t|\mathbf{x})$ can diverge dramatically from $\pi(x_t|\tilde{\mathbf{x}})$. Such a discrepancy can eventually lead to erratic updates, jeopardizing the stability of RL training.
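One way to see the effect is through a per-token probability ratio of the kind used in PPO-style objectives. This is an illustrative assumption; the discussion above does not say which objective the trainer uses. The numerator is the probability the trainer assigns to $x_t$, and the denominator is the probability under which the inference engine sampled it:

```latex
\rho_t \;=\; \frac{\pi(x_t|\mathbf{x})}{\pi(x_t|\tilde{\mathbf{x}})},
\qquad
\mathbf{x} = \tilde{\mathbf{x}} \;\Longrightarrow\; \rho_t = 1 .
```

With identical weights and identical contexts, the ratio is exactly 1 (setting aside numerical differences between engines). When $\mathbf{x} \neq \tilde{\mathbf{x}}$, $\rho_t$ can move away from 1 before any policy update has taken place, so the trainer treats a sample as off-policy purely because of the context mismatch.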
How TITO might break
Despite its conceptual simplicity, the TITO principle is fragile. In what follows, we provide three common scenarios, among many others, where the principle could be violated.
Scenario 1: Detokenize-retokenize mismatch
In multi-turn RL rollouts, one might detokenize the model's generated tokens into a string for storage and later retokenize that string when building the prompt for turn $n$. This can break the TITO principle because model-generated tokens do not necessarily survive a detokenize-retokenize roundtrip.
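The roundtrip failure can be reproduced with a toy tokenizer. The sketch below uses a three-entry vocabulary and greedy longest-match encoding as a stand-in for a real BPE tokenizer. The vocabulary and both functions are hypothetical and exist only to show the effect.

```python
from typing import Dict, List

# Toy vocabulary (hypothetical): the text "ab" can be spelled as the single
# token "ab" or as the two tokens "a" + "b".
VOCAB: Dict[str, int] = {"a": 0, "b": 1, "ab": 2}
ID_TO_PIECE = {i: p for p, i in VOCAB.items()}


def decode(ids: List[int]) -> str:
    """Concatenate the text pieces of the given token IDs."""
    return "".join(ID_TO_PIECE[i] for i in ids)


def encode(text: str) -> List[int]:
    """Greedy longest-match: returns one canonical split per string."""
    ids, pos = [], 0
    max_len = max(len(p) for p in VOCAB)
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += size
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


generated = [0, 1]  # the model sampled "a", then "b"
roundtrip = encode(decode(generated))
print(generated, roundtrip)  # [0, 1] [2]
```

Both sequences decode to the same string, yet the token IDs differ, so a prompt rebuilt from the stored string no longer has the generated tokens as a prefix.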
The root cause lies in the asymmetry between how a tokenizer encodes text and how a model generates tokens:
encode (text → tokens) is one-to-one: For a given input string, the tokenizer always picks one standard split...