Good teachers don’t cheat – jasonkena's blog
\[<br>\newcommand{\KL}[2]{D_{\mathrm{KL}}\!\left(#1 \,\|\, #2\right)}<br>\newcommand{\argmax}{\operatorname*{arg\,max}}<br>\]
TL;DR: Policy gradient RL, self-distillation techniques like SDFT, and Pedagogical RL can all be viewed as optimizing the same objective \(\mathbb{E}_\pi[R] - \beta\KL{\pi}{\pi_0}\), just with slightly different optimization procedures. The privileged information \(z\) that some of these methods feed in context is simply a tool to make the optimization of \(\mathbb{E}_\pi[R] - \beta\KL{\pi}{\pi_0}\) easier. The punchline is that, at optimality, the teacher’s use of \(z\) has to vanish: good teachers don’t cheat.
Background
Suppose we have a base policy \(\pi_0(y \mid x)\) and we are interested in learning the policy
\[<br>\pi^*(y \mid x) \propto \pi_0(y \mid x) \exp\!\big(R(x,y)/\beta\big).<br>\]
With binary rewards and \(\beta \to 0\), this collapses to \(\pi^*(y \mid x) \propto \pi_0(y \mid x)\,\mathbf{1}_{\{R(x,y) = 1\}}\), i.e., the base policy restricted to correct answers.1
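As a quick sanity check of this limit, here is a minimal sketch over a toy answer set. The four-answer base policy and the binary rewards are made-up numbers for illustration, and `tilted_policy` is a hypothetical helper name, not something from a library.

```python
import math

def tilted_policy(pi0, rewards, beta):
    """Return pi*(y) proportional to pi0(y) * exp(R(y) / beta) on a finite answer set."""
    # Subtract the max reward before exponentiating to avoid overflow for small beta.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(pi0, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical base policy over four answers; answers 1 and 3 are correct.
pi0 = [0.5, 0.3, 0.15, 0.05]
rewards = [0.0, 1.0, 0.0, 1.0]

for beta in (1.0, 0.1, 0.01):
    print(beta, [round(p, 4) for p in tilted_policy(pi0, rewards, beta)])

# The beta -> 0 limit: the base policy renormalized over correct answers.
correct_mass = sum(p for p, r in zip(pi0, rewards) if r == 1.0)
restricted = [p / correct_mass if r == 1.0 else 0.0 for p, r in zip(pi0, rewards)]
```

As \(\beta\) shrinks, the printed distribution should approach `restricted`: the relative weights among correct answers are inherited from \(\pi_0\), not flattened.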
We can characterize RL as maximizing an evidence lower bound (ELBO) on the log partition function \(\log Z(x)\), where \[<br>\begin{aligned}<br>Z(x) &= \mathbb{E}_{y \sim \pi_0(\cdot \mid x)} \left[ \exp\!\big(R(x,y)/\beta\big) \right]<br>\end{aligned}<br>\]
For any distribution \(q(\cdot \mid x) \ll \pi_0(\cdot \mid x)\) (i.e., \(q(y \mid x) > 0\) implies \(\pi_0(y \mid x) > 0\)), we can restrict the expectation to the support of \(q\) and importance-weight under \(q\), which gives a lower bound that is tight whenever \(q\) covers the support of \(\pi_0\):
\[<br>Z(x) \geq \mathbb{E}_{y \sim q(\cdot \mid x)} \left[ \frac{\pi_0(y \mid x)}{q(y \mid x)} \exp\!\big(R(x,y)/\beta\big) \right].<br>\]
By Jensen’s inequality and concavity of \(\log\),
\[<br>\begin{aligned}<br>\log Z(x) &\geq \mathbb{E}_{y \sim q(\cdot \mid x)} \left[ R(x,y)/\beta - \log \frac{q(y \mid x)}{\pi_0(y \mid x)} \right] \\<br>\beta \log Z(x) &\geq \mathbb{E}_{y \sim q(\cdot \mid x)} \left[ R(x,y) \right] - \beta\, \KL{q(\cdot \mid x)}{\pi_0(\cdot \mid x)},<br>\end{aligned}<br>\]
where equality holds iff \(\frac{\pi_0(y \mid x)}{q(y \mid x)} \exp(R(x,y)/\beta)\) is constant over all \(y\) with \(q(y \mid x) > 0\); namely, when \(q(y \mid x) = \pi^*(y \mid x)\). It is well known that
\[<br>\pi^*(y \mid x) = \argmax_q\; \mathbb{E}_{y \sim q(\cdot \mid x)} \left[ R(x,y) \right] - \beta\, \KL{q(\cdot \mid x)}{\pi_0(\cdot \mid x)},<br>\]
and this is the \(\mathbb{E}_\pi[R] - \beta\KL{\pi}{\pi_0}\) objective that policy gradient methods (GRPO, etc.) typically solve.
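The bound and its equality condition can be checked numerically on a toy problem. The sketch below assumes a finite answer set with made-up probabilities and rewards; `objective` is a hypothetical helper that evaluates \(\mathbb{E}_q[R] - \beta\KL{q}{\pi_0}\).

```python
import math

def objective(q, pi0, rewards, beta):
    """E_q[R] - beta * KL(q || pi0) on a finite answer set."""
    expected_reward = sum(qi * r for qi, r in zip(q, rewards))
    # Terms with q(y) = 0 contribute nothing to the KL divergence.
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, pi0) if qi > 0)
    return expected_reward - beta * kl

# Hypothetical base policy and binary rewards.
pi0 = [0.5, 0.3, 0.15, 0.05]
rewards = [0.0, 1.0, 0.0, 1.0]
beta = 0.5

# Partition function Z and the tilted policy pi* = pi0 * exp(R / beta) / Z.
Z = sum(p * math.exp(r / beta) for p, r in zip(pi0, rewards))
pi_star = [p * math.exp(r / beta) / Z for p, r in zip(pi0, rewards)]
bound = beta * math.log(Z)

gap_base = bound - objective(pi0, pi0, rewards, beta)      # strictly positive
gap_star = bound - objective(pi_star, pi0, rewards, beta)  # zero up to rounding
print(gap_base, gap_star)
```

The gap between \(\beta \log Z(x)\) and the objective equals \(\beta\,\KL{q}{\pi^*}\), which is why it vanishes only at \(q = \pi^*\).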
As a caveat, some policy gradient methods actually train with \(\beta = 0\) on RLVR problems. However, RL’s Razor argues that on-policy RL methods implicitly keep the KL divergence from the base policy small, so treating them as optimizing the KL-regularized objective is a reasonable approximation.
Privileged information, and why good teachers don’t cheat
Now suppose we have privileged information \(z \sim \rho(\cdot \mid x)\) (for example, \(z \in \mathcal{Y}\) is an expert demonstration of a problem \(x\)) that makes it easier to obtain high rewards. In the simplest case \(z = f(x)\) for some deterministic function \(f\) (e.g., a lookup table over stored answers).
Fix \(z \sim \rho(\cdot \mid x)\) and parameterize a distribution \(g(\cdot \mid x, z)\) with \(g(\cdot \mid x, z) \ll \pi_0(\cdot \mid x)\). Applying the ELBO from above with \(q = g(\cdot \mid x, z)\), we immediately get: \[<br>\begin{aligned}<br>\beta \log Z(x) &\geq \mathbb{E}_{y \sim g(\cdot \mid x, z)} \left[ R(x,y) \right] - \beta\, \KL{g(\cdot \mid x, z)}{\pi_0(\cdot \mid x)}.<br>\end{aligned}<br>\]
That is, for any \(z \sim \rho(\cdot \mid x)\), the optimal \(g(\cdot \mid x, z)\), the one that attains the bound with equality, is
\[<br>g(y \mid x, z) = \pi^*(y \mid x),<br>\]
which does not depend on \(z\)!
The KL term is what makes this work, even when \(\beta\) is vanishingly small. Suppose \(z\) is an expert demonstration containing the final answer, say \(42\). Then \(g(y \mid x, z) = \mathbf{1}_{\{y = 42\}}\) achieves the optimal reward, but its KL divergence from \(\pi_0\) is \(-\log \pi_0(42 \mid x)\), which is extremely high whenever the base policy rarely outputs the bare answer. A teacher that simply repeats the final answer is therefore difficult to distill into the student.
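To make the comparison concrete, the sketch below scores a "cheating" teacher that copies the answer against \(\pi^*\) on a toy output set. The four outputs, their base probabilities, and \(\beta = 0.1\) are assumptions chosen for illustration; in particular, the base policy is assumed to emit the bare answer with probability \(0.001\).

```python
import math

def kl(q, p):
    """KL(q || p) on a finite set; terms with q(y) = 0 contribute nothing."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def objective(q, pi0, rewards, beta):
    """E_q[R] - beta * KL(q || pi0)."""
    return sum(qi * r for qi, r in zip(q, rewards)) - beta * kl(q, pi0)

# Hypothetical outputs: [reasoning ending in 42, bare "42", wrong A, wrong B].
pi0 = [0.30, 0.001, 0.50, 0.199]
rewards = [1.0, 1.0, 0.0, 0.0]
beta = 0.1

# Cheating teacher: copies the answer from z, i.e. a point mass on the bare "42".
g_copy = [0.0, 1.0, 0.0, 0.0]

# Optimal teacher: the tilted policy pi*, which ignores z entirely.
Z = sum(p * math.exp(r / beta) for p, r in zip(pi0, rewards))
pi_star = [p * math.exp(r / beta) / Z for p, r in zip(pi0, rewards)]

kl_copy = kl(g_copy, pi0)                      # equals -log pi0(bare "42")
j_copy = objective(g_copy, pi0, rewards, beta)
j_star = objective(pi_star, pi0, rewards, beta)
print(kl_copy, j_copy, j_star)
```

Both teachers earn essentially full reward in this toy setup, so the difference in objective comes from the KL term: the copying teacher concentrates on an output the base policy almost never produces.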
RL, self-distillation, and Pedagogical RL are all equivalent
This gives us two ways to optimize the same \(\mathbb{E}_\pi[R] - \beta\KL{\pi}{\pi_0}\) objective:
1. Start from \(\pi(y \mid x) = \pi_0(y \mid x)\). Rewards are sparse, but \(\KL{\pi_0}{\pi_0} = 0\). GRPO and similar methods follow this paradigm.
2. Start from \(\pi_0(y \mid x, z)\). Rewards are dense, but the KL divergence from \(\pi_0(y \mid x)\) is large. Pedagogical RL and on-policy distillation follow this paradigm.
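Importantly, KL divergence is significantly easier to optimize than sparse rewards: a sparse reward gives one scalar per sampled sequence, whereas a KL term gives a dense, per-token signal whose gradients can be derived from the full per-token logits (see this).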
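To illustrate what "per-token" means here, the following sketch computes a KL value at every position from full vocabulary logits. The logits are made up, the helper names are hypothetical, and the direction \(\KL{\text{student}}{\text{teacher}}\) is an assumption made for the example only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_norm for l in logits]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position, using the full distributions."""
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp, t_logp = log_softmax(s_row), log_softmax(t_row)
        kls.append(sum(math.exp(s) * (s - t) for s, t in zip(s_logp, t_logp)))
    return kls

# Hypothetical logits: 3 positions, vocabulary of 4 tokens.
student = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, -2.0, 0.3]]
teacher = [[2.0, 0.5, 0.1, -1.0], [3.0, 0.0, 0.0, 0.0], [0.5, 1.5, -2.0, 0.3]]

kls = per_token_kl(student, teacher)
print(kls)  # one value per position, versus a single scalar reward per sequence
```

Every position contributes its own term, and each term depends on all vocabulary entries rather than only on the sampled token.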
The above motivates the following two-stage procedure:
1. Train the teacher \(g(y \mid x, z)\) to maximize \(\mathbb{E}_g[R] - \beta\KL{g}{\pi_0}\). At optimality, \(g(y \mid x, z)\) loses its dependence on \(z\).
2. Distill the teacher \(g(y \mid x, z)\) into the student \(\pi(y \mid x)\) via KL minimization, since \(z\) is not available at test time. The direction of the...