Nvidia-ZPPO: Zone of Proximal Policy Optimization



NVIDIA

Teacher in Prompts, Not Gradients

Byung-Kwan Lee†

Ximing Lu

Shizhe Diao

Minki Kang

Saurav Muralidharan

Karan Sapra

Andrew Tao

Pavlo Molchanov

Yejin Choi

Yu-Chiang Frank Wang

Ryo Hachiuma

†Project Lead

ArXiv 2606.18216

Code: internal use only (public release coming soon)

Models: internal use only (public release coming soon)

TL;DR

Method                     10 LLM Benchmarks   16 VLM Benchmarks   5 Video Benchmarks
Off-Policy Distill†        0.0                 0.0                 0.0
On-Policy Distill†         0.0                 0.0                 0.0
GRPO†                      0.0                 0.0                 0.0
GRPO† + Teacher response   0.0                 0.0                 0.0
ZPPO (Ours)                0.0                 0.0                 0.0

†: with prompt replay buffer. All experiments are run on Qwen3.5.

Problem


1 / 3: Off-Policy Distill† and On-Policy Distill†

Distillation forces the student to imitate the teacher's logits, which induces memorization of the training samples and degrades generalization to unseen samples (overfitting to both the dataset and the teacher).

2 / 3: GRPO†

RL gives the model the freedom to keep responding to a question until it solves it, which encourages exploratory reasoning through self-reflection such as "Wait, that step looks wrong. Let me re-check." Because the model is not forced to imitate any response, generalization is preserved. However, RL cannot learn to solve hard questions whose rollout accuracy is near zero: such questions are discarded forever.
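To see why near-zero rollout accuracy starves RL of signal, consider a minimal sketch of group-relative advantages in the GRPO style. It assumes binary rewards, population standard deviation, and a small `eps` stabilizer; actual GRPO variants differ in these details, and the function name is illustrative only.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own rollout group (GRPO-style).

    Assumption: population std plus a small eps; real implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A hard question: all 8 rollouts fail, so every advantage is exactly zero
# and the question contributes no gradient.
hard = group_relative_advantages([0.0] * 8)

# A mixed question: one success out of 8 gives a nonzero learning signal.
mixed = group_relative_advantages([1.0] + [0.0] * 7)
```

When every rollout in a group receives the same reward, the numerator vanishes for all of them, which is one way to read "rollout accuracy near zero" as "nothing to learn from".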

3 / 3: GRPO† + Teacher response

To solve hard questions, some RL methods naively inject the teacher's response into the student's training as if it were the student's own response. This breaks the on-policy assumption and degrades generalization again.


Insight


Research Question

For hard questions, how can we transfer the teacher's knowledge to the student without imitating the teacher's logits or injecting the teacher's response directly into the student's gradient? How can we make the student solve hard questions without policy drift (degraded generalization)?

Inspiration

Vygotsky, L. S. (1978). Mind in Society. Harvard University Press. 200,000+ citations

Vygotsky proposes a set of learning zones. Among them, the Zone of Proximal Development covers tasks that a learner cannot yet solve alone but can solve with educational help.

Lev Vygotsky, Professor of Psychology

Can't do: too hard, even with help.

Can do with help: Zone of Proximal Development, reached with educational help (◎ target).

Can do alone: easy enough without help.

Through Our Lens

The same three zones, reinterpreted as rollout accuracy.

Byung-Kwan Lee et al.

Can't do: rollout accuracy is zero no matter how we help.

Can do with help: Zone of Proximal Policy Optimization. Rollout accuracy is zero or low, but it rises with help (◎ target).

Can do alone: rollout accuracy is already high.
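The three zones can be read as a simple classification over rollout accuracy measured without and with help. The sketch below is illustrative only: the names are hypothetical, the 0.5 cut-off for "already high" is borrowed from the replay-buffer threshold as an assumption, and a low accuracy that does not rise with help is folded into "can't do".

```python
from enum import Enum

class Zone(Enum):
    CANT_DO = "can't do"
    WITH_HELP = "can do with help"
    ALONE = "can do alone"

def rollout_accuracy(outcomes):
    """Fraction of rollouts judged correct (outcomes is a list of booleans)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def classify_zone(plain_outcomes, helped_outcomes, hard_threshold=0.5):
    """Assign a zone from accuracy without help and accuracy with help.

    Assumption: hard_threshold=0.5 stands in for "already high".
    """
    acc_plain = rollout_accuracy(plain_outcomes)
    acc_helped = rollout_accuracy(helped_outcomes)
    if acc_plain >= hard_threshold:
        return Zone.ALONE        # solved often enough without help
    if acc_helped > acc_plain:
        return Zone.WITH_HELP    # accuracy rises once help is provided
    return Zone.CANT_DO          # help does not move the accuracy

zone = classify_zone([False] * 8, [True, False, False, False])
```

Here `zone` is `Zone.WITH_HELP`: the student never succeeds on the plain question but succeeds once in four attempts when helped.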

Solution

Given a hard question (rollout accuracy near zero), we apply Question Reformulation, which causes no policy drift.

If Teacher Can Solve

BCQ: Binary Candidate-included Question

Question: ⟨ original question ⟩

Here are two responses to the question above. One is correct and another is wrong. Use these as references to help you solve the problem.

(The order of the two responses below is shuffled.)

{Correct Teacher Response}

{Wrong Student Response}

If Teacher Can't Solve

NCQ: Negative Candidate-included Question

Question: ⟨ original question ⟩

Below are the incorrect reasoning processes.

{Wrong Student Response}

{Wrong Student Response}

{Wrong Student Response}
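An NCQ prompt can be sketched the same way: the question, the instruction sentence from the template, and several of the student's own failed attempts. The function name, the "Incorrect attempt" labels, and the default of three attempts (matching the three slots shown in the template) are assumptions for illustration.

```python
NCQ_INSTRUCTION = "Below are the incorrect reasoning processes."

def build_ncq(question, wrong_responses, max_candidates=3):
    """Build a Negative Candidate-included Question prompt (hypothetical layout)."""
    if not wrong_responses:
        raise ValueError("NCQ needs at least one wrong student response")
    chosen = wrong_responses[:max_candidates]  # assumption: keep the first few
    blocks = [
        f"Incorrect attempt {i}:\n{text}" for i, text in enumerate(chosen, 1)
    ]
    return "\n\n".join([f"Question:\n{question}", NCQ_INSTRUCTION, *blocks])

ncq_prompt = build_ncq(
    "What is 17 * 3?",
    ["17 * 3 = 41", "17 * 3 = 54", "17 * 3 = 48", "17 * 3 = 50"],
)
```

No correct answer is included, so this form is usable even when the teacher also fails on the question.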

Effect

BCQ: the student is asked to solve the problem afresh while consulting the two candidates, reasoning about which one is correct.

NCQ: for the first time, the student confronts all of its own failed attempts at once and is cued to recognize the shared error patterns and avoid them.

Method

Technically, we use a Replay Buffer to store hard questions, so the model revisits each hard question many times rather than only once, as in GRPO. Repeated exposure strengthens the BCQ/NCQ effect on each hard question, which we expect to lift its rollout accuracy.

A Batch is formed from new questions (Database) and replayed questions (Replay Buffer, which stores hard questions).

A question is sampled from the batch, and both the Student and the Teacher run rollouts on it.

If the question's student rollout accuracy is below 50%, it is admitted into the Replay Buffer.

This question has a teacher-correct rollout, so we build a BCQ, as in (b).

This question has many student-wrong rollouts, so we also build an NCQ, as in (c).

The Batch includes new questions, replayed questions, BCQ, and NCQ, and the Student is RL-trained on them.
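The per-question decisions in these steps can be sketched as a small planning function. Everything here is hypothetical scaffolding: the `Rollout` type, the function name, and in particular `min_wrong_for_ncq=2` as a stand-in for "many student-wrong rollouts", which the page does not quantify. Restricting BCQ and NCQ to hard questions is likewise an assumption drawn from the Solution section.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    text: str
    correct: bool

def plan_question(student, teacher, hard_threshold=0.5, min_wrong_for_ncq=2):
    """Decide buffer admission and which reformulations to build.

    Assumptions: the 50% admission threshold comes from the page;
    min_wrong_for_ncq is an illustrative guess.
    """
    accuracy = sum(r.correct for r in student) / len(student)
    wrong = [r for r in student if not r.correct]
    teacher_ok = [r for r in teacher if r.correct]
    hard = accuracy < hard_threshold
    return {
        "accuracy": accuracy,
        "admit_to_replay": hard,
        # BCQ needs one correct teacher rollout and one wrong student rollout.
        "build_bcq": hard and bool(teacher_ok) and bool(wrong),
        # NCQ needs several wrong student rollouts to expose shared errors.
        "build_ncq": hard and len(wrong) >= min_wrong_for_ncq,
    }

student = [Rollout("ok", True)] + [Rollout("bad", False)] * 7
teacher = [Rollout("teacher ok", True)]
plan = plan_question(student, teacher)
```

With one success in eight student rollouts and a correct teacher rollout, the plan admits the question to the buffer and requests both a BCQ and an NCQ.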

Results

A question is admitted to the Replay Buffer while its rollout accuracy stays below 50%, and it graduates (leaves the buffer) once that accuracy reaches 50%. ZPPO graduates far more hard questions than GRPO, and the gap is widest where the initial accuracy starts near zero...
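The admission and graduation rule can be illustrated with a minimal buffer keyed by question id. The class and method names are hypothetical, and letting a graduated question re-enter if its accuracy later drops is an assumption the page does not address.

```python
class ReplayBuffer:
    """Tracks hard questions by their latest student rollout accuracy."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # 50% threshold, as stated on the page
        self.hard = {}              # question id -> latest rollout accuracy
        self.graduated = set()      # ids that left the buffer

    def update(self, qid, accuracy):
        """Record a new accuracy measurement and return the question's status."""
        if accuracy < self.threshold:
            self.hard[qid] = accuracy   # admitted, or stays in the buffer
            return "in_buffer"
        if qid in self.hard:
            del self.hard[qid]          # reached the threshold: graduates
            self.graduated.add(qid)
            return "graduated"
        return "never_hard"             # was never below the threshold

buf = ReplayBuffer()
history = [buf.update("q1", acc) for acc in (0.0, 0.25, 0.5)]
```

Counting the entries of `graduated` for different training methods would give the kind of comparison the results describe.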
