PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play



Authors: Roger Creus Castanyer, Geoffrey Bradway, Lorenz Wolf, Maxwill Lin, Augustine N. Mavor-Parker, Matthew James Sargent

Description: We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs.

External Link: https://arxiv.org/abs/2605.16727v1
Date: May 20, 2026
Affiliations: Vmax

Reinforcement learning with verifiable rewards (RLVR) gives large language models (LLMs; hereafter, models) a way to develop sophisticated reasoning behaviors that pre-training alone does not reliably produce: models repeatedly attempt tasks whose solutions can be checked automatically, and they are reinforced when those attempts succeed. When the correctness of model-generated solutions is verifiable, the reward is unusually clean: the model writes code that passes a unit test, finds an input that matches a target output, solves a math task with a checkable answer, or succeeds under any deterministic verifier.

RLVR needs a steady supply of verifiable tasks at scale: tasks whose solutions can be checked, whose difficulty stays near the model's frontier, and whose coverage is broad enough to keep training useful. Today, most systems still rely on fixed, hand-curated task distributions chosen before training begins. Those distributions can become too easy, too narrow, or too slow to adapt.

Synthetic RLVR tasks can be produced with hand-written generators, and this is already a common way to scale verifiable training data. But a fixed generator still defines much of the curriculum in advance. Self-play offers a more adaptive route: models can generate tasks, attempt them, and receive verifier feedback as training unfolds. We build on this line of work, asking whether task generation can become an online curriculum that adapts as the models learn.

PopuLoRA is our first step in that direction. It trains co-evolving populations of teacher and student LLM adapters. Teachers generate verifiable tasks, students try to solve them, and the verifier supplies the reward. As students improve, teachers have to search for harder and broader tasks; as teachers diversify, students see a curriculum that keeps moving with them.

Self-Play and Its Failure Mode

A viable way to adaptively generate data is single-agent self-play: one model proposes tasks for itself and then tries to solve them. In the code-reasoning setting we study, the model generates three kinds of tasks: code_o, where it predicts the output of a program; code_i, where it finds an input that produces a target output; and code_f, where it completes a missing function from input-output examples. A sandboxed Python executor accepts only programs that parse, execute, and behave deterministically.
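To make the three task types concrete, here is a minimal sketch of how a verifier could check each one. The function names (`run_program`, `is_valid_task`, `check_code_o`, `check_code_i`, `check_code_f`), the single-argument entry point `f`, and the run-twice determinism test are all assumptions for illustration, not the authors' implementation. In particular, plain `exec` is not a sandbox; a real system would isolate execution.

```python
import ast


def run_program(source, arg, entry="f"):
    """Execute `source` and call its `entry` function on `arg`.

    NOTE: exec() is NOT a sandbox. It stands in for an isolated executor.
    """
    namespace = {}
    exec(compile(source, "<task>", "exec"), namespace)
    return namespace[entry](arg)


def is_valid_task(source, arg, entry="f", trials=2):
    """Accept only programs that parse, execute, and repeat the same output."""
    try:
        ast.parse(source)
        outputs = [repr(run_program(source, arg, entry)) for _ in range(trials)]
    except Exception:
        return False
    # Assumed determinism check: every trial must produce the same output.
    return len(set(outputs)) == 1


def check_code_o(source, arg, predicted_output):
    """code_o: the solver predicts the program's output for a given input."""
    return run_program(source, arg) == predicted_output


def check_code_i(source, proposed_input, target_output):
    """code_i: the solver proposes an input that produces the target output."""
    try:
        return run_program(source, proposed_input) == target_output
    except Exception:
        return False


def check_code_f(candidate_source, examples):
    """code_f: the solver writes a function that matches input-output examples."""
    try:
        return all(run_program(candidate_source, x) == y for x, y in examples)
    except Exception:
        return False
```

Note that `check_code_i` accepts any input that reproduces the target output, not only the one the task generator had in mind, which matches the description of code_i as finding an input rather than recovering a specific one.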

In practice, we find that single-agent self-play self-calibrates: task generation converges toward valid tasks that its own solver can already handle, solve rate climbs toward 100%, and the curriculum collapses onto increasingly simple programs. The reward curve looks healthy, but the training distribution has stopped pushing the model.

The collapse is visible in the generated programs. In the single-agent baseline, AST depth, cyclomatic complexity, lines of code, and variable count all trend downward. PopuLoRA moves in the opposite direction: the generated tasks become longer, deeper, and more structurally varied over training.
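The four structural metrics named above can be approximated with Python's standard `ast` module. The sketch below is one plausible way to compute them, not the measurement code used in the paper: the cyclomatic count is a rough approximation (one plus the number of branching constructs), and `variable_count` counts only names that are assigned to, so function parameters are excluded.

```python
import ast

# Constructs treated as decision points in this approximation (an assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp, ast.comprehension)


def ast_depth(node):
    """Length of the longest root-to-leaf path in the syntax tree."""
    children = list(ast.iter_child_nodes(node))
    return 1 + max((ast_depth(child) for child in children), default=0)


def complexity_metrics(source):
    """Return the four structural metrics for one generated program."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    assigned_names = {
        n.id for n in nodes
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    return {
        "ast_depth": ast_depth(tree),
        "cyclomatic": 1 + sum(isinstance(n, BRANCH_NODES) for n in nodes),
        "lines_of_code": sum(1 for line in source.splitlines() if line.strip()),
        "variable_count": len(assigned_names),
    }
```

Tracking the mean of each metric over the programs generated at every training step would show the trend described above: downward for a collapsing curriculum, upward for one that keeps growing.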

PopuLoRA

PopuLoRA separates task generation from task solving. Instead of one self-playing model, we train co-evolving populations of specialized task generators, called teachers (T), and solvers, called students (S). At each step, a teacher is matched with a student: the teacher proposes tasks, and the paired student attempts them under the verifier. A teacher is rewarded for valid tasks that the matched student fails to solve, so difficulty is measured against a different model rather than against the teacher's own ability.
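The pairing and reward logic described above can be sketched in a few lines. Everything specific here is an assumption for illustration: the text does not say how pairs are chosen (uniform random matching is used below), and the exact reward shaping is not given (the teacher reward below is the matched student's failure rate on a valid task, and zero for an invalid one).

```python
import random
from typing import Sequence


def match_pairs(teachers, students, rng):
    """Pair each teacher with one student. Random matching is an assumption."""
    shuffled = list(students)
    rng.shuffle(shuffled)
    return list(zip(teachers, shuffled))


def teacher_reward(task_is_valid: bool, student_successes: Sequence[bool]) -> float:
    """Reward a valid task by how often the matched student fails it."""
    if not task_is_valid or not student_successes:
        return 0.0
    return 1.0 - sum(student_successes) / len(student_successes)


def student_reward(solved: bool) -> float:
    """Binary verifier reward for the student (an assumed shape)."""
    return 1.0 if solved else 0.0


# Hypothetical 4T+4S population, identified by adapter name only.
teachers = ["T0", "T1", "T2", "T3"]
students = ["S0", "S1", "S2", "S3"]
pairs = match_pairs(teachers, students, random.Random(0))
```

The key property is that `teacher_reward` depends only on the matched student's attempts, so a teacher cannot raise its reward by proposing tasks tuned to its own solving ability.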

This turns difficulty into an inter-population signal. Teachers and students are different models, and both sub-populations continue changing throughout training. The result is an autocurriculum driven by cross-evaluation rather than by a single model's local notion of task difficulty.

The populations are implemented as LoRA adapters on a shared frozen base model. Each member trains only a small low-rank update rather than a full model copy, while the dominant base-model computation is shared across teachers and students. This makes population training feasible on a single machine: memory scales with the sum of adapter weights, multi-LoRA inference routes batched requests to the correct adapter without swapping the base model in and out of memory, and the 4T+4S setting trains eight adapters with only a 1.31x wall-clock overhead.
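A short back-of-the-envelope calculation shows why adapter populations are cheap in memory. A rank-r LoRA update on a weight matrix of shape d_out x d_in adds r * (d_in + d_out) trainable parameters. The base-model size, layer shapes, and rank below are hypothetical placeholders, not the configuration used in the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA update B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def population_params(base_params, adapted_shapes, rank, n_adapters):
    """Compare a shared-base adapter population with full model copies."""
    per_adapter = sum(lora_params(d_in, d_out, rank)
                      for d_in, d_out in adapted_shapes)
    shared_base_total = base_params + n_adapters * per_adapter
    full_copies_total = n_adapters * base_params
    return per_adapter, shared_base_total, full_copies_total


# Hypothetical numbers: a 1B-parameter base, four adapted 4096 x 4096
# matrices, rank 16, and eight adapters (as in a 4T+4S population).
per_adapter, shared_total, full_total = population_params(
    base_params=1_000_000_000,
    adapted_shapes=[(4096, 4096)] * 4,
    rank=16,
    n_adapters=8,
)
```

Under these placeholder numbers, each adapter adds 524,288 parameters, so eight adapters on a shared base total about 1.004 billion parameters, versus 8 billion for eight full copies.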

The Training Loop

Each PopuLoRA step has five phases. First, teachers and students are matched...
