Systematic Reward Hacking and Prime Sprints

thomasm6m61 pts0 comments

Systematic Reward Hacking and Prime Sprints<br>DocsBlogCareers24Book a call<br>Start training

Login

Systematic Reward Hacking and Prime Sprints

Detecting and mitigating reward hacking is one of the key challenges faced when scaling RL, particularly in semi-verifiable domains. However, we lack systematic methods to understand when and why hacks emerge.

Traditional wisdom describes reward hacking as a specification problem, where reward functions are simply too vague or not robust enough, and models inevitably learn to find exploits. While partially true, this offers little in the way of remediation other than “just make your rewards better”.

From our experiences deploying RL across many domains, as well as the experiments in this blog, we propose a complementary view: reward hacking is a dynamics problem. We design a suite of backdoor-ifeval environments with IFEval-style tasks and “hidden” keyword rewards, which we use to study hacking systematically. We observe that hacking is a dynamics problem — visible and hidden rewards compete, and hack emergence is often predictable in terms of baseline distributions.

We share several of our findings here:

Baseline frequency can predict hack emergence, but there’s no safe rarity threshold to reward hacking.

RL amplifies even patterns at near 0% baseline frequency. Rare hacks aren’t impossible, just slow.

Additional specification around the intended behavior can sometimes encourage reward hacking.

Tasks in the “goldilocks zone” of difficulty are most robust to reward hacking, as hidden objectives face stronger competition from primary gradients.

If explicit tasks becomes too hard, reward hacking becomes the primary way to improve the gradient.

Prompt injecting with instructions not to reward hack can have the opposite effect.

Reward hacking is reproducible at 1B scale with less than $1 in compute, in less than 30 minutes.

We’re releasing the environment behind these findings and launching Sprints - free credits for anyone who wants to run their own reward hacking experiments. More below!

Motivation

There is a core distinction in any RL system between what we want the model to do and what we reward it for doing. The first is a description of intent in human terms. The second is an operational signal that is necessarily simpler than the intent; it has to be measurable. The two are correlated but never identical, and the gap between them is where reward hacking lives.

Reward hacking is a failure mode where an RL trained model exploits gaps between its reward signal and the behavior the reward was meant to elicit, driving the proxy up while the underlying capability stays flat or degrades. Reward hacking has been documented broadly, but much of what we know comes from ex post reports where researchers notice something is off in a deployed model and then reverse engineer the reward bias that caused it.

Moreover, most existing reward hacking research is at the frontier scale with large models, tons of compute, and long experimental cycles. That leaves a gap: the community doesn’t have small scale testbeds for researchers to iterate quickly, run dozens of variants in a day, and build intuition empirically. Reward hacking should be a phenomenon people can actually tinker with, not just read about. We believe small models are the right and under explored venue for reward hacking research, especially for the kind that benefits from many contributors iterating in parallel. The dynamics we will explore here (competing gradients, advantage variance, threshold effects, and prompt level interactions) show up cleanly at 1B parameters, and the iteration cost is low enough that a meaningful experimental sweep takes just a few hours and a few dollars (or less)! Below is an example run we did with 100 steps that we’ll dive into later in this piece.

The goal of the experiments here is to take reward hacking from an observed phenomena to a controllable one. Instead of waiting for hacking to occur in real training, we plant a deliberate, semantically arbitrary hack into a controlled environment and then systematically vary the conditions around it to identify which factors amplify or suppress the hack.

We use a simple keyword-based hack to study reward hacking at a base level without the noise of things like complex judge behaviors. A keyword-presence hack is binary, deterministic, and un-hackable in the sense that there's no judgment call about whether hacking happened: the word is either in the response or it isn't.

Experiment Design

We plant words such as “silver” as the reward hack word. The prompts ask the model to write prose with specific constraints in the style of IF Eval. An example (actually used in the experiment) is below. Because silver, in this case, has no relationship to the actual task, we can know with certainty whether and when a model is hacked through tracking the hidden reward, which here is defined as the presence of “silver.” Visible reward, in...

reward hacking hack from model systematic

Related Articles