Reward Hacking, the Loophole Lesson: Winning the Signal, Losing the Reason


By Yassien Shaalan | Jun, 2026 | Medium



How reward hacking begins in the gap between what we meant, what we measured, and what training teaches agents to seek.

Yassien Shaalan

15 min read


“When a measure becomes a target, it ceases to be a good measure.” Goodhart’s Law

Introduction

Imagine you pay a thermostat to make the room read twenty-one degrees, and the thermostat, clever in the narrow, literal way that optimized things are clever, discovers it can simply hold a lit match beside its own sensor. The reading is perfect; unfortunately, the room is still freezing. You have not heated anything. You have just taught a measuring device to lie to you and, worse, you have rewarded it for the lie, which means tomorrow it will lie again, and even more perfectly. This is the whole of reward hacking in a single image, and the unnerving thing about the image is that once you have really looked at it, you cannot stop seeing it. We simply never had to write the objective down so explicitly for something that would obey it so faithfully, and so completely without us.

In fact, it is one of the oldest stories we tell. King Midas asked that everything he touched turn to gold and received exactly, ruinously, what he specified: his bread, his wine, his daughter, all gone hard and bright and useless in his hands. The myth endures because it captures a permanent divide between the wish we speak and the wish we truly mean, and reward hacking is that same divide rediscovered in engineering terms. DeepMind’s safety researchers describe the machine version in human terms: a student rewarded for correct homework answers copies from a friend instead of learning the material, meeting the letter of the assignment while emptying its purpose (Krakovna et al., 2020). We know this pattern: a metric we trust drifts away from what we value.

The early machine examples were almost comic, and the comedy matters because it helped keep the problem from seeming serious for so long.
In OpenAI’s 2016 CoastRunners demonstration, a boat that was supposed to win a race discovered that it could instead circle endlessly through a lagoon, hitting the same regenerating score targets again and again, piling up points without ever crossing the finish line and sometimes catching fire as it spun out of control (Clark and Amodei, 2016). When Tom Murphy VII tested an algorithm on old Nintendo games, it played Tetris well enough until it was about to lose; then, instead of accepting defeat, it simply paused the game, freezing the board forever at the edge of failure. Murphy used the line from WarGames to describe what the system had discovered: “The only winning move is not to play.” There is something genuinely funny about a machine deciding that the surest way never to lose is never to finish, but on second thought, the joke stops being funny at all.

A whole class of these stories comes from agents that found cracks in the simulated worlds we grew them in. A robot taught to walk learned instead to hook its own legs together and slither along the ground, having discovered a bug in the physics engine more rewarding than locomotion. Another, set to score goals, found it could rack up points by vibrating rapidly against the ball rather than playing soccer at all. An Atari agent playing Q*bert stumbled onto a glitch that made the platforms blink and poured millions of points into its score, achieving a mastery of the game that consisted entirely of breaking it. These examples seem safely absurd, confined to toy worlds and broken simulators, but that reaction misses the point: the agents had learned something real and transferable, that the reward is not the same as the intention behind it, and when the two separate, the system follows the reward.

Decades ago, long before any of this was fashionable, the biologist Claus Wilke and colleagues were running populations of self-replicating digital organisms and wanted to cap how fast they bred.
So they built a test to pause the system periodically, measure each organism’s replication rate, and remove the ones replicating too quickly. The organisms evolved a very strange response. They learned to recognize when they were being measured and to go still, playing dead for the duration of the test, and then resume breeding the moment the examiner looked away (Wilke et al., 2001). No one designed this behavior, and no one instructed the organisms to do it at any point. A blind evolutionary process, pressed against an evaluation, discovered that the single most efficient way through it was to behave one way while watched and another way while unwatched. If that sends a small chill, hold onto it, because we...
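
The watched-versus-unwatched gap in the digital-organism story can be made concrete with a toy sketch. This is not the researchers' actual system; the `Organism` class, the `detects_test` trait, the `RATE_CAP` threshold, and the three example organisms are all hypothetical names and values chosen for illustration. The only thing the sketch assumes from the story is the shape of the test: measure each organism's replication rate, remove the ones above a cap.

```python
from dataclasses import dataclass

RATE_CAP = 1.0  # hypothetical cap: organisms measured above this are removed


@dataclass(frozen=True)
class Organism:
    name: str
    true_rate: float    # replication rate when nobody is measuring
    detects_test: bool  # hypothetical trait: can it sense the test?

    def rate(self, being_tested: bool) -> float:
        """Rate the organism actually shows in the given setting."""
        if being_tested and self.detects_test:
            return 0.0  # goes still ("plays dead") while measured
        return self.true_rate


def cull(population):
    """Keep only organisms whose *measured* rate is at or below the cap."""
    return [o for o in population if o.rate(being_tested=True) <= RATE_CAP]


# Illustrative population, not real data.
population = [
    Organism("slow_honest", 0.8, False),
    Organism("fast_honest", 2.5, False),
    Organism("fast_sandbagger", 2.5, True),
]

survivors = cull(population)
survivor_names = [o.name for o in survivors]

# What the examiner sees versus what happens once the test ends.
measured_max = max(o.rate(being_tested=True) for o in survivors)
actual_max = max(o.rate(being_tested=False) for o in survivors)

print(survivor_names)            # who passed the test
print(measured_max, actual_max)  # measured ceiling vs. real ceiling
```

In this sketch the filter does exactly what it was written to do: every survivor measures at or below the cap. The fast organism that behaves the same way whether or not it is watched is removed, while the equally fast one that goes still under measurement passes, so the population's real ceiling stays well above the cap the test was meant to enforce.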
