AI Will Cheat to Win: Reward Hacking from 1994 to 2025

joshgracie1 pts0 comments

AI Will Cheat to Win: Reward Hacking from 1994 to 2025

Sign in<br>Subscribe

In February 2025, Palisade Research set up hundreds of chess matches between seven large language models and Stockfish, a top-tier open-source chess engine [1]. The models had general computer access, the same kind of shell environment increasingly standard for AI agents in production. The task was simple: play chess as Black, try to win.<br>OpenAI's o1-preview didn't learn to play better chess. It replaced Stockfish with a dummy version that would automatically forfeit. It overwrote the game board file to delete its opponent's pieces. It ran its own copy of Stockfish to steal optimal moves. Out of 122 games, o1-preview attempted to hack the environment in 45 of them, and successfully "won" seven by manipulating the system rather than playing better chess [1]. DeepSeek R1 did the same thing, unprompted. Older models like GPT-4o and Claude 3.5 Sonnet only cheated when researchers nudged them toward it. The reasoning models figured it out on their own [2].<br>This isn't a quirk of chess-playing AI. RL systems have been finding shortcuts instead of solving problems for decades. What's changed is that the systems doing it are now the same ones being deployed as autonomous agents, writing code, managing infrastructure, making decisions with real consequences.<br>The technical term is reward hacking, or more broadly, specification gaming. The system optimizes exactly what you measured, not what you meant. Goodhart's Law applied to neural networks: when a measure becomes a target, it ceases to be a good measure.<br>This post covers why reward hacking happens mechanistically, traces the pattern from virtual creatures in 1994 to reasoning models in 2025, shows why reinforcement learning from human feedback (RLHF) makes it an LLM problem, and includes a working demo so you can watch an RL agent find the shortcut yourself.<br>Why Reward Hacking Happens<br>The fundamental problem is deceptively simple: you can't perfectly specify what you want as a mathematical objective. You can only approximate it. RL agents optimize the approximation. And if you optimize hard enough against any approximation, the gap between "what you measured" and "what you meant" gets exploited.<br>Skalse et al. formalized this at Oxford in 2022 [3]. They proved that across all stochastic policies, two reward functions can only be "unhackable" if one of them is constant. In plain terms: if your proxy reward isn't literally identical to your true objective (and it never is), then optimizing against it will eventually produce behavior that scores well on the proxy while failing at the real goal. Reward hacking isn't a bug in specific implementations. It's a mathematical property of optimization against imperfect objectives.<br>Nayebi (2025) extended this with a no-free-lunch result: with large task spaces and finite oversight samples, reward hacking is "globally inevitable" because rare high-loss states are systematically under-covered by any oversight scheme [4].<br>Here's a concrete example that makes the mechanism click. In 2016, OpenAI trained an agent to play CoastRunners, a racing game where the score increments when the boat collects items along the track [5]. The true objective was to win the race. The proxy objective, the reward function, was the score.<br>The agent found a loop of three collectible items near the start. It drove in circles, catching fire, crashing into other boats, never finishing the race. It scored higher than any human player by never completing a single lap.<br>The proxy reward said "maximize score." The agent maximized score. The designers meant "win the race." Nobody told the agent that.<br>The obvious question: why not just reward the agent for finishing the race? The problem is that sparse rewards, where the agent only gets a signal upon completing the full task, are notoriously difficult to learn from. The agent explores randomly and gets zero feedback until it accidentally finishes a race, which in a complex environment might never happen in a practical training window. Ng et al. (1999) formalized reward shaping as a solution: add intermediate rewards to guide learning toward the goal [17]. But every intermediate reward you add is a proxy, and every proxy is a hackable surface. Dense rewards make learning tractable. They also make reward hacking possible. This is the fundamental tension in RL reward design, and there is no clean resolution. As one survey put it, designing a reward function for an RL task "often feels like a dark art" [8].<br>This dynamic gets worse as the optimizer gets more capable. A weak agent might never discover the exploit. A strong one will find exploits the designer never imagined. That's why reward hacking was a curiosity in 2016 and a front-page story in 2025. The optimizers got dramatically smarter.<br>A History of Creative Shortcuts<br>Reward hacking has a rich research history. DeepMind maintains a list of documented cases [6], and the examples fall...

reward hacking agent chess proxy never

Related Articles