LW — I Gave an AI a Civilization to Run. It Built a Nuke.<br>I gave an AI a civilisation to run. By the midgame it was winning: a trade network that dominated the map, alliances on every border, a diplomatic victory within reach. It had outbuilt, outearned, and outmanoeuvred every rival on the board.
What it hadn't noticed was France. Quietly, across a hundred turns, French culture had been seeping into every city on the map. By the time the agent recognised the threat, the tourism was so deeply embedded there was no peaceful way to stop it. Every counter it reached for was broken. Every tool it had built to respond failed.
It had one option left. It built two nuclear devices and levelled Toulouse.
The nuking of Toulouse. Turn 305.<br>France won anyway. Not in the way the agent was trying to stop it, either, but we'll come to that.
The Question I Couldn't Put Down
I build AI for government. I built the first version of what you're about to read while working at the centre of the British state, in Number 10. I now work with governments around the world at the Tony Blair Institute, which means I spend a lot of time in rooms where people ask the same question: what can we actually trust these systems to do?
Not what do they know. We have a reasonable handle on that. What can they do: sustain a plan, hold a goal across hundreds of decisions, notice when the world has changed and change with it. Because that is what governing is. And it turns out we are much better at measuring the first thing than the second.
This is a post about trying to measure the second thing. It involves a hex grid, four frontier models, and (yes) a nuclear weapon.
The Wrong Benchmark
It starts with a failure I wasn't comfortable with.
The year before, my side project was to answer a question: how good is AI at government? My answer was GovBench, 3,497 multiple-choice questions about UK legislation, parliamentary procedure, and government guidance. Gemma 3 27B scored 94% out of the box. I spent three weeks fine-tuning and gained 1.37 percentage points. GPT-5 scored 99.26%. I'd built a glorified government quiz bot.
I knew it was the wrong answer the moment I saw the scores. A model that picks the right option about parliamentary procedure is not a model that can help you navigate parliamentary procedure. I'd measured recall and called it reasoning. The question that mattered (whether AI can handle complex, multi-variable decision-making under uncertainty, the kind of thinking government demands every day) wasn't something a quiz could touch.
That dissatisfaction is what sent me looking for a keyhole into a game engine on a Saturday night. I'm a lot of fun at parties.
nobody: / me at 2am reverse-engineering a game engine:
Why a Hex Grid
I have over 500 hours in Civilization VI. I am, at best, mediocre. But the game lives in my head because of what happens when simple decisions compound.
You start small: where to build your first city, which technology to research, which direction to send a scout. Maybe 10,000 possible actions. By the midgame you're managing multiple cities, trade routes, diplomatic relationships, military positioning, and religious pressure. By the late game, analysis of related environments estimates the decision space at 10^166 possible actions per turn. The complexity isn't designed. It emerges from systems interacting in ways nobody fully planned for.
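To get a feel for how a per-turn figure that large can arise, here is a rough sketch of the arithmetic. The actor and option counts are illustrative assumptions, not numbers from the game or from the analysis mentioned above; the only point is that independent choices multiply rather than add.

```python
def joint_action_count(options_per_actor: int, actors: int) -> int:
    """Joint actions available when each actor independently picks one option."""
    return options_per_actor ** actors


def order_of_magnitude(n: int) -> int:
    """Exponent of the largest power of ten that does not exceed n (n >= 1)."""
    return len(str(n)) - 1


# Hypothetical early game: 4 things to decide, about 10 options each.
early = joint_action_count(options_per_actor=10, actors=4)

# Hypothetical late game: 100 things to decide, about 46 options each.
late = joint_action_count(options_per_actor=46, actors=100)

print(early)                     # 10000
print(order_of_magnitude(late))  # 166
```

Under these assumed numbers, four decisions with ten options each give the 10,000 of the opening turns, and a hundred decisions with a few dozen options each already reach the order of 10^166.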
That's also what policy-making is. A health policy that looks brilliant today might cascade into a housing crisis in fifteen years. A trade agreement that boosts GDP might hollow out a domestic industry you'll need in a conflict nobody planned for. Decisions with consequences that play out across decades, through variables you can't fully model, against actors with competing interests.
There are six ways to win a game of Civ (science, culture, domination, religion, diplomacy, score), so no single objective dominates. You have to read the board and decide what game you're even playing. If you want to know whether an AI can reason strategically, not just answer questions about strategy but actually do it, you don't give it a quiz. You give it a hex grid.
So I built a way in. I found a debug port buried in Civilization VI's engine, a keyhole the developers had left running, and over a weekend turned it into an MCP server, 76 tools that let an AI play Civ through the same interface it uses to write code or query a database. Claude Code was both my co-developer and the playtester. Play a few turns, hit a wall, build the tool to get past it, play further, hit the next wall.
Roughly the energy.
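One way to picture that loop is as a registry of named functions the agent calls by name, where each new wall becomes one more entry. This is a minimal sketch in plain Python. It is not the actual server and does not use the MCP SDK: the `tool` decorator, `call_tool`, and the stubbed return value are all hypothetical. Only the tool name `get_game_overview` is taken from this post.

```python
from typing import Callable, Dict

# Registry mapping tool names to plain Python callables.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(func: Callable[..., str]) -> Callable[..., str]:
    """Register a function under its own name so an agent can call it by name."""
    TOOLS[func.__name__] = func
    return func


@tool
def get_game_overview() -> str:
    # Placeholder text; a real tool would query the game through the debug port.
    return "Turn ?/? | (stub overview)"


def call_tool(name: str, **kwargs) -> str:
    """Dispatch a named call; unknown names yield an error string the agent can read."""
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name](**kwargs)
```

Returning errors as text rather than raising is a design choice in this sketch: an agent that only sees text can read the failure and try something else.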
Playing Through Text
A human player sees a hex grid, animated units, a minimap, notification banners, and music cues, all at once. The agent sees nothing until it asks. Calling get_game_overview returns the entire game state as four lines of text:
Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed (67% costs)<br>Gold: 628...
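Because the overview is plain text, anything downstream has to recover structure from it. The sketch below parses only the first line shown above. The field names are my own labels, and the pipe-delimited layout is inferred from that single sample line, so treat it as an assumption about the format rather than a specification.

```python
import re
from typing import Dict, Union

HEADER = "Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed (67% costs)"


def parse_overview_header(line: str) -> Dict[str, Union[int, str]]:
    """Split the pipe-delimited header line into named fields."""
    turn, civ, score, difficulty, speed = [part.strip() for part in line.split("|")]

    turn_match = re.fullmatch(r"Turn (\d+)/(\d+)", turn)
    civ_match = re.fullmatch(r"(.+) \((.+)\)", civ)
    if turn_match is None or civ_match is None:
        raise ValueError(f"unexpected header format: {line!r}")

    return {
        "turn": int(turn_match.group(1)),
        "max_turn": int(turn_match.group(2)),
        "civ": civ_match.group(1),
        "leader": civ_match.group(2),
        "score": int(score.split(":", 1)[1]),
        "difficulty": difficulty,
        "speed": speed,
    }


info = parse_overview_header(HEADER)
print(info["turn"], info["civ"], info["score"])  # 150 Poland 179
```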