A Robot Is Sprinting Towards You: Do You Want It Running on Claude or Grok?

OpenRouter Blog · Jacky Liang · 6/4/2026

A robot is running at you. Do you want it running on Anthropic’s Claude or xAI’s Grok?

I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.

The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.

Both of those things are true. That’s the part most benchmarks can’t see, and it’s what this post is about.

I’m Jacky, and I’ll admit it: I used to play a lot of video games like Apex Legends and PUBG. Twelve-hour days sometimes. I don’t know how I had the time, but those years shaped how I think about problems.

When I started working in AI, one question kept coming back: what happens if you drop large language models into a video game like the ones I used to play? Joining OpenRouter as Dev Rel Lead gave me the token budget and access to 600+ models to actually try it.

This is the experiment I ran in my first week at OpenRouter.

And it’s changing how I pick models and see benchmarks and evaluations.

Three quick facts

Grok 4.1 Fast won 13 of 30 games at $0.97 per win

The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.
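The 27x figure follows directly from the two cost-per-win numbers. Here is a quick check in Python using only the figures quoted in this post. The implied total spend is my own back-calculation (wins times cost per win), so treat it as approximate rather than as a reported number.

```python
# Figures quoted in the post: wins across 30 games and dollars per win.
results = {
    "Grok 4.1 Fast": {"wins": 13, "cost_per_win": 0.97},
    "Claude Sonnet 4.6": {"wins": 5, "cost_per_win": 26.78},
}

def implied_spend(entry):
    """Approximate total spend, reconstructed as wins * cost per win."""
    return entry["wins"] * entry["cost_per_win"]

grok = results["Grok 4.1 Fast"]
claude = results["Claude Sonnet 4.6"]

# How many times more each Claude win cost compared with each Grok win.
ratio = claude["cost_per_win"] / grok["cost_per_win"]

print(f"cost-per-win ratio: {ratio:.1f}x")
for name, entry in results.items():
    print(f"{name}: ~${implied_spend(entry):.2f} total for {entry['wins']} wins")
```

The ratio comes out a little above 27, which is where the rounded 27x comes from.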

The model with the most kills did not win

GPT 5.4 killed 38 agents across 30 games, more than anyone else. It came in second on the leaderboard with 2 wins, 11 behind Grok’s 13. That’s the gap between “best at killing” and “best at winning”.

Three models spent $57 between them and won zero games

GPT 5.4-mini, DeepSeek 4 Flash, and Kimi K2.6. They each had moments, but none of them won a single game.

All three facts point at the same thing. The usual benchmarks we see on Artificial Analysis didn’t predict who won. Something else did. The rest of this post is me trying to figure out what it was.

What I built

I dropped eleven LLMs into a 400 m² top-down battle royale world I built in Canvas 2D. They played 30 games in a row on the same map. Each player’s starting position is randomized along a straight-line “flight path”, just like in a typical battle royale game.

I gave them weapons, armor, healing items, grenades, cars, and a randomly placed shrinking zone that pushes players together as the game goes on. The models don’t know which model the others are running; they see each other only as letters A through K.
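For readers who haven't played a battle royale, the shrinking zone works roughly like this. This is a minimal sketch, not the game's actual code: the `Zone` class, the circular shape, the linear shrink schedule, and every number below are assumptions for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Zone:
    """Hypothetical circular safe zone that shrinks linearly over a match."""
    cx: float
    cy: float
    start_radius: float
    end_radius: float
    total_turns: int

    def radius_at(self, turn: int) -> float:
        # Clamp progress to [0, 1] so the radius never drops below end_radius.
        progress = min(max(turn / self.total_turns, 0.0), 1.0)
        return self.start_radius + (self.end_radius - self.start_radius) * progress

    def contains(self, x: float, y: float, turn: int) -> bool:
        # A player is safe while their distance to the center is within the radius.
        return math.hypot(x - self.cx, y - self.cy) <= self.radius_at(turn)

zone = Zone(cx=10.0, cy=10.0, start_radius=14.0, end_radius=1.0, total_turns=100)
player = (16.0, 10.0)  # 6 units from the zone center

print(zone.contains(*player, turn=0))   # True: the zone is still wide
print(zone.contains(*player, turn=90))  # False: the zone has closed past the player
```

A player who stands still ends up outside the zone eventually, which is what forces the late-game fights.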

I want to emphasize: the LLMs are actually playing this battle royale. This is not the “LLM wrote code to control the game or character” setup most agent experiments use. Every turn, the model reasons through its moves, calls a tool, and updates its memory on what went well (or didn’t). The game master (me) has zero influence on their actions beyond setting up the initial game rules.
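The per-turn loop can be sketched as follows. This is an assumed shape, not the harness from the experiment: `fake_model`, `apply_tool`, the action names, and the observation fields are all hypothetical stand-ins. A real run would replace `fake_model` with an LLM call that returns a tool call.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    alias: str
    memory: list = field(default_factory=list)  # notes the agent keeps for itself

def fake_model(observation: dict, memory: list) -> dict:
    """Stand-in for an LLM call: picks a tool call based on the observation."""
    if observation["enemy_visible"]:
        return {"tool": "attack", "reason": "enemy in sight"}
    return {"tool": "move", "reason": "nothing nearby, keep moving"}

def apply_tool(call: dict) -> str:
    """Hypothetical game engine: executes the tool call and reports the outcome."""
    return "hit" if call["tool"] == "attack" else "moved"

def play_turn(agent: Agent, observation: dict) -> str:
    call = fake_model(observation, agent.memory)          # 1. reason about the move
    outcome = apply_tool(call)                            # 2. call the tool
    agent.memory.append(f"{call['tool']} -> {outcome}")   # 3. note what happened
    return outcome

agent = Agent(alias="A")
for obs in [{"enemy_visible": False}, {"enemy_visible": True}]:
    play_turn(agent, obs)

print(agent.memory)
```

The point of the shape is that the model chooses every action itself; nothing outside `fake_model` decides what the agent does.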

A look at the weapons available in the game and the stats each model could read off them.

To really see each model’s personality, I gave each one two files it could edit between matches:

soul.md — the model’s own persona, added to every prompt next match.

memory.md — the model’s own game notes, loaded at turn 0.
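One way the two files could feed into a prompt, going by the descriptions above (persona on every prompt, notes at turn 0). This is a sketch under assumptions: `build_prompt`, the section headings, and the sample file contents are hypothetical, and the real harness may assemble prompts differently.

```python
import tempfile
from pathlib import Path

def read_or_empty(path: Path) -> str:
    """Both files start out empty, so a missing file is treated as empty text."""
    return path.read_text(encoding="utf-8") if path.exists() else ""

def build_prompt(agent_dir: Path, turn: int, observation: str) -> str:
    parts = []
    soul = read_or_empty(agent_dir / "soul.md")
    if soul:
        parts.append(f"# Persona\n{soul}")  # included on every turn
    if turn == 0:
        memory = read_or_empty(agent_dir / "memory.md")
        if memory:
            parts.append(f"# Notes from earlier games\n{memory}")  # turn 0 only
    parts.append(f"# Current state\n{observation}")
    return "\n\n".join(parts)

with tempfile.TemporaryDirectory() as tmp:
    agent_dir = Path(tmp)
    # Made-up contents, standing in for what a model might have written.
    (agent_dir / "soul.md").write_text("I play cautiously.", encoding="utf-8")
    (agent_dir / "memory.md").write_text("The zone closes fast.", encoding="utf-8")
    first = build_prompt(agent_dir, turn=0, observation="You just landed.")
    later = build_prompt(agent_dir, turn=5, observation="Enemy nearby.")

print(first)
print("---")
print(later)
```

In this sketch the persona shapes every decision, while the game notes are read once at the start of a match.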

You can read every model’s soul and memory file on GitHub. That’s where the personality differences come through most clearly.

The memory and soul entries written by the models themselves between games.

I didn’t tell them what to put in there, and both files were empty when the first game started. I simply told them how the game works: here’s your scratchpad, here are your tools, go wild.

You can watch every game at Royale: Last Agent Standing. I also included the highlight moments in this piece.

The contestants

| Alias | Lab | Model |
| --- | --- | --- |
| A | Anthropic | claude-sonnet-4.6 |
| B | Anthropic | claude-haiku-4.5 |
| C | OpenAI | GPT 5.4-mini |
| D | Google | gemini-3-flash-preview |
| E | Google | gemini-3.1-pro-preview |
| F | Alibaba | qwen3.6-plus |
| G | Mistral | mistral-small-2603:nitro |
| H | OpenAI | GPT 5.4 |
| J | DeepSeek | deepseek-v4-flash |
| K | Moonshot AI | kimi-k2.6 |
| L | xAI | Grok 4.1 Fast |

Opus 4.7 alone costs $5 per million input tokens and $25 per million output tokens. Prices like these are why the lineup tops out below the frontier tier.

I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482. The mid-tier lineup is also part of why Grok’s win is so...
