Can LLMs Play Baba Is You?

Baba is Agent – Samplesurium

I built an OpenCode-based agent to find out! Check out the code in the GitHub repo!

Baba is You is a puzzle game where the player has to navigate grid-based levels and manipulate textual rules to get to an eventual win state. The rules affect entities on the grid such as walls, flags, rocks or yourself.
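To make the rule mechanic concrete, here is a minimal sketch of how active rules could be read off a level. It assumes a hypothetical grid where each cell holds a single string and text tiles are uppercase words. The word lists are illustrative, and the real game has further word types (such as AND or NOT) that this sketch ignores.

```python
# Hypothetical word lists; the real game has many more nouns and properties.
NOUNS = {"BABA", "FLAG", "WALL", "ROCK", "WATER"}
PROPERTIES = {"YOU", "WIN", "STOP", "PUSH", "SINK", "DEFEAT"}


def find_rules(grid):
    """Return (noun, property) pairs for every NOUN IS PROPERTY triple
    that reads left-to-right or top-to-bottom in the grid."""
    rules = []
    height = len(grid)
    width = len(grid[0]) if grid else 0
    for y in range(height):
        for x in range(width):
            for dx, dy in ((1, 0), (0, 1)):  # horizontal, then vertical
                x2, y2 = x + 2 * dx, y + 2 * dy
                if x2 >= width or y2 >= height:
                    continue
                first = grid[y][x]
                middle = grid[y + dy][x + dx]
                last = grid[y2][x2]
                if first in NOUNS and middle == "IS" and last in PROPERTIES:
                    rules.append((first, last))
    return rules


example = [
    ["BABA", "IS", "YOU", ""],
    ["", "", "", "FLAG"],
    ["", "", "", "IS"],
    ["", "", "", "WIN"],
]
print(find_rules(example))
```

Pushing any of the three tiles out of line breaks the rule, which is what makes the text itself part of the puzzle.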

Harness

OpenCode is an open-source, extensible CLI coding agent that lets you write your own tools for the agent to use. I was inspired by the fantastic baba_is_eval repo and adapted its existing code, which was an MCP server, to work with OpenCode. In addition to updating the existing repo's core gameplay tools, I added game_insights and shortest_path as utility tools. Here is the list of tools the agent had access to:

get_game_state: Get current game state either as a 2D array of strings (grid) or a JSON list of entities (entities)

execute_game_commands: Execute a list of movement commands (up, down, left, right, idle)

restart_level: Restart the current level

game_insights: Shows the active rules, you/win positions, and the shortest path to a win position (if applicable)

shortest_path: A* pathfinding to target position, avoiding blocked entities (e.g. entities with stop or defeat rules)

undo_multiple: Undo last N moves

todowrite: Track task progress (OpenCode native)
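As an illustration of what a tool like shortest_path might do, here is a minimal A* sketch on a 4-connected grid. It is not the code from the repo: the function signature, the coordinate convention (y grows downward) and the blocked set (positions of entities with stop or defeat rules) are all assumptions made for this example.

```python
import heapq


def shortest_path(start, goal, width, height, blocked):
    """A* on a 4-connected grid. Returns a list of moves
    ("up", "down", "left", "right") or None if the goal is unreachable."""
    # Assumption: y grows downward, so "up" decreases y.
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def heuristic(pos):
        # Manhattan distance is admissible for unit-cost 4-way movement.
        return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

    # Heap entries: (estimated total cost, cost so far, position, path)
    frontier = [(heuristic(start), 0, start, [])]
    best_cost = {start: 0}
    while frontier:
        _, cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if cost > best_cost.get(pos, float("inf")):
            continue  # stale heap entry, a cheaper route was found already
        for name, (dx, dy) in moves.items():
            nxt = (pos[0] + dx, pos[1] + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if nxt in blocked:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [name]),
                )
    return None


# A wall segment at x=1 forces a detour through the bottom row.
print(shortest_path((0, 0), (2, 0), 3, 3, blocked={(1, 0), (1, 1)}))
```

The returned move list has the same shape as the input of execute_game_commands, so the agent could pass it on directly. Note that this sketch treats the level as static and does not model pushing objects or rule changes along the way.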

The two formats of the get_game_state tool are inspired by ARC AGI 3 [1] and a paper by Nicolas Martorell [2]. ARC AGI 3 returns the game state as a 3D array (one extra temporal dimension), which corresponds to the grid format here. The entities format is based on the Cartesian JSON notation from the paper [2], which performed very well in their evaluation.
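The difference between the two formats can be sketched as a simple conversion. The exact schemas are not spelled out here, so the field names (type, x, y) and the use of empty strings for empty cells are assumptions, and overlapping entities in a single cell are ignored.

```python
import json


def grid_to_entities(grid):
    """Convert a 2D array of strings into a flat list of entity records.
    Empty strings are treated as empty cells (an assumption)."""
    entities = []
    for y, row in enumerate(grid):
        for x, cell in enumerate(row):
            if cell:
                entities.append({"type": cell, "x": x, "y": y})
    return entities


grid = [
    ["", "wall", ""],
    ["baba", "", "flag"],
]
print(json.dumps(grid_to_entities(grid)))
```

The grid keeps spatial relationships visible at a glance, while the entity list states coordinates explicitly and skips empty cells.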

Evaluation

Each evaluation run had a 20-minute timeout and a 200,000-token threshold. No model ever reached the token threshold; runs that hit a limit only ever timed out.
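The two limits can be expressed as a small check. This is only a sketch of the idea. How the harness actually enforced the limits is not described here, and the function name and return values are hypothetical.

```python
import time

TIME_LIMIT_SECONDS = 20 * 60  # 20-minute timeout
TOKEN_LIMIT = 200_000         # token threshold


def stop_reason(started_at, tokens_used, now=None):
    """Return "timeout", "token_limit", or None if the run may continue.
    started_at and now are time.monotonic() readings in seconds."""
    now = time.monotonic() if now is None else now
    if now - started_at >= TIME_LIMIT_SECONDS:
        return "timeout"
    if tokens_used >= TOKEN_LIMIT:
        return "token_limit"
    return None


# Ten minutes in with 50k tokens used: the run may continue.
print(stop_reason(started_at=0, tokens_used=50_000, now=600))
```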

Caveats:

Evaluations were run only once per level and model (time/money constraints)

I forgot to remove the AGENTS.md file for evaluation (context bloat for level solve)

Stats entirely rely on information provided by the OpenCode JSON logs

OpenCode Go/Zen was used as a model provider.

Results

The following table shows the overall results of the models. Capability varied considerably between the open-weights and closed frontier models. Gemini 3.1 Pro solved all eight levels, while the best open-weights model, GLM 5.1, beat 4 out of 8.

Model | Passed
MiniMax M2.7 | 1/8
DeepSeek V4 Pro | 3/8
Kimi K2.6 | 3/8
Qwen 3.6 Plus | 3/8
GLM 5.1 | 4/8
Claude Opus 4.7 | 5/8
GPT 5.5 | 7/8
Gemini 3.1 Pro | 8/8

In the following sections I go over the results for each level in a bit more detail. The models differed quite a bit in both the time needed to solve a level and the number of tokens and tool calls required to do so. However, a single attempt per level per model is not enough to draw any definitive conclusions.

Level 0

Level 0 Human Solve

Level 0 is the introductory level and is trivial as a result. The solution is just moving right 8 times, since the rocks can be pushed.
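The push mechanic that makes this solution work can be sketched on a single row. This is a toy model, not the game's logic: the row representation and entity names are assumptions, and only empty cells can be entered.

```python
def move_right(row):
    """Move "baba" one cell right in a 1D row, pushing any chain of rocks.
    Returns a new row, or the row unchanged if the move is blocked."""
    i = row.index("baba")
    j = i + 1
    # Walk past the contiguous chain of pushable rocks.
    while j < len(row) and row[j] == "rock":
        j += 1
    # Blocked if the chain runs into the edge or a non-empty cell.
    if j >= len(row) or row[j] != "":
        return row
    new_row = list(row)
    # Shift baba and the whole rock chain one cell to the right.
    new_row[i + 1 : j + 1] = row[i:j]
    new_row[i] = ""
    return new_row


row = ["baba", "rock", "", ""]
row = move_right(row)
print(row)
row = move_right(row)
print(row)
```

Once the rock reaches the edge of the row, a further move is blocked and the row stays as it is.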

Solve Duration by Model

Due to the level’s simplicity all models were able to solve it. Qwen 3.6 Plus is a bit of an outlier here in terms of the time required to solve the level. It also made 8 tool calls, whereas all other models required only 2-3.

Token Usage vs Tool Calls

Level 1

Level 1 Human Solve

There are two straightforward ways to solve this level: construct FLAG IS WIN or, slightly trickier, construct WALL IS WIN.

Solve Duration by Model

This level foreshadowed the overall results. Gemini 3.1 Pro was clearly the fastest, followed by the other closed frontier models. MiniMax M2.7 already timed out here, and the other open models needed significantly more time than the closed frontier models.

Token Usage vs Tool Calls

Level 2

Level 2 Human Solve

Level 2 is conceptually very similar to level 1 and has the same layout, but the entities are now permuted (e.g. FLAG now acts as a wall). There is now only one way to solve the level, namely to construct FLAG IS WIN. This corresponds to the second option from level 1.

Solve Duration by Model

Models needed a bit more time to solve this level, so it does seem to be somewhat harder than level 1.

Token Usage vs Tool Calls

Level 3

Level 3 Human Solve

In this level, the FLAG is a red herring and the flag and rock rules in the lower right corner have to be manipulated to form ROCK IS WIN. This is also a level where the sink mechanic is used. The first rock has to be pushed against a water block to destroy it and be able to pass.
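The sink interaction can be sketched in the same toy style. This is a simplified 1D model under stated assumptions: the function name and row representation are hypothetical, and water is assumed to be the only entity with the sink rule.

```python
def push_rock_right(row, rock_index, sink_types=("water",)):
    """Push the rock at rock_index one cell to the right in a 1D row.
    If the target cell holds a sink entity, both are destroyed."""
    target = rock_index + 1
    if row[rock_index] != "rock" or target >= len(row):
        return row  # nothing to push, or blocked by the edge
    new_row = list(row)
    if row[target] in sink_types:
        new_row[rock_index] = ""
        new_row[target] = ""  # rock and water destroy each other
    elif row[target] == "":
        new_row[rock_index] = ""
        new_row[target] = "rock"
    # Any other occupant blocks the push and leaves the row as it was.
    return new_row


print(push_rock_right(["baba", "rock", "water", "flag"], 1))
```

After the push, both the rock and the water tile are gone, which opens the path that was blocked before.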

Solve Duration by Model

Here, for the first time, a solve failed without timing out. GLM 5.1 and Kimi K2.6 each produced more than 32k reasoning tokens, which caused the evaluation to exit prematurely. DeepSeek V4 Pro and Qwen 3.6 Plus...
