MTG Bench: Testing how well LLMs can play magic

MTG Auto Deck, June 11, 2026

Overall Score (higher is better)

gpt-5.5 medium: 95.4
claude-fable-5 medium: 90.3
gpt-5.4 medium: 88.6
gemini-3.5-flash medium: 85.7
claude-fable-5 low: 85
gpt-5.5 low: 84.5
gemini-3.1-pro-preview medium: 79.6
gpt-5.4-nano medium: 68.2
qwen3.7-max medium: 60.9
gpt-5.4-mini medium: 60.1
grok-4.3 medium: 59.7
kimi-k2.6 medium: 43
claude-opus-4-8 medium: 39.8
claude-opus-4-8 low: 26.6
deepseek-v4-pro high: 12.8

Cost vs Score

Score (0 to 100) plotted against cost per attempted turn (USD):

gpt-5.4-nano medium: score 68.2, cost $0.01
deepseek-v4-pro high: score 12.8, cost $0.01
grok-4.3 medium: score 59.7, cost $0.02
gpt-5.4-mini medium: score 60.1, cost $0.03
kimi-k2.6 medium: score 43, cost $0.03
gpt-5.4 medium: score 88.6, cost $0.06
gpt-5.5 low: score 84.5, cost $0.06
qwen3.7-max medium: score 60.9, cost $0.07
gemini-3.1-pro-preview medium: score 79.6, cost $0.08
gemini-3.5-flash medium: score 85.7, cost $0.09
gpt-5.5 medium: score 95.4, cost $0.10
claude-opus-4-8 low: score 26.6, cost $0.18
claude-opus-4-8 medium: score 39.8, cost $0.19
claude-fable-5 low: score 85, cost $0.24
claude-fable-5 medium: score 90.3, cost $0.30

Results

Click on the charts above to view each benchmark's simulations.

Example successes

- Fable 5 plays a scry land and looks at the top card of the deck
- Gemini 3.5 Flash performs a complex turn with scry, discover, and tutor effects

Example failures

- Opus 4.8 erroneously returns a card to the deck, then self-reports the mistake
- GPT 5.5 forgets to return cards exiled with discover to the deck, then self-reports the mistake
- Fable 5 makes a tool mistake, then silently tries to restart the turn (caught by evaluation later)

How the benchmark works

The main idea is that if an LLM is smart enough to play good magic, then it is also smart enough to not need a rules engine. A rules engine that enforces legal actions would improve the performance floor, but I don't think it would improve the overall quality of the simulation.

Each LLM call has access to an MCP server with primitive library operations. It can do things like draw a card from the top of the deck, return a card to the bottom of the deck, and shuffle. To simulate more advanced operations, like scry or surveil, it can chain multiple library tool calls.
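To make the idea of composing primitives concrete, here is a minimal sketch in Python. It is not the benchmark's actual server code: the `Library` class, its method names, and the `put_on_top` operation are all assumptions made for illustration. It shows how a scry 1 could be expressed as two primitive calls, with the decision about where the card goes left to the caller (in the benchmark, the LLM).

```python
import random
from collections import deque


class Library:
    """Hypothetical stand-in for the deck state behind the MCP server."""

    def __init__(self, cards, seed=None):
        self._cards = deque(cards)  # index 0 is the top of the deck
        self._rng = random.Random(seed)

    def draw_top(self):
        """Remove and return the top card."""
        return self._cards.popleft()

    def put_on_top(self, card):
        """Assumed primitive: place a card on top of the deck."""
        self._cards.appendleft(card)

    def put_on_bottom(self, card):
        """Place a card on the bottom of the deck."""
        self._cards.append(card)

    def shuffle(self):
        """Randomize the order of the remaining cards."""
        cards = list(self._cards)
        self._rng.shuffle(cards)
        self._cards = deque(cards)

    def size(self):
        return len(self._cards)


def scry_one(library, keep_on_top):
    """Scry 1 from primitives: take the top card, then return it to top or bottom."""
    card = library.draw_top()
    if keep_on_top(card):
        library.put_on_top(card)
    else:
        library.put_on_bottom(card)
    return card


# Usage: look at the top card and bottom it if it is a land we do not need.
lib = Library(["Island", "Opt", "Forest"])
seen = scry_one(lib, keep_on_top=lambda card: card != "Island")
```

Because nothing in a setup like this enforces that the card goes back, a model that forgets the second call leaves the library in an illegal state, which is the kind of mistake the evaluation step has to catch.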

Everything other than the library is managed by the LLM. Legality checks and scoring for the benchmarks were all done with gpt-5.5 (medium). In my testing, LLMs were much better at evaluating whether a simulated turn was legal than they were at actually performing a legal turn simulation.

Why I chose to use an MCP server

I have full control over all of the data and the LLM api calls, so why use MCP instead of basic function/tool calling?

The main reason is that OpenAI and Anthropic allow you to provide a remote MCP server url in an api request. This means that OpenAI or Anthropic handle the agent loop. This has two major benefits.

- Since it is one api call, you don't pay the cached input token cost after each tool use (at least with OpenAI; more on that later).
- You can use the batch api for 50% savings without having to submit a new batch after every tool call.
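As a rough illustration of what "providing a remote MCP server url in an api request" looks like, the sketch below builds (but does not send) a request body. The field names follow my understanding of the remote MCP tool in OpenAI's Responses API and should be checked against the current documentation. The server label, the URL, and the helper name `build_remote_mcp_request` are placeholders invented for this example, not values from the benchmark.

```python
import json


def build_remote_mcp_request(model, system_prompt, user_prompt, mcp_url):
    """Build a request body that points the provider at a remote MCP server.

    Assumption: the provider runs the whole agent loop against the MCP server,
    so the caller makes one request instead of one request per tool call.
    """
    return {
        "model": model,
        "instructions": system_prompt,
        "input": user_prompt,
        "tools": [
            {
                "type": "mcp",
                "server_label": "mtg_library",  # hypothetical label
                "server_url": mcp_url,          # placeholder URL
                "require_approval": "never",    # let tool calls run unattended
            }
        ],
    }


payload = build_remote_mcp_request(
    model="gpt-5.5",
    system_prompt="You are simulating a turn of Magic.",
    user_prompt="Simulate the next turn.",
    mcp_url="https://example.com/mcp",
)
print(json.dumps(payload, indent=2))
```

The same body shape is what would go into each line of a batch file, which is why a single-call agent loop and batch pricing combine so easily.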

Input token caching

In my opinion, the way cached input tokens are charged does not make sense for agent loops. The pricing makes sense for independent requests. If multiple independent api calls start with the same large system prompt, input caching gets you a discount for free, or for a small caching fee.

With an agent loop, however, you are charged the cached input cost for a large system prompt after every tool call. Consider an example. Assume the system prompt is already cached and tool calls result in negligible token use.

- Large system prompt = 10k tokens
- Agent calls 10 tool functions (not in parallel)
- Billed cached input tokens = 10k + 10k * 10 = 110k tokens
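The arithmetic above can be written as a small function. This is a simplified model under the same assumptions as the example (the system prompt is already cached, and tool calls and their results use negligible tokens). The function name and the `rebill_per_tool_call` flag are made up for illustration.

```python
def billed_cached_tokens(system_prompt_tokens, tool_calls, rebill_per_tool_call=True):
    """Cached input tokens billed for one agent run.

    If the prompt is re-billed after every tool call, the run counts as one
    initial request plus one follow-up request per tool call. Otherwise the
    prompt is billed once for the whole run.
    """
    requests = 1 + tool_calls if rebill_per_tool_call else 1
    return system_prompt_tokens * requests


# Agent loop billed per tool call: 10k + 10k * 10
per_call = billed_cached_tokens(10_000, 10)

# Same run if the whole loop were billed as a single request
single_call = billed_cached_tokens(10_000, 10, rebill_per_tool_call=False)

print(per_call, single_call)  # 110000 10000
```

Under these assumptions the per-tool-call billing is 11 times the single-request figure, and the gap grows linearly with the number of tool calls.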

I don't think it makes sense to charge for the system prompt after every agent turn if the LLM is only pausing for a fraction of a second while waiting for a tool function result. This is overlooking some details, like how it takes output tokens to call a tool, and the tool function result still needs to be processed as input tokens. But in my case, the api cost is dominated by the large system prompt being charged as cached input tokens after every agent turn.

The pricing for an agent loop is understandable when your application code has the agent loop, and is making a new api call after each tool call. But it makes even less sense when you provide a remote MCP server and do not handle the agent loop yourself. OpenAI handles it correctly. A single api call to OpenAI with a remote MCP server will...
