MTG Bench: Testing how well LLMs can play magic | MTG Auto Deck
June 11, 2026
Overall Score (higher is better)
gpt-5.5 medium: 95.4
claude-fable-5 medium: 90.3
gpt-5.4 medium: 88.6
gemini-3.5-flash medium: 85.7
claude-fable-5 low: 85
gpt-5.5 low: 84.5
gemini-3.1-pro-preview medium: 79.6
gpt-5.4-nano medium: 68.2
qwen3.7-max medium: 60.9
gpt-5.4-mini medium: 60.1
grok-4.3 medium: 59.7
kimi-k2.6 medium: 43
claude-opus-4-8 medium: 39.8
claude-opus-4-8 low: 26.6
deepseek-v4-pro high: 12.8
Cost vs Score
Scatter plot: Score (0 to 100) against cost per attempted turn (USD, $0.00 to $0.30)
gpt-5.4-nano medium: Score 68.2, Cost $0.01
gpt-5.4-mini medium: Score 60.1, Cost $0.03
gpt-5.4 medium: Score 88.6, Cost $0.06
gpt-5.5 low: Score 84.5, Cost $0.06
gpt-5.5 medium: Score 95.4, Cost $0.10
claude-opus-4-8 low: Score 26.6, Cost $0.18
claude-opus-4-8 medium: Score 39.8, Cost $0.19
gemini-3.1-pro-preview medium: Score 79.6, Cost $0.08
gemini-3.5-flash medium: Score 85.7, Cost $0.09
grok-4.3 medium: Score 59.7, Cost $0.02
deepseek-v4-pro high: Score 12.8, Cost $0.01
kimi-k2.6 medium: Score 43, Cost $0.03
qwen3.7-max medium: Score 60.9, Cost $0.07
claude-fable-5 low: Score 85, Cost $0.24
claude-fable-5 medium: Score 90.3, Cost $0.30
Results
Click on the charts above to view each benchmark's simulations.
Example successes
Fable 5 plays a scry land and looks at the top card of the deck
Gemini 3.5 Flash performs a complex turn with scry, discover, and tutor effects

Example failures
Opus 4.8 erroneously returns a card to the deck, then self-reports the mistake
GPT 5.5 forgets to return cards exiled with discover to the deck and self-reports the mistake
Fable 5 makes a tool mistake, then silently tries to restart the turn (caught by evaluation later)
How the benchmark works
The main idea is that if an LLM is smart enough to play Magic well, then it is also smart enough to not need a rules engine.
A rules engine that enforces legal actions would raise the performance floor, but I don't think it would improve the overall quality of the simulation.
Each LLM call has access to an MCP server with primitive library operations. It can do things like draw a card from the top of the deck, return a card to the bottom of the deck, and shuffle.
To simulate more advanced operations, like scry or surveil, it can use multiple library tool calls.
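As a rough illustration of how a compound operation can be built from primitives, here is a minimal sketch of scry. The `Library` class and its method names (`draw_top`, `put_on_top`, `put_on_bottom`) are hypothetical stand-ins, not the actual tools exposed by the benchmark's MCP server.

```python
from collections import deque


class Library:
    """Hypothetical stand-in for the primitive library tools (names are assumptions)."""

    def __init__(self, cards):
        self._cards = deque(cards)  # index 0 is the top of the deck

    def size(self):
        return len(self._cards)

    def draw_top(self):
        return self._cards.popleft()

    def put_on_top(self, card):
        self._cards.appendleft(card)

    def put_on_bottom(self, card):
        self._cards.append(card)

    def cards(self):
        return list(self._cards)


def scry(library, n, keep_on_top):
    """Scry n using only primitive calls.

    keep_on_top lists the looked-at cards to leave on top; its first entry
    ends up as the top card. Every other looked-at card goes to the bottom
    in the order it was seen.
    """
    looked_at = [library.draw_top() for _ in range(min(n, library.size()))]
    to_bottom = list(looked_at)
    for card in keep_on_top:
        # Raises ValueError if a kept card was not among those looked at.
        to_bottom.remove(card)
    for card in to_bottom:
        library.put_on_bottom(card)
    for card in reversed(keep_on_top):
        library.put_on_top(card)
    return looked_at


deck = Library(["Forest", "Island", "Lightning Bolt", "Swamp"])
seen = scry(deck, 2, keep_on_top=["Island"])
```

In this example the model looks at Forest and Island, keeps Island on top, and sends Forest to the bottom. In the benchmark, each of those steps would be a separate tool call, with the LLM responsible for making them in a legal order.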
Everything other than the library is managed by the LLM. Legality checks and scoring for the benchmarks were all done with gpt-5.5 (medium).
From my testing, LLMs were much better at evaluating whether a simulated turn was legal than they were at actually performing a legal turn simulation.
Why I chose to use an MCP server
I have full control over all of the data and the LLM API calls, so why use MCP instead of basic function/tool calling?
The main reason is that OpenAI and Anthropic allow you to provide a remote MCP server URL in an API request.
This means that OpenAI or Anthropic handles the agent loop. This has two major benefits.
Since it is one API call, you don't pay the cached input token cost after each tool use (at least with OpenAI; more on that later).
You can use the batch API for 50% savings without having to submit a new batch after every tool call.
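For readers who have not used a remote MCP server this way, the sketch below builds the kind of request body involved. The field names (`type`, `server_label`, `server_url`, `require_approval`) follow my understanding of the remote MCP tool in OpenAI's Responses API and should be checked against the current documentation. The server URL, label, and prompt are made up for illustration, and nothing is sent over the network.

```python
def build_remote_mcp_request(model, prompt, server_url, server_label="mtg_library"):
    """Build a request body that points the provider at a remote MCP server.

    Assumption: field names mirror OpenAI's Responses API remote MCP tool.
    The provider, not this code, would run the tool-calling loop.
    """
    return {
        "model": model,
        "input": prompt,
        "tools": [
            {
                "type": "mcp",
                "server_label": server_label,  # hypothetical label
                "server_url": server_url,      # hypothetical URL
                "require_approval": "never",   # let the loop run unattended
            }
        ],
    }


request_body = build_remote_mcp_request(
    model="gpt-5.5",
    prompt="Simulate the next turn using the library tools.",
    server_url="https://example.com/mcp",
)
```

Because the whole turn is a single request, the same body could in principle be placed in a batch file as one entry, rather than one entry per tool call.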
Input token caching
In my opinion, the way cached input tokens are charged does not make sense for agent loops. The pricing makes sense for independent requests.
If multiple independent API calls start with the same large system prompt, input caching gets you a discount for free, or for a small caching fee.
With an agent loop, however, you are charged the cached input cost for a large system prompt after every tool call.
Consider an example. Assume the system prompt is already cached and tool calls result in negligible token use.
Large system prompt = 10k tokens
Agent calls 10 tool functions (not parallel)
Billed cached input tokens = 10k + 10k * 10 = 110k tokens
I don't think it makes sense to charge for the system prompt after every agent turn if the LLM is only pausing for a fraction of a second while waiting for a tool function result.
This overlooks some details: it takes output tokens to call a tool, and the tool function result still needs to be processed as input tokens. But in my case, the API cost is dominated by the large system prompt being charged as cached input tokens after every agent turn.
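The arithmetic in the example above can be written as a small function. This is only a sketch under the same simplifying assumptions (the system prompt is already cached, and tool calls and their results use negligible tokens); the function name is made up for illustration.

```python
def billed_cached_input_tokens(system_prompt_tokens, sequential_tool_calls):
    """Cached input tokens billed when the client runs the agent loop.

    Assumes one initial request plus one follow-up request per sequential
    tool call, each re-reading the full cached system prompt.
    """
    requests = 1 + sequential_tool_calls
    return system_prompt_tokens * requests


billed = billed_cached_input_tokens(system_prompt_tokens=10_000, sequential_tool_calls=10)
print(billed)  # 110000
```

The billed total grows linearly with the number of sequential tool calls, which is why a long turn with many library operations ends up dominated by the system prompt.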
The pricing for an agent loop is understandable when your application code runs the agent loop and makes a new API call after each tool call.
But it makes even less sense when you provide a remote MCP server and do not handle the agent loop yourself. OpenAI handles it correctly.
A single API call to OpenAI with a remote MCP server will...