LLM Chess – Leaderboard

Simulating chess games between a Random Player and an LLM to evaluate chat models' (1) chess proficiency and (2) instruction-following abilities.

Random Player (White)

♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜ ♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟ ♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙ ♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖

GAME OVER

- Outcome: Draw

- Max moves reached: 200

- Material White: 16

- Material Black: 18

GPT-4o Mini (Black)

Can Large Language Models play chess? Let's find out ツ

This leaderboard evaluates chess skill and instruction following in an agentic setting: LLMs engage in multi-turn dialogs where they are presented with a choice of actions (e.g., "get board" or "make move") when playing against an opponent (Random Player or Chess Engine).

In 2024, we began with a chaos monkey baseline: a Random Player that chooses legal moves at random. At the time, most models could barely compete. They either lost because they failed to follow game instructions (hallucinating illegal moves or taking incorrect actions) or dragged the game to the 200-move limit because they couldn't win.

In 2025, more capable reasoning models nailed both instruction following and chess skill. We've added the Komodo Dragon Chess Engine as a more capable opponent, which is also Elo-rated on chess.com. This allowed us to anchor the results to a real-world rating scale and compute an Elo rating for each model.

METRICS:

- Player: Model name (playing as Black). Models that also played vs Dragon are marked with an asterisk in superscript (e.g., 3*).

- Elo: Estimated Elo anchored by Dragon skill levels and calibrated Random. We solve a 1D MLE over aggregated blocks (opponent Elo, wins, draws, losses) and report ±95% CI. When both Random and Dragon data exist, they are combined. Elo is left empty when a model has an extreme 100% win or 100% loss record, or has no anchored games.

- Game Duration: Share of maximum game length completed (0-100%); measures instruction-following stability across many moves. 100% means no games were interrupted by the model hallucinating moves or actions. 50% means that, on average, the model broke the game loop mid-game (making an average of 100 moves out of the 200 allowed).

- Tokens: Completion tokens per move; verbosity/efficiency signal.

- Cost/Elo (main): Estimated cost per 1000 Elo points (Cost/Game divided by Elo, then scaled by 1000). Lower is more cost-efficient.

- Cost/Game (extended): Estimated cost per game based on token usage and model pricing.
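
The Elo bullet above describes a 1D maximum-likelihood fit over aggregated blocks. The following is a minimal sketch of what such a fit could look like, not the leaderboard's actual code. It assumes the standard logistic Elo curve, counts a draw as half a win, and uses a Wald interval from the Fisher information for the 95% CI; the function names, the search bracket, and the CI method are all assumptions made for illustration.

```python
import math

LN10_OVER_400 = math.log(10) / 400.0


def expected_score(player_elo, opponent_elo):
    """Standard logistic Elo expected score against a single opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent_elo - player_elo) / 400.0))


def estimate_elo(blocks, lo=-2000.0, hi=5000.0, tol=1e-6):
    """Return (elo, ci95_half_width), or None when no finite MLE exists.

    blocks: iterable of (opponent_elo, wins, draws, losses).
    Assumption: a draw counts as half a win (binomial approximation).
    """
    blocks = list(blocks)
    games = sum(w + d + l for _, w, d, l in blocks)
    score = sum(w + 0.5 * d for _, w, d, _ in blocks)
    # No games, or a 100% win / 100% loss record: the likelihood has no
    # finite maximum, which matches the "empty Elo" case in the text.
    if games == 0 or score == 0 or score == games:
        return None

    def gradient(elo):
        # Derivative of the log-likelihood, up to a positive constant:
        # observed score minus expected score, summed over blocks.
        return sum(
            (w + 0.5 * d) - (w + d + l) * expected_score(elo, opp)
            for opp, w, d, l in blocks
        )

    # The gradient decreases monotonically in elo, so bisection finds its
    # root (assumed to lie inside the [lo, hi] bracket).
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gradient(mid) > 0:
            lo = mid
        else:
            hi = mid
    elo = (lo + hi) / 2.0

    # Fisher information at the estimate gives an approximate standard error.
    info = LN10_OVER_400 ** 2 * sum(
        (w + d + l) * p * (1.0 - p)
        for opp, w, d, l in blocks
        for p in [expected_score(elo, opp)]
    )
    return elo, 1.96 / math.sqrt(info)
```

For example, a single block of 15 wins and 5 losses against a 1000-rated opponent yields an estimate of about 1191, since a 75% score corresponds to a gap of 400·log10(3) Elo points. Blocks from Random and Dragon games can be passed together in one list, which is one way to read "they are combined".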

ARRANGEMENT & SOURCES:

- Primary sorting: Elo (DESC), then Game Duration (DESC), Tokens (ASC).

- Data sources mix Random-vs-LLM and Dragon-vs-LLM games. Dragon levels map to Elo and provide the anchor; Random is first calibrated vs Dragon and then used as an opponent for many models.

- Elo ratings are not comparable across player pools; for example, you cannot compare chess.com Elo to FIDE Elo.

- Chess.com references used for context (as of Sep 2025): Rapid Leaderboard (Elo pool), Magnus Carlsen stats, and Elo explanation & player classes.
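
The primary sorting rule above can be expressed as a single composite sort key. This is a sketch with made-up rows and hypothetical field names (`player`, `elo`, `game_duration`, `tokens`); placing models with an empty Elo at the bottom is an assumption, since the page does not say where they go.

```python
# Hypothetical rows; the values are invented purely for illustration.
rows = [
    {"player": "model-a", "elo": 1200.0, "game_duration": 0.9, "tokens": 500.0},
    {"player": "model-b", "elo": 1200.0, "game_duration": 1.0, "tokens": 900.0},
    {"player": "model-c", "elo": None, "game_duration": 1.0, "tokens": 100.0},
    {"player": "model-d", "elo": 1500.0, "game_duration": 0.5, "tokens": 3000.0},
    {"player": "model-e", "elo": 1200.0, "game_duration": 1.0, "tokens": 400.0},
]


def sort_key(row):
    """Elo descending, then Game Duration descending, then Tokens ascending."""
    elo = row["elo"]
    return (
        elo is None,              # assumption: rows without an Elo sort last
        -(elo if elo is not None else 0.0),
        -row["game_duration"],
        row["tokens"],
    )


ranked = sorted(rows, key=sort_key)
ranked_names = [row["player"] for row in ranked]
```

Negating the numeric fields keeps the whole rule in one ascending sort, so ties on Elo fall through to Game Duration and then to Tokens.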

MATRIX VISUALIZATION:

This plot shows LLM chess players based on two key metrics:

- X-Axis: Game Duration (0-100%) - Shows how well models maintain correct communication protocols throughout the game. Higher values indicate better instruction following ability.

- Y-Axis: Win Rate (0-100%) - This metric is less strict than the Win/Loss (Non-Interrupted) metric used in the leaderboard, as it ignores technical losses due to poor instruction following. Higher values indicate better chess strategy and decision making.

INTERPRETATION:

- Top-Right: Models with both excellent chess skill and instruction following.

- Top-Left: Models with good chess skill that struggle to maintain the communication protocol.

- Bottom-Right: Models that follow instructions well but make poor chess moves.

- Bottom-Left: Models that struggle with both chess strategy and following instructions.
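
The four quadrants above can be read as a simple two-threshold classification. The sketch below assumes the dividing lines sit at 50% on both axes; the page does not state where the boundaries are, so the `threshold` parameter and the function name are illustrative only.

```python
def quadrant(game_duration_pct, win_rate_pct, threshold=50.0):
    """Map a model's (Game Duration, Win Rate) point to a quadrant label.

    Assumption: both axes are split at `threshold` percent.
    """
    vertical = "Top" if win_rate_pct >= threshold else "Bottom"
    horizontal = "Right" if game_duration_pct >= threshold else "Left"
    return f"{vertical}-{horizontal}"
```

Under this reading, a model that completes every game but rarely wins lands in Bottom-Right, while one that wins often but frequently breaks the protocol lands in Top-Left.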

Libraries and Dependencies Used:

- chess: A Python library for handling chess game rules and basic operations, including board representation, legal move evaluation, and game state evaluation. This is not a chess engine running the actual calculation of the best move.

- AG2 (aka Autogen): Used as the backbone for LLM communication. It also implements the interaction between a Chess Board and custom agents such as GameAgent, RandomPlayerAgent, and AutoReplyAgent, among others.
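
To illustrate what the chess library contributes, here is a minimal sketch of a random-move game with a move cap and a material count, written directly against python-chess without AG2. It is not the project's game loop: both sides move randomly here (the second random player stands in for the LLM), the cap counts every pushed move, and the 1/3/3/5/9 piece values are the conventional ones rather than values confirmed by the leaderboard.

```python
import random

import chess

# Conventional piece values (an assumption; kings are not counted).
PIECE_VALUES = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
}


def material(board, color):
    """Sum the piece values of one side on the given board."""
    return sum(
        value * len(board.pieces(piece_type, color))
        for piece_type, value in PIECE_VALUES.items()
    )


def random_move(board, rng):
    """Pick uniformly among the legal moves, as a Random Player would."""
    return rng.choice(list(board.legal_moves))


def play_random_game(max_moves=200, seed=0):
    """Play random moves until the game ends or the move cap is reached."""
    rng = random.Random(seed)
    board = chess.Board()
    moves_made = 0
    while not board.is_game_over() and moves_made < max_moves:
        board.push(random_move(board, rng))
        moves_made += 1
    return board, moves_made


board, moves_made = play_random_game()
white_material = material(board, chess.WHITE)
black_material = material(board, chess.BLACK)
```

The library only generates legal moves and tracks game state; choosing a good move is left to whoever is playing, which is why a random chooser is enough to serve as the baseline opponent.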
