LLM Chess – Leaderboard


Simulating chess games between a Random Player and an LLM. Evaluating Chat Models' (1) chess proficiency and (2) instruction-following abilities.


[Game replay: Random Player (White) vs. GPT-4o Mini (Black). Game over. Outcome: Draw; max moves reached: 200; material White: 16; material Black: 18.]

Can Large Language Models play chess? Let's find out ツ

This leaderboard evaluates chess skill and instruction following in an agentic setting: LLMs engage in multi-turn dialogs where they are presented with a choice of actions (e.g., "get board" or "make move") when playing against an opponent (Random Player or Chess Engine).
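To make the action protocol concrete, here is a minimal sketch of how a reply could be checked against the allowed actions. The exact action strings, the UCI move format, and the names `parse_action` and `count_wrong_actions` are assumptions for illustration; the leaderboard's real prompts and parser may differ.

```python
import re

# Assumption: moves are given in UCI notation, e.g. "e2e4" or "e7e8q".
UCI_MOVE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")


def parse_action(reply):
    """Turn a model reply into (action, argument) or raise ValueError."""
    text = reply.strip().lower()
    if text == "get board":
        return ("get board", None)
    if text.startswith("make move"):
        argument = text[len("make move"):].strip()
        if UCI_MOVE.match(argument):
            return ("make move", argument)
        raise ValueError(f"malformed move: {argument!r}")
    raise ValueError(f"unknown action: {reply!r}")


def count_wrong_actions(replies):
    """Count replies that do not follow the protocol (hypothetical helper)."""
    wrong = 0
    for reply in replies:
        try:
            parse_action(reply)
        except ValueError:
            wrong += 1
    return wrong
```

A reply such as `"make move e2e4"` parses cleanly, while free-form text like `"I think the knight looks good"` is counted as a wrong action. Move legality would still need to be checked against the board afterwards.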

In 2024, we began with a chaos monkey baseline: a Random Player that chooses legal moves at random. At the time, most models could barely compete and lost either due to an inability to follow game instructions (i.e., hallucinating illegal moves or taking incorrect actions) or by dragging the game to the 200-move limit because they couldn't win.

In 2025, more capable reasoning models nailed both instruction following and chess skill. We've added the Komodo Dragon Chess Engine as a more capable opponent, which is also Elo-rated on chess.com. This allowed us to anchor the results to a real-world rating scale and compute an Elo rating for each model.


METRICS:

- Player: Model name (playing as Black). Models that also played vs Dragon are marked with an asterisk in superscript (e.g., 3*).

- Elo: Estimated Elo anchored by Dragon skill levels and calibrated Random. We solve a 1D MLE over aggregated blocks (opponent Elo, wins, draws, losses) and report ±95% CI. When both Random and Dragon data exist, they are combined. Elo is left empty for extreme results (100% wins or 100% losses) or when there are no anchored games.

- Game Duration: Share of maximum game length completed (0-100%); measures instruction-following stability across many moves. 100% means no games were interrupted due to the model hallucinating moves or actions. 50% means that on average the model broke the game loop mid-game (making an average of 100 moves out of the maximum 200 allowed).

- Tokens: Completion tokens per move; verbosity/efficiency signal.

- Cost/Elo (main): Estimated cost per 1000 Elo points (Cost/Game divided by Elo, then scaled by 1000). Lower is more cost-efficient.

- Cost/Game (extended): Estimated cost per game based on token usage and model pricing.
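The Elo and Cost/Elo definitions above can be sketched as follows. This is an illustration, not the leaderboard's actual code: it assumes the standard logistic Elo expected-score curve, counts a draw as half a win plus half a loss, solves the 1D MLE by bisection, and derives the 95% interval from the Fisher information. The function names are hypothetical.

```python
import math

LN10_OVER_400 = math.log(10) / 400.0


def expected_score(elo, opponent_elo):
    """Standard logistic Elo expectation for the player rated `elo`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_elo - elo) / 400.0))


def estimate_elo(blocks, low=-2000.0, high=5000.0, tolerance=1e-6):
    """blocks: iterable of (opponent_elo, wins, draws, losses).

    Returns (elo, ci95_half_width), or None when no finite MLE exists
    (no games, or a 100% win or 100% loss record).
    """
    blocks = [b for b in blocks if b[1] + b[2] + b[3] > 0]
    games = sum(w + d + l for _, w, d, l in blocks)
    score = sum(w + 0.5 * d for _, w, d, l in blocks)
    if games == 0 or score == 0 or score == games:
        return None

    def gradient(elo):
        # Derivative of the log-likelihood, up to a positive constant:
        # observed score minus expected score, summed over blocks.
        return sum(
            (w + 0.5 * d) - (w + d + l) * expected_score(elo, opp)
            for opp, w, d, l in blocks
        )

    # The gradient decreases monotonically in elo, so bisection finds its root.
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if gradient(middle) > 0:
            low = middle
        else:
            high = middle
    elo = (low + high) / 2.0

    information = LN10_OVER_400 ** 2 * sum(
        (w + d + l) * expected_score(elo, opp) * (1.0 - expected_score(elo, opp))
        for opp, w, d, l in blocks
    )
    return elo, 1.96 / math.sqrt(information)


def cost_per_1000_elo(cost_per_game, elo):
    """Cost/Elo as described: Cost/Game divided by Elo, scaled by 1000."""
    return cost_per_game / elo * 1000.0
```

For example, a model that scores 3 wins and 1 loss against a single opponent rated 1000 has an expected score of 0.75 at the optimum, which puts its estimate at 1000 + 400·log10(3), roughly 1191. Combining Random and Dragon data then amounts to passing both sets of blocks in one list.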

ARRANGEMENT & SOURCES:

- Primary sorting: Elo (DESC), then Game Duration (DESC), Tokens (ASC).

- Data sources mix Random-vs-LLM and Dragon-vs-LLM games. Dragon levels map to Elo and provide the anchor; Random is first calibrated vs Dragon and then used as an opponent for many models.

- Elo ratings are not comparable across player pools, i.e., you cannot compare chess.com Elo to FIDE Elo.

- Chess.com references used for context (as of Sep 2025): Rapid Leaderboard (Elo pool), Magnus Carlsen stats, and Elo explanation & player classes.
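The primary sorting rule can be expressed as a single composite key. The rows below are made-up placeholder data, and placing models with an empty Elo at the bottom is an assumption; the source does not say where they are ranked.

```python
# Hypothetical rows; the values are illustrative, not leaderboard results.
rows = [
    {"player": "model-a", "elo": 1200.0, "game_duration": 100.0, "tokens": 300.0},
    {"player": "model-b", "elo": 1200.0, "game_duration": 100.0, "tokens": 150.0},
    {"player": "model-c", "elo": None, "game_duration": 100.0, "tokens": 50.0},
    {"player": "model-d", "elo": 1500.0, "game_duration": 80.0, "tokens": 900.0},
    {"player": "model-e", "elo": 1200.0, "game_duration": 90.0, "tokens": 100.0},
]


def sort_key(row):
    """Elo (DESC), then Game Duration (DESC), then Tokens (ASC)."""
    elo = row["elo"]
    # Assumption: rows without an Elo estimate sort after all rated rows.
    return (elo is None, -(elo or 0.0), -row["game_duration"], row["tokens"])


ranked = sorted(rows, key=sort_key)
ranking = [row["player"] for row in ranked]
```

Negating the descending fields keeps everything in one ascending tuple comparison, so ties on Elo fall through to Game Duration and then to Tokens.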

MATRIX VISUALIZATION:

This plot shows LLM chess players based on two key metrics:

- X-Axis: Game Duration (0-100%) - Shows how well models maintain correct communication protocols throughout the game. Higher values indicate better instruction-following ability.

- Y-Axis: Win Rate (0-100%) - The metric is less strict than Win/Loss (Non-Interrupted) used in the leaderboard, as it ignores technical losses due to poor instruction following. Higher values indicate better chess strategy and decision making.

INTERPRETATION:

- Top-Right: Models with both excellent chess skill and instruction following.

- Top-Left: Models with good chess skill that struggle to maintain the communication protocol.

- Bottom-Right: Models that follow instructions well but make poor chess moves.

- Bottom-Left: Models that struggle with both chess strategy and following instructions.
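The quadrant reading can be sketched in a few lines. Two assumptions are made here that the source does not state: the boundary between quadrants sits at 50% on both axes, and a game that ends normally (for example by checkmate) counts as fully completed when computing Game Duration. Both function names are hypothetical.

```python
def game_duration_pct(games, max_moves=200):
    """games: list of (moves_made, interrupted) tuples.

    Assumption: only interrupted games reduce the score; a game that
    finished normally counts as 100% regardless of its length.
    """
    if not games:
        raise ValueError("no games to aggregate")
    shares = [
        min(moves, max_moves) / max_moves if interrupted else 1.0
        for moves, interrupted in games
    ]
    return 100.0 * sum(shares) / len(shares)


def quadrant(game_duration, win_rate, threshold=50.0):
    """Map the two plot coordinates to a quadrant label."""
    vertical = "Top" if win_rate >= threshold else "Bottom"
    horizontal = "Right" if game_duration >= threshold else "Left"
    return f"{vertical}-{horizontal}"
```

Under these assumptions, a model whose games all break at move 100 scores a Game Duration of 50%, matching the example given in the metric definitions.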

Libraries and Dependencies Used:

- chess: A Python library for handling chess game rules and basic operations, including board representation, legal move evaluation, and game state evaluation. This is not a chess engine running the actual calculation of the best move.

- AG2 (aka Autogen) is used as a backbone for LLM communication. It also implements the interaction between a Chess Board and custom agents like GameAgent, RandomPlayerAgent, AutoReplyAgent, and others...
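As an illustration of what the chess library provides, the sketch below plays a game with random legal moves on both sides and reports the kind of summary shown in the replay. It leaves out AG2 and the LLM entirely. The piece values (pawn 1, knight 3, bishop 3, rook 5, queen 9), counting each individual move toward the limit, and the function names are assumptions, not details confirmed by the source.

```python
import random

import chess  # the python-chess package

# Assumption: conventional material values; the king is not counted.
PIECE_VALUES = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
}


def material(board, color):
    """Sum the material of one side on the given board."""
    return sum(
        PIECE_VALUES.get(piece.piece_type, 0)
        for piece in board.piece_map().values()
        if piece.color == color
    )


def play_random_game(max_moves=200, seed=0):
    """Random vs random stand-in for the Random Player vs LLM loop."""
    rng = random.Random(seed)
    board = chess.Board()
    moves_made = 0
    while not board.is_game_over() and moves_made < max_moves:
        move = rng.choice(list(board.legal_moves))
        board.push(move)
        moves_made += 1
    return {
        "moves": moves_made,
        "result": board.result() if board.is_game_over() else "max moves reached",
        "material_white": material(board, chess.WHITE),
        "material_black": material(board, chess.BLACK),
    }
```

The library only enforces the rules and tracks game state; choosing a move is left to the caller, which is where a random picker, an engine, or an LLM agent plugs in.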
