LLM Chess Leaderboard
Simulating chess games between a Random Player and an LLM. Evaluating Chat Models' (1) chess proficiency and (2) instruction-following abilities
Sample game: Random Player (White) vs. GPT-4o Mini (Black)
GAME OVER
- Outcome: Draw
- Max moves reached: 200
- Material White: 16
- Material Black: 18
Can Large Language Models play chess? Let's find out ツ
This leaderboard evaluates chess skill and instruction following in an agentic setting: LLMs engage in multi-turn dialogs where they are presented with a choice of actions (e.g., "get board" or "make move") when playing against an opponent (Random Player or Chess Engine).
In 2024, we began with a chaos monkey baseline: a Random Player that chooses legal moves at random. At the time, most models could barely compete and lost either due to an inability to follow game instructions (i.e., hallucinating illegal moves or taking incorrect actions) or by dragging the game to the 200-move limit because they couldn't win.
In 2025, more capable reasoning models nailed both instruction following and chess skill. We've added the Komodo Dragon Chess Engine as a more capable opponent, which is also Elo-rated on chess.com. This allowed us to anchor the results to a real-world rating scale and compute an Elo rating for each model.
METRICS:
- Player: Model name (playing as Black). Models that also played vs Dragon are marked with an asterisk in superscript (e.g., 3*).
- Elo: Estimated Elo anchored by Dragon skill levels and calibrated Random. We solve a 1D MLE over aggregated blocks (opponent Elo, wins, draws, losses) and report ±95% CI. When both Random and Dragon data exist, they are combined. Empty Elo appears for extreme 100% win/loss or no anchored games.
- Game Duration: Share of maximum game length completed (0-100%); measures instruction-following stability across many moves. 100% means no games were interrupted by the model hallucinating moves or actions. 50% means that, on average, the model broke the game loop mid-game (averaging 100 moves out of the 200 allowed).
- Tokens: Completion tokens per move; verbosity/efficiency signal.
- Cost/Elo (main): Estimated cost per 1000 Elo points (Cost/Game divided by Elo, then scaled by 1000). Lower is more cost-efficient.
- Cost/Game (extended): Estimated cost per game based on token usage and model pricing.
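The metric definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the leaderboard's actual code: it assumes the standard logistic Elo curve, counts a draw as half a win and half a loss, finds the maximum by bisection, and derives the ±95% interval from the Fisher information (a normal approximation). The function names, the search bounds, and the example numbers are hypothetical.

```python
import math

K = math.log(10) / 400  # slope constant of the logistic Elo curve


def expected_score(rating, opponent_elo):
    """Standard Elo expected score of `rating` against `opponent_elo`."""
    return 1.0 / (1.0 + 10 ** ((opponent_elo - rating) / 400))


def estimate_elo(blocks, lo=-2000.0, hi=5000.0, iterations=100):
    """1D MLE over blocks of (opponent_elo, wins, draws, losses).

    Returns (elo, ci95_half_width), or None when no finite estimate exists.
    Assumption: a draw counts as half a win and half a loss.
    """
    blocks = list(blocks)
    games = sum(w + d + l for _, w, d, l in blocks)
    score = sum(w + 0.5 * d for _, w, d, l in blocks)
    if games == 0 or score == 0 or score == games:
        return None  # no anchored games, or 100% wins / 100% losses

    def score_gap(rating):
        # Proportional to the log-likelihood gradient; decreasing in rating.
        return sum(
            w + 0.5 * d - (w + d + l) * expected_score(rating, opp)
            for opp, w, d, l in blocks
        )

    # Bisection on the root of the gradient (bounds are assumed wide enough).
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if score_gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    elo = (lo + hi) / 2

    # Fisher information of the rating under the logistic model.
    info = 0.0
    for opp, w, d, l in blocks:
        e = expected_score(elo, opp)
        info += (w + d + l) * e * (1 - e)
    info *= K ** 2
    return elo, 1.96 / math.sqrt(info)


def game_duration_pct(moves_per_game, max_moves=200):
    """Average share of the maximum game length completed, in percent."""
    return 100.0 * sum(moves_per_game) / (len(moves_per_game) * max_moves)


def cost_per_1000_elo(cost_per_game, elo):
    """Cost/Game divided by Elo, then scaled by 1000."""
    return cost_per_game / elo * 1000
```

For example, `estimate_elo([(1000, 5, 0, 5)])` places a model that scores 50% against a 1000-rated opponent at 1000 Elo, while `estimate_elo([(1000, 10, 0, 0)])` returns `None`, matching the empty Elo shown for 100% win records.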
ARRANGEMENT & SOURCES:
- Primary sorting: Elo (DESC), then Game Duration (DESC), Tokens (ASC).
- Data sources mix Random-vs-LLM and Dragon-vs-LLM games. Dragon levels map to Elo and provide the anchor; Random is first calibrated vs Dragon and then used as an opponent for many models.
- Elo ratings are not comparable across player pools, i.e., you cannot compare chess.com Elo to FIDE Elo.
- Chess.com references used for context (as of Sep 2025): Rapid Leaderboard (Elo pool), Magnus Carlsen stats, and Elo explanation & player classes.
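The sort order described above (Elo descending, then Game Duration descending, then Tokens ascending) can be expressed as a single sort key. The sketch below is illustrative only: the `Row` type, the model names, and the numbers are hypothetical, and placing rows with an empty Elo last is an assumption, since the page does not say where they go.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    player: str
    elo: Optional[float]   # None when Elo is empty
    game_duration: float   # 0-100 (%)
    tokens: float          # completion tokens per move


def sort_key(row):
    has_elo = row.elo is not None
    return (
        not has_elo,                      # assumption: empty Elo sorts last
        -(row.elo if has_elo else 0.0),   # Elo, descending
        -row.game_duration,               # Game Duration, descending
        row.tokens,                       # Tokens, ascending
    )


# Hypothetical rows, not real leaderboard data.
rows = [
    Row("model-a", 1200, 100.0, 500),
    Row("model-b", 1200, 100.0, 300),
    Row("model-c", None, 100.0, 50),
    Row("model-d", 1500, 80.0, 2000),
    Row("model-e", 1200, 90.0, 100),
]
ranked = [row.player for row in sorted(rows, key=sort_key)]
print(ranked)
```

Negating the numeric fields keeps everything in one ascending sort, which avoids chaining several stable sorts in reverse priority order.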
MATRIX VISUALIZATION:
This plot shows LLM chess players based on two key metrics:
- X-Axis: Game Duration (0-100%) - Shows how well models maintain correct communication protocols throughout the game. Higher values indicate better instruction-following ability.
- Y-Axis: Win Rate (0-100%) - The metric is less strict than Win/Loss (Non-Interrupted) used in the leaderboard, as it ignores technical losses due to poor instruction following. Higher values indicate better chess strategy and decision making.
INTERPRETATION:
- Top-Right: Models with both excellent chess skill and instruction following.
- Top-Left: Models with good chess skill that struggle to maintain the communication protocol.
- Bottom-Right: Models that follow instructions well but make poor chess moves.
- Bottom-Left: Models that struggle with both chess strategy and following instructions.
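The four regions above can be read as a simple classification over the two axes. The sketch below assumes the quadrants split at the 50% midpoint of each axis; the page does not state a boundary, so the `threshold` parameter and the function name are hypothetical.

```python
def quadrant(game_duration, win_rate, threshold=50.0):
    """Map (Game Duration %, Win Rate %) to a quadrant label.

    Assumption: both axes split at `threshold`; the page gives no exact cut.
    """
    vertical = "Top" if win_rate >= threshold else "Bottom"
    horizontal = "Right" if game_duration >= threshold else "Left"
    return f"{vertical}-{horizontal}"
```

A model with 95% Game Duration and 80% Win Rate lands in "Top-Right", while one with 20% Game Duration and 70% Win Rate lands in "Top-Left".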
Libraries and Dependencies Used:
- chess: A Python library for handling chess game rules and basic operations, including board representation, legal move evaluation, and game state evaluation. It is not a chess engine and does not calculate the best move.
- AG2 (aka Autogen): Used as the backbone for LLM communication. It also implements the interaction between a Chess Board and custom agents like GameAgent, RandomPlayerAgent, AutoReplyAgent, and others...