Anthropic Opus 4.8 is new SOTA on ARC-AGI-3, Score: 1.5%, ~$10K

ARC Prize (@arcprize): Anthropic Opus 4.8 is new SOTA on ARC-AGI-3

Score: 1.5%, ~$10K

ARC-AGI-3 analysis notes:

* Opus 4.8 read the environment an abstraction *above* Opus 4.7, as objects & systems, not pictures
* Opus 4.8 succeeded on early levels, but still committed to a wrong sub-goal


Jun 1, 2026 · 6:15 PM UTC


ARC Prize

@arcprize

Analysis Note #1: Opus 4.8 discovered game mechanics more quickly than Opus 4.7

On ar25, Opus 4.8 derived the Level #1 reflection rule by frame 5 ("Blue moved LEFT 3, Orange moved RIGHT 3 ... mirror reflections about col 31")

It then proceeded to clear Level 1 in 24 actions

Opus 4.7 took 136 actions of probing to brute-force the same level and never verbalized the rule.

On dc22, Opus 4.8 hit "BREAKTHROUGH — the maze connects via toggles!" at frame 30 and cleared Levels 1-3

Opus 4.7 never identified the agent across 295 actions and 17 RESETs

arcprize.org/replay/22a25f67…

Gif: Opus 4.8 playing ar25
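The reflection rule quoted above ("mirror reflections about col 31") can be illustrated with a few lines of arithmetic. This is only a sketch: the function name `reflect_col` and the starting columns are hypothetical, and it assumes the mirror axis sits exactly on column 31 rather than between two columns.

```python
def reflect_col(col: int, axis: int = 31) -> int:
    """Mirror a column index about a vertical axis: c' = 2 * axis - c."""
    return 2 * axis - col


# Hypothetical positions: Blue starts at column 20 and moves LEFT 3.
blue_before, blue_after = 20, 17

# Under the rule, Orange is always Blue's mirror image about column 31.
orange_before = reflect_col(blue_before)
orange_after = reflect_col(blue_after)

# A 3-column move LEFT for Blue shows up as a 3-column move RIGHT for Orange.
assert blue_before - blue_after == 3
assert orange_after - orange_before == 3
```

Reflecting twice returns the original column, which is why equal and opposite displacements are the signature of this kind of rule.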


ARC Prize

@arcprize

Analysis Note #2: Opus 4.8 won more early levels than Opus 4.7, but misattributed its wins

Opus 4.8 introduces a failure mode Opus 4.7 wasn't reaching. It succeeded on early levels, then committed to a wrong sub-goal on the next

On dc22, Opus 4.8 cleared Levels 1-3, then burned ~490 actions on Level 4 cycling through five mutually contradictory mechanic theories and runs of identical repeated clicks

Opus 4.7 never got far enough to display this shape

Gif: Opus 4.8 playing dc22


ARC Prize

@arcprize

Opus 4.8's most performant Public Demo game

lp85

arcprize.org/replay/57ee8a2d…


ARC Prize

@arcprize

Opus 4.8 wasn’t able to make any progress on tr87 despite this being one of the games most models make progress on

arcprize.org/replay/c3b77031…


ARC Prize

@arcprize

Opus 4.8 ARC-AGI-3 Public Demo Scorecard

Score on Public Demo: 4.9%

Note: Public Demo was intentionally designed to be easier than ARC-AGI-3 Semi-Private, which explains the score delta (4.9% vs 1.5%)

arcprize.org/scorecards/mode…


ARC Prize

@arcprize

Opus 4.8 ARC-AGI-2 Scores

- Opus 4.8 Low: 62.22% ($1.68/task)
- Opus 4.8 Medium: 71.67% ($2.39/task)
- Opus 4.8 High: 72.08% ($2.74/task)
- Opus 4.8 Max: N/A*

* Opus 4.8 Max was unable to complete the ARC-AGI-2 Semi-Private set due to API timeout errors. This score has been excluded
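One way to read the ARC-AGI-2 numbers above is as marginal cost per extra percentage point between adjacent effort tiers. The sketch below uses only the scores and per-task costs listed in the thread; the helper name `marginal_cost_per_point` is hypothetical and this is not a metric ARC Prize reports.

```python
# (tier, score in %, cost in USD per task), copied from the list above.
arc_agi_2 = [
    ("Low", 62.22, 1.68),
    ("Medium", 71.67, 2.39),
    ("High", 72.08, 2.74),
]


def marginal_cost_per_point(tiers):
    """Extra USD per task paid for each extra percentage point, tier to tier."""
    results = {}
    for (name0, score0, cost0), (name1, score1, cost1) in zip(tiers, tiers[1:]):
        results[f"{name0}->{name1}"] = (cost1 - cost0) / (score1 - score0)
    return results


for step, usd_per_point in marginal_cost_per_point(arc_agi_2).items():
    print(f"{step}: ${usd_per_point:.2f} per task per extra point")
```

On these figures, going from Low to Medium costs roughly $0.08 per task for each extra point, while going from Medium to High costs about $0.85 per task for each extra point.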


ARC Prize

@arcprize

Opus 4.8 ARC-AGI-1 Scores

- Opus 4.8 Low: 88.0% ($0.67/task)
- Opus 4.8 Medium: 91.5% ($0.91/task)
- Opus 4.8 High: 92.0% ($1.04/task)
- Opus 4.8 Max: 92.5% ($2.33/task)


ARC Prize

@arcprize

Testing notes:

* ARC-AGI-3 Prompt Update - We observed that, with our original system prompt, GPT and Gemini, unlike other models, would not "think out loud" in their replies. They returned *only* an action in their response (e.g., "ACTION1"), which capped the signal we were able to extract from the model. We updated the system prompt used for ARC-AGI-3 to *explicitly* say that context will be carried forward, replacing the original *implicit* nudge

See the exact change in the commit below

This will be the system prompt going forward. The previous 6 models already tested will not be re-tested due to costs (estimated at $40K in API costs)

github.com/arcprize/arc-agi-…

* We also noticed that Opus 4.8 used over $10K (our testing limit) in API costs for Semi-Private testing (55 games). Going forward we will be implementing a hard max per game for costs
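For a sense of scale, $10K spread over 55 games is about $182 per game. The thread does not say how the per-game hard max will be set or enforced, so the following is only a sketch under assumptions: the `GameBudget` class is hypothetical, and the even split is just one possible way to pick a cap.

```python
from dataclasses import dataclass


@dataclass
class GameBudget:
    """Hypothetical per-game spending cap, tracked in USD."""
    max_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record one API call's cost; return False once the cap is reached."""
        self.spent_usd += cost_usd
        return self.spent_usd < self.max_usd


# Assumption: split the $10K testing limit evenly across the 55 games.
per_game_cap = 10_000 / 55

budget = GameBudget(max_usd=per_game_cap)
keep_playing = budget.charge(1.25)  # hypothetical cost of a single model call
```

A harness loop would stop the current game as soon as `charge` returns `False`, so one runaway game could not consume the budget of the others.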

Update default system prompt (#21) · arcprize/arc-agi-3-benchmarking@0138a6a

Reframe the prompt to explicitly invite carry-forward context alongside the action, and clarify that the final action mentioned will be executed.
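The commit description says the final action mentioned in a reply is the one executed. A minimal sketch of that idea, assuming actions appear as tokens like "ACTION1" (the only format shown in the thread); the regex and the `final_action` helper are hypothetical and are not taken from the benchmarking repo.

```python
import re
from typing import Optional

# Assumption: action tokens look like ACTION<number>, as in the "ACTION1" example.
ACTION_RE = re.compile(r"\bACTION(\d+)\b")


def final_action(reply: str) -> Optional[str]:
    """Return the last action token mentioned in a reply, or None if absent."""
    matches = ACTION_RE.findall(reply)
    return f"ACTION{matches[-1]}" if matches else None


reply = "ACTION1 moved Blue left last turn, so the mirror holds. Next: ACTION3"
chosen = final_action(reply)
```

Taking the last match lets a model reason out loud, including about earlier actions, without the harness mistaking that reasoning for its choice.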


ARC Prize

@arcprize

- Leaderboard: arcprize.org/leaderboard
- Reproduce the results: github.com/arcprize/arc-agi-… , github.com/arcprize/arc-agi-…
- Testing policy: arcprize.org/policy
- ARC Prize Foundation is hiring: arcprize.org/jobs


Tre Smith

@TRESMITH322

18m

Replying to @arcprize

what's margin of error/ standard deviation? the numbers are so low can you even say if one is ahead of the others?

until someone gets up to 10% pretty much this benchmark is 0%

Daniel Abyan

@abyandaniel

40m

Replying to @arcprize

It's been less than two months since ARC-AGI-3 was created. The acceleration speed is insane. I think you should start working on ARC-AGI-4 now.


Cam Will

@_camronwilliams

Replying to @arcprize

What must be true in order to reach the desired end state?

Find out on the next episode of dragonball z.

P.S

Dynamic...
