Anthropic Opus 4.8 is new SOTA on ARC-AGI-3, Score: 1.5%, ~$10K

szatkus1 pts0 comments

ARC Prize

@arcprize

2h

Anthropic Opus 4.8 is new SOTA on ARC-AGI-3

Score: 1.5%, ~$10K

ARC-AGI-3 analysis notes:
* Opus 4.8 read the environment at a level of abstraction *above* Opus 4.7, as objects & systems, not pictures
* Opus 4.8 succeeded on early levels, but still committed to a wrong sub-goal

Jun 1, 2026 · 6:15 PM UTC

ARC Prize

@arcprize

2h

Analysis Note #1: Opus 4.8 discovered game mechanics more quickly than Opus 4.7

On ar25, Opus 4.8 derived the Level #1 reflection rule by frame 5 ("Blue moved LEFT 3, Orange moved RIGHT 3 ... mirror reflections about col 31")

It then proceeded to clear Level 1 in 24 actions

Opus 4.7 took 136 actions of probing to brute-force the same level and never verbalized the rule.

On dc22, Opus 4.8 hit "BREAKTHROUGH — the maze connects via toggles!" at frame 30 and cleared Levels 1-3

Opus 4.7 never identified the agent across 295 actions and 17 RESETs

arcprize.org/replay/22a25f67…

Gif: Opus 4.8 playing ar25
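
The ar25 reflection rule quoted above ("mirror reflections about col 31") can be illustrated with a few lines of arithmetic. This is only a sketch: the axis column comes from the quoted note, while the starting columns and the function names are hypothetical and not taken from the game or the replay.

```python
# Hypothetical illustration of a mirror rule about a vertical axis.
# AXIS_COL is from the quoted note; the starting columns are made up.
AXIS_COL = 31


def mirror(col: int, axis: int = AXIS_COL) -> int:
    """Reflect a column index about a vertical axis."""
    return 2 * axis - col


def step(blue_col: int, dx: int) -> tuple[int, int]:
    """Move blue by dx columns; orange is always blue's mirror image."""
    new_blue = blue_col + dx
    return new_blue, mirror(new_blue)


blue = 20
orange = mirror(blue)                    # 42
new_blue, new_orange = step(blue, -3)    # blue moves LEFT 3
print(new_blue - blue, new_orange - orange)  # -3 3
```

Under this assumption, any leftward move by one piece shows up as an equal rightward move by the other, which matches the "LEFT 3 / RIGHT 3" observation in the note.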

ARC Prize

@arcprize

2h

Analysis Note #2: Opus 4.8 won more early levels than Opus 4.7, but misattributed its wins

Opus 4.8 introduces a failure mode Opus 4.7 wasn't reaching. It succeeded on early levels, then committed to a wrong sub-goal on the next

On dc22, Opus 4.8 cleared Levels 1-3, then burned ~490 actions on Level 4 cycling through five mutually contradictory mechanic theories and runs of identical repeated clicks

Opus 4.7 never got far enough to display this shape

Gif: Opus 4.8 playing dc22

ARC Prize

@arcprize

2h

Opus 4.8's most performant Public Demo game

lp85

arcprize.org/replay/57ee8a2d…

ARC Prize

@arcprize

2h

Opus 4.8 wasn’t able to make any progress on tr87 despite this being one of the games most models make progress on

arcprize.org/replay/c3b77031…

ARC Prize

@arcprize

2h

Opus 4.8 ARC-AGI-3 Public Demo Scorecard
Score on Public Demo: 4.9%

Note: Public Demo was intentionally designed to be easier than ARC-AGI-3 Semi-Private, which explains the score delta (4.9% vs 1.5%)

arcprize.org/scorecards/mode…

ARC Prize

@arcprize

2h

Opus 4.8 ARC-AGI-2 Scores

- Opus 4.8 Low: 62.22% ($1.68/task)
- Opus 4.8 Medium: 71.67% ($2.39/task)
- Opus 4.8 High: 72.08% ($2.74/task)
- Opus 4.8 Max: N/A*

* Opus 4.8 Max was unable to complete the ARC-AGI-2 Semi-Private set due to API timeout errors. This score has been excluded
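
For readers comparing the settings, here is a small sketch of the marginal cost of each step up, using only the ARC-AGI-2 scores and per-task costs quoted above. The `marginal` helper and the "dollars per extra point" framing are illustrative assumptions, not a metric ARC Prize reports.

```python
# (score in %, cost in USD per task), as quoted in the post above.
arc_agi_2 = {
    "Low": (62.22, 1.68),
    "Medium": (71.67, 2.39),
    "High": (72.08, 2.74),
}


def marginal(prev: tuple[float, float], curr: tuple[float, float]):
    """Return (extra points, extra $/task, extra $/task per extra point)."""
    d_score = curr[0] - prev[0]
    d_cost = curr[1] - prev[1]
    return round(d_score, 2), round(d_cost, 2), round(d_cost / d_score, 3)


print(marginal(arc_agi_2["Low"], arc_agi_2["Medium"]))   # (9.45, 0.71, 0.075)
print(marginal(arc_agi_2["Medium"], arc_agi_2["High"]))  # (0.41, 0.35, 0.854)
```

On these numbers, Low to Medium buys about 9.5 points for $0.71 more per task, while Medium to High buys about 0.4 points for $0.35 more.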

ARC Prize

@arcprize

2h

Opus 4.8 ARC-AGI-1 Scores

- Opus 4.8 Low: 88.0% ($0.67/task)
- Opus 4.8 Medium: 91.5% ($0.91/task)
- Opus 4.8 High: 92.0% ($1.04/task)
- Opus 4.8 Max: 92.5% ($2.33/task)

ARC Prize

@arcprize

2h

Testing notes:

* ARC-AGI-3 Prompt Update - We observed that in our original system prompt, GPT and Gemini, unlike other models, would not "think out loud" in their reply. This caused them to *only* return an action in their response (ex: "ACTION1"). This capped the signal we were able to extract from the model. We updated the system prompt used for ARC-AGI-3 to *explicitly* say context will be carried forward instead of the original *implicit* nudge

See the exact change on the commit below

This will be the system prompt going forward. The previous 6 models already tested will not be re-tested due to costs (estimated at $40K in API costs)

github.com/arcprize/arc-agi-…

* We also noticed that Opus 4.8 used over $10K (our testing limit) in API costs for Semi-Private testing (55 games). Going forward we will be implementing a hard max per game for costs

Update default system prompt (#21) · arcprize/arc-agi-3-benchmarking@0138a6a

Reframe the prompt to explicitly invite carry-forward context alongside the action, and clarify that the final action mentioned will be executed.

github.com
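
The two testing notes above can be sketched in a few lines: take the last action mentioned in a free-form reply, and stop a game once its running cost hits a per-game cap. This is a hypothetical sketch, not the code in the arc-agi-3-benchmarking repository. The token pattern (`ACTION<n>` or `RESET`), the `GameBudget` class, and the cap value are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed token format, based on the "ACTION1" example and the RESETs
# mentioned in the thread; the real harness may parse replies differently.
ACTION_RE = re.compile(r"\bACTION\d+\b|\bRESET\b")


def final_action(reply: str) -> Optional[str]:
    """Return the last action token mentioned in a reply, or None."""
    matches = ACTION_RE.findall(reply)
    return matches[-1] if matches else None


class GameBudget:
    """Hypothetical per-game cost cap."""

    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> bool:
        """Record one call's cost; return True while the game may continue."""
        self.spent_usd += usd
        return self.spent_usd < self.max_usd


reply = "Blue mirrors orange, so ACTION3 looks right. On reflection, ACTION1 is safer."
print(final_action(reply))  # ACTION1

# Illustrative cap only: an even split of the $10K limit over 55 games.
budget = GameBudget(max_usd=10_000 / 55)
print(budget.charge(2.50))  # True
```

Taking the last match lets the model reason out loud and change its mind without the harness executing an action it has already discarded.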

ARC Prize

@arcprize

2h

- Leaderboard: arcprize.org/leaderboard
- Reproduce the results: github.com/arcprize/arc-agi-…, github.com/arcprize/arc-agi-…
- Testing policy: arcprize.org/policy
- ARC Prize Foundation is hiring: arcprize.org/jobs

Tre Smith

@TRESMITH322

18m

Replying to @arcprize

what's margin of error/ standard deviation? the numbers are so low can you even say if one is ahead of the others?

until someone gets up to 10% pretty much this benchmark is 0%

Daniel Abyan

@abyandaniel

40m

Replying to @arcprize

It's been less than two months since ARC-AGI-3 was created. The acceleration speed is insane. I think you should start working on ARC-AGI-4 now.

Cam Will

@_camronwilliams

1h

Replying to @arcprize

What must be true in order to reach the desired end state?

Find out on the next episode of dragonball z.

P.S

Dynamic...
