ARC Prize
@arcprize
2h
Anthropic Opus 4.8 is new SOTA on ARC-AGI-3
Score: 1.5%, ~$10K
ARC-AGI-3 analysis notes:
* Opus 4.8 read the environment an abstraction *above* Opus 4.7, as objects & systems, not pictures
* Opus 4.8 succeeded on early levels, but still committed to a wrong sub-goal
Jun 1, 2026 · 6:15 PM UTC
ARC Prize
@arcprize
2h
Analysis Note #1: Opus 4.8 discovered game mechanics more quickly than Opus 4.7
On ar25, Opus 4.8 derived the Level #1 reflection rule by frame 5 ("Blue moved LEFT 3, Orange moved RIGHT 3 ... mirror reflections about col 31")
It then proceeded to clear Level 1 in 24 actions
Opus 4.7 took 136 actions of probing to brute-force the same level and never verbalized the rule.
On dc22, Opus 4.8 hit "BREAKTHROUGH — the maze connects via toggles!" at frame 30 and cleared Levels 1-3
Opus 4.7 never identified the agent across 295 actions and 17 RESETs
arcprize.org/replay/22a25f67…
Gif: Opus 4.8 playing ar25
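The reflection rule quoted above for ar25 ("mirror reflections about col 31") can be illustrated with a few lines of Python. This is only a sketch of the geometry: the axis column 31 and the 3-cell moves come from the quoted model output, while the starting column and the function name `mirror_col` are hypothetical and not taken from the game or the ARC-AGI-3 codebase.

```python
# Hypothetical sketch of a mirror reflection about a vertical axis.
# AXIS_COL = 31 is from the quoted model output; the rest is assumed.
AXIS_COL = 31


def mirror_col(col: int, axis: int = AXIS_COL) -> int:
    """Reflect a column index across the vertical line at `axis`."""
    return 2 * axis - col


# Assumed starting column for Blue, chosen only for illustration.
blue_start = 20
orange_start = mirror_col(blue_start)

# Blue moves LEFT 3, Orange moves RIGHT 3, as in the quoted observation.
blue_end = blue_start - 3
orange_end = orange_start + 3

# Under the reflection rule the two pieces stay mirror images of each other.
assert orange_end == mirror_col(blue_end)
```

Under this reading, a move of `dx` columns on one side of the axis always shows up as `-dx` on the other, which is consistent with "Blue moved LEFT 3, Orange moved RIGHT 3".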
ARC Prize
@arcprize
2h
Analysis Note #2: Opus 4.8 won more early levels than Opus 4.7, but misattributed its wins
Opus 4.8 introduces a failure mode Opus 4.7 wasn't reaching. It succeeded on early levels, then committed to a wrong sub-goal on the next
On dc22, Opus 4.8 cleared Levels 1-3, then burned ~490 actions on Level 4 cycling through five mutually contradictory mechanic theories and runs of identical repeated clicks
Opus 4.7 never got far enough to display this shape
Gif: Opus 4.8 playing dc22
ARC Prize
@arcprize
2h
Opus 4.8's best-performing Public Demo game: lp85
arcprize.org/replay/57ee8a2d…
ARC Prize
@arcprize
2h
Opus 4.8 wasn’t able to make any progress on tr87 despite this being one of the games most models make progress on
arcprize.org/replay/c3b77031…
ARC Prize
@arcprize
2h
Opus 4.8 ARC-AGI-3 Public Demo Scorecard
Score on Public Demo: 4.9%
Note: Public Demo was intentionally designed to be easier than ARC-AGI-3 Semi-Private, which explains the score delta (4.9% vs 1.4%)
arcprize.org/scorecards/mode…
ARC Prize
@arcprize
2h
Opus 4.8 ARC-AGI-2 Scores
- Opus 4.8 Low: 62.22% ($1.68/task)
- Opus 4.8 Medium: 71.67% ($2.39/task)
- Opus 4.8 High: 72.08% ($2.74/task)
- Opus 4.8 Max: N/A*
* Opus 4.8 Max was unable to complete the ARC-AGI-2 Semi-Private set due to API timeout errors, so its score has been excluded
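One way to read the ARC-AGI-2 numbers above is as marginal cost per additional percentage point when moving up a reasoning tier. The sketch below only does arithmetic on the posted figures; the dictionary layout and the helper name `marginal_cost_per_point` are illustrative assumptions, not part of any ARC Prize tooling.

```python
# (score in %, cost per task in USD) as posted for ARC-AGI-2.
# Max is omitted because no score was reported for it.
arc_agi_2 = {
    "Low": (62.22, 1.68),
    "Medium": (71.67, 2.39),
    "High": (72.08, 2.74),
}


def marginal_cost_per_point(lower, higher):
    """Extra USD per task paid for each extra percentage point of score."""
    (score_lo, cost_lo), (score_hi, cost_hi) = lower, higher
    return (cost_hi - cost_lo) / (score_hi - score_lo)


low_to_medium = marginal_cost_per_point(arc_agi_2["Low"], arc_agi_2["Medium"])
medium_to_high = marginal_cost_per_point(arc_agi_2["Medium"], arc_agi_2["High"])

print(f"Low -> Medium: ${low_to_medium:.2f} per point")
print(f"Medium -> High: ${medium_to_high:.2f} per point")
```

On these figures, Low to Medium costs roughly $0.08 per task for each extra point, while Medium to High costs roughly $0.85 per task for each extra point, so most of the gain arrives at the Medium setting.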
ARC Prize
@arcprize
2h
Opus 4.8 ARC-AGI-1 Scores
- Opus 4.8 Low: 88.0% ($0.67/task)
- Opus 4.8 Medium: 91.5% ($0.91/task)
- Opus 4.8 High: 92.0% ($1.04/task)
- Opus 4.8 Max: 92.5% ($2.33/task)
ARC Prize
@arcprize
2h
Testing notes:
* ARC-AGI-3 Prompt Update - We observed that with our original system prompt, GPT and Gemini, unlike other models, would not "think out loud" in their reply. They returned *only* an action in their response (e.g., "ACTION1"), which capped the signal we were able to extract from the model. We updated the system prompt used for ARC-AGI-3 to *explicitly* say context will be carried forward, instead of the original *implicit* nudge
See the exact change on the commit below
This will be the system prompt going forward. The previous 6 models already tested will not be re-tested due to cost (estimated at $40K in API costs)
github.com/arcprize/arc-agi-…
* We also noticed that Opus 4.8 used over $10K (our testing limit) in API costs for Semi-Private testing (55 games). Going forward we will be implementing a hard max per game for costs
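A hard per-game cost cap like the one described above could look roughly like the following. This is a sketch under stated assumptions: the thread does not say how the cap will be set or enforced, so the even split of the $10K limit across 55 games, the `GameBudget` class, and the `BudgetExceeded` exception are all hypothetical.

```python
from dataclasses import dataclass

TOTAL_BUDGET_USD = 10_000  # testing limit mentioned in the thread
NUM_GAMES = 55             # Semi-Private games mentioned in the thread


class BudgetExceeded(Exception):
    """Raised when a charge would push a game past its cap (hypothetical)."""


@dataclass
class GameBudget:
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record an API cost, refusing any charge that would exceed the cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"cap ${self.cap_usd:.2f} would be exceeded "
                f"(spent ${self.spent_usd:.2f}, new charge ${cost_usd:.2f})"
            )
        self.spent_usd += cost_usd


# Assumption: split the overall limit evenly across games.
per_game_cap = TOTAL_BUDGET_USD / NUM_GAMES

budget = GameBudget(cap_usd=per_game_cap)
budget.charge(100.0)       # fine, well under the cap
try:
    budget.charge(100.0)   # would bring the total to $200, over the cap
except BudgetExceeded as err:
    print(f"stopping game: {err}")
```

An even split works out to about $181.82 per game. A real harness might prefer a different allocation, for example a looser cap with a global stop.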
Update default system prompt (#21) · arcprize/arc-agi-3-benchmarking@0138a6a
Reframe the prompt to explicitly invite carry-forward context alongside the action, and clarify that the final action mentioned will be executed.
github.com
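The commit description says the final action mentioned in a reply is the one that gets executed. A minimal sketch of that extraction rule is shown below. It is not the code from the arc-agi-3-benchmarking repository: the token format (`ACTION` followed by digits, plus `RESET`) is inferred from the "ACTION1" example and the RESETs mentioned in this thread, and the function name `final_action` is hypothetical.

```python
import re
from typing import Optional

# Assumed action vocabulary: ACTION<number> or RESET.
ACTION_PATTERN = re.compile(r"\bACTION\d+\b|\bRESET\b")


def final_action(reply: str) -> Optional[str]:
    """Return the last action token mentioned in a model reply, if any."""
    matches = ACTION_PATTERN.findall(reply)
    return matches[-1] if matches else None


reply = (
    "ACTION1 moved the blue piece left, which did not help. "
    "The toggle is to the right, so next I will try ACTION3"
)
print(final_action(reply))
```

With this rule a model can discuss earlier or rejected actions while reasoning out loud, and only the last token it names is treated as its move.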
ARC Prize
@arcprize
2h
- Leaderboard: arcprize.org/leaderboard
- Reproduce the results: github.com/arcprize/arc-agi-… , github.com/arcprize/arc-agi-…
- Testing policy: arcprize.org/policy
- ARC Prize Foundation is hiring: arcprize.org/jobs
Tre Smith
@TRESMITH322
18m
Replying to @arcprize
what's margin of error/ standard deviation? the numbers are so low can you even say if one is ahead of the others?
until someone gets up to 10% pretty much this benchmark is 0%
Daniel Abyan
@abyandaniel
40m
Replying to @arcprize
It's been less than two months since ARC-AGI-3 was created. The acceleration speed is insane. I think you should start working on ARC-AGI-4 now.
Cam Will
@_camronwilliams
1h
Replying to @arcprize
What must be true in order to reach the desired end state?
Find out on the next episode of dragonball z.
P.S
Dynamic...