Testing Claude Sonnet 5's agentic claims

Claude Sonnet 5: Testing Anthropic's "Most Agentic" Claim



Reynaldi Chernando

July 3, 2026


We recently added Claude Sonnet 5 to Puter.js. Anthropic's pitch for the model is Opus 4.8-level performance at a lower price.

On pricing, people quickly found out that in some cases it's more expensive than Opus. Artificial Analysis's benchmark shows Sonnet 5's cost per task coming in higher than Opus 4.8:

"Claude Sonnet 5 costs more per task than Opus 4.8 before accounting for promotional pricing" (Artificial Analysis).

While most of the attention is on the price, we tested the other thing Anthropic claims. Sonnet 5 is supposed to be highly agentic, the "most agentic Sonnet model yet" per the announcement: able to finish complex tasks where other models stop short, and to check its own output without being explicitly asked.

What Does Highly Agentic Even Mean?

"Agentic" itself is marketing language, but one aspect of it is the model's ability to follow through on a plan or multi-step action without stopping midway.

The other aspect is autonomy, where the model makes reasonable decisions on its own without you having to step in and clarify every 30 seconds.

For coding, the two together mean you can give a simple prompt and the model will go off and fix things on its own.

In one testimonial from Anthropic's announcement, an engineer asked Sonnet 5 to investigate a bug. Without being asked, the model wrote a test reproducing the bug, implemented the fix, then removed the fix to confirm the bug came back, all in a single pass.

Testing the Agentic Capability with a Messy Task

We gave both Sonnet 5 and its predecessor, Sonnet 4.6, this prompt in a fresh chat. It's deliberately short and underspecified, the way prompts usually get typed, so the model has to make decisions on its own:

Give everyone with 5+ years at the company a 7% raise, everyone else 4%, and draft a quick note to finance.

Sam, $85,000, 03-12-2018
Riley, $72,000, 25-04-2019
Jordan, $94,000, 06-07-2021
Casey, $105,000, 11-08-2016
Morgan, $77,000, 09-30-2022

We deliberately set up a trap in the data:

The hire dates secretly mix two formats. Riley's date (25-04-2019) can only be DD-MM-YYYY, since there is no month 25. Morgan's date (09-30-2022) can only be MM-DD-YYYY, since there is no month 30. The sheet is internally inconsistent.

One employee's date (06-07-2021) is ambiguous in a way that actually matters. We ran the test on July 3, 2026. If the date is June 7, Jordan is just past the 5-year mark and gets 7%. If it's July 6, they miss it by three days and get 4%.
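The format mix described above can be checked mechanically. The following is a minimal sketch, not part of the original test: it tries each hire date as both DD-MM-YYYY and MM-DD-YYYY and keeps whichever readings are real calendar dates. The names `HIRE_DATES` and `valid_readings` are hypothetical and exist only for illustration.

```python
from datetime import date

# Hire dates copied from the prompt above.
HIRE_DATES = {
    "Sam": "03-12-2018",
    "Riley": "25-04-2019",
    "Jordan": "06-07-2021",
    "Casey": "11-08-2016",
    "Morgan": "09-30-2022",
}


def valid_readings(raw: str) -> dict[str, date]:
    """Return every calendar-valid reading of an 'NN-NN-YYYY' string."""
    first, second, year = (int(part) for part in raw.split("-"))
    candidates = {
        "DD-MM-YYYY": (first, second),  # stored as (day, month)
        "MM-DD-YYYY": (second, first),  # stored as (day, month)
    }
    readings = {}
    for label, (day, month) in candidates.items():
        try:
            readings[label] = date(year, month, day)
        except ValueError:
            pass  # e.g. month 25 or month 30 does not exist
    return readings


# Rows with exactly one valid reading force a format.
forced = {
    name: next(iter(valid_readings(raw)))
    for name, raw in HIRE_DATES.items()
    if len(valid_readings(raw)) == 1
}
# Rows with two valid readings cannot be resolved from the string alone.
ambiguous = [
    name for name, raw in HIRE_DATES.items() if len(valid_readings(raw)) == 2
]
# The sheet contradicts itself if different rows force different formats.
sheet_is_inconsistent = len(set(forced.values())) > 1

print("forced:", forced)
print("ambiguous:", ambiguous)
print("inconsistent:", sheet_is_inconsistent)
```

On this data, Riley's row forces DD-MM-YYYY and Morgan's forces MM-DD-YYYY, so no single format fits the whole sheet, and the remaining three rows are valid under either reading.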

Any model at this level can do the math. The test is whether the model notices, unprompted, that a decision is hiding in the data.
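One way to make "a decision hiding in the data" concrete is to ask, for each employee, whether the raise tier changes depending on how the date is read. The sketch below does that under two assumptions that are not stated in the prompt: "5+ years" is taken to mean the fifth anniversary falls on or before the test date, and tenure is measured as of July 3, 2026. All function and variable names are hypothetical.

```python
from datetime import date

AS_OF = date(2026, 7, 3)  # test date stated in the article

# name: (salary in dollars, hire date exactly as written in the prompt)
EMPLOYEES = {
    "Sam": (85_000, "03-12-2018"),
    "Riley": (72_000, "25-04-2019"),
    "Jordan": (94_000, "06-07-2021"),
    "Casey": (105_000, "11-08-2016"),
    "Morgan": (77_000, "09-30-2022"),
}


def possible_dates(raw: str) -> set[date]:
    """All calendar-valid readings of the string as DD-MM or MM-DD."""
    first, second, year = (int(part) for part in raw.split("-"))
    found = set()
    for day, month in ((first, second), (second, first)):
        try:
            found.add(date(year, month, day))
        except ValueError:
            pass  # this reading is not a real date
    return found


def raise_percent(hired: date, as_of: date = AS_OF) -> int:
    """7 if the fifth anniversary is on or before as_of (assumption), else 4."""
    try:
        fifth = hired.replace(year=hired.year + 5)
    except ValueError:  # Feb 29 hire date with no Feb 29 five years later
        fifth = date(hired.year + 5, 3, 1)
    return 7 if fifth <= as_of else 4


def possible_raises(raw: str) -> set[int]:
    return {raise_percent(hired) for hired in possible_dates(raw)}


# Employees whose raise depends on which reading is chosen.
needs_clarification = [
    name for name, (_, raw) in EMPLOYEES.items() if len(possible_raises(raw)) > 1
]

for name in needs_clarification:
    salary, raw = EMPLOYEES[name]
    for hired in sorted(possible_dates(raw)):
        pct = raise_percent(hired)
        # Integer arithmetic avoids floating-point rounding on dollar amounts.
        print(name, hired.isoformat(), f"{pct}%", salary * (100 + pct) // 100)
```

Under these assumptions only Jordan is flagged: the June 7 reading gives 7% ($100,580) and the July 6 reading gives 4% ($97,760). Every other employee lands in the same tier whichever way their date is read.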

What We Found: Only Sonnet 5 Caught the Trap

On multi-step follow-through, both models pass. Sonnet 5 and Sonnet 4.6 completed the full task (calculating tenure and raises, as well as writing the finance note) without stalling or asking permission. Completing the steps is pretty standard at the frontier now.

But Sonnet 4.6 fell straight into the trap. It parsed one date as DD-MM and another as MM-DD in the same table and never noticed the contradiction. For the ambiguous employee, it silently picked one interpretation and stated the result as fact:

| Jordan | Jul 6, 2021 | 4.9 yrs | 4% | $94,000 | $97,760 |

It even flagged that employee, but for the wrong reason:

One note: Jordan's start date (Jul 6, 2021) lands at just under 5 years as of today, so they fell into the 4% bucket — worth double-checking if you want to count that as a borderline case.

It sensed risk at the 5-year cutoff, but never saw the actual problem, which is that the date itself is ambiguous.
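The mixed parsing can be read off Sonnet 4.6's own table by comparing each raw string from the prompt with the start date the model reported in its full output. This is a rough sketch for illustration only; `MODEL_ROWS` and `format_applied` are hypothetical names, and the parsed dates are transcribed by hand from that output.

```python
from datetime import date

# name: (raw string from the prompt, start date reported by Sonnet 4.6)
MODEL_ROWS = {
    "Sam": ("03-12-2018", date(2018, 3, 12)),
    "Riley": ("25-04-2019", date(2019, 4, 25)),
    "Jordan": ("06-07-2021", date(2021, 7, 6)),
    "Casey": ("11-08-2016", date(2016, 8, 11)),
    "Morgan": ("09-30-2022", date(2022, 9, 30)),
}


def format_applied(raw: str, parsed: date) -> list[str]:
    """Which reading(s) of the raw string produce the parsed date."""
    first, second, _year = (int(part) for part in raw.split("-"))
    labels = []
    if (parsed.day, parsed.month) == (first, second):
        labels.append("DD-MM-YYYY")
    if (parsed.month, parsed.day) == (first, second):
        labels.append("MM-DD-YYYY")
    return labels


applied = {
    name: format_applied(raw, parsed) for name, (raw, parsed) in MODEL_ROWS.items()
}
# More than one distinct format means the table was parsed inconsistently.
formats_in_table = {label for labels in applied.values() for label in labels}

for name, labels in applied.items():
    print(name, labels)
print("mixed formats:", len(formats_in_table) > 1)
```

On this data, Sam's row comes out as MM-DD-YYYY while Jordan's and Casey's come out as DD-MM-YYYY, even though all three strings are valid under either format.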

Sonnet 4.6, full output (verbatim)

Here are the updated salaries, calculated as of today (July 3, 2026):

| Name | Start Date | Years | Raise | Old Salary | New Salary |
|------|-----------|-------|-------|-----------|-----------|
| Sam | Mar 12, 2018 | 8.3 yrs | 7% | $85,000 | $90,950 |
| Riley | Apr 25, 2019 | 7.2 yrs | 7% | $72,000 | $77,040 |
| Jordan | Jul 6, 2021 | 4.9 yrs | 4% | $94,000 | $97,760 |
| Casey | Aug 11, 2016 | 9.9 yrs | 7% | $105,000 | $112,350 |
| Morgan | Sep 30, 2022 | 3.8 yrs | 4% | $77,000 | $80,080 |

**5+ years (7%):** Sam, Riley, Casey — Jordan just misses the cutoff at ~4.9 years.

Here's a draft note to finance:

One note: Jordan's start date (Jul 6, 2021) lands at just under 5 years as of today, so they fell into the 4% bucket — worth double-checking if you want to count that...
