We had to build new evals for Fable


Claude Fable 5 is the first model since Opus 4.5 to meaningfully improve at analytical reasoning

Izzy Miller | Data | June 9, 2026


Today Anthropic is releasing Claude Fable 5, the first publicly available Mythos-class model. It’s the first model in a long time that we’ve felt is a step change on the kind of tasks we care about: difficult data analysis in complex, realistic (read: broken & messy) warehouse environments. We’ll be rolling out Fable in Hex this week (although it requires a pre-step for admins to enable; see more below).

The first model in a long time that feels like a step change

Fable performs very well on our standard set of analytical evals. Claude Fable 5 (high reasoning) scores 93% on Analytical Hard, 93.5% on Semantically Modeled, and 65% on Semantically Unmodeled, which is approaching the realistic single-turn ceiling for that set.

Interestingly, on some of our non-frontier tasks (more on what this means later), we see performance decrease at max effort. On these shorter-horizon tasks, max effort occasionally leads to overthinking: the model second-guesses itself and ultimately performs worse in a small number of cases. This does not appear in the more challenging semantically unmodeled set and seems to be an artifact of using max effort on easier but unverifiable tasks.

Compared to Opus 4.7, Fable 5 is a significant improvement, even with no changes yet to the prompting or harness. Previous model bumps like Opus 4.6, 4.7, and 4.8 have contributed single-digit (and sometimes negative or within noise tolerance) improvements to these sets.

What these evals actually measure:

Semantically modeled: Questions that can be answered using a clean semantic model. Answering them requires avoiding many pitfalls and quirks of the data and determining the best definition of vague requirements.

Semantically unmodeled: Questions that cannot be answered using the semantic model alone. An agent must do analysis using raw tables in a complex and intentionally confusing environment, synthesizing a lot of disparate context to do things correctly.

Analytical hard: Evaluates an agent’s ability to answer questions correctly even when retrieving all relevant context does not resolve certain complications. Agents must make correct assumptions and actually discover things about the dataset in order to perform well.

Why does Fable perform so well on these evals?

From what we’ve seen, there are three main contributors:

1. It’s just better at the intuitive little stuff that makes all the difference in analytics. It is a “better analyst”, with all the je ne sais quoi that comes with that: it knows when to double-check without being overly paranoid, has a good nose for which way to slice and dice a problem, and is a much better analytical communicator.
2. It is much better at leveraging what we think of as the “golden workflow”, where an analysis begins in the semantic layer, and if it needs to deviate out into raw data or downstream transformation, the final results are carefully framed and compared to the original semantically modeled data. This is how everyone should work, but earlier models often fail here, forgetting to cross-check a final number derived from SQL queries back to relevant semantic ground truth.

3. It’s much better than other models at understanding and defining the assumptions it’s making as it works, and often offers alternatives or further depth to users.

Here are some examples of what that looks like in practice:

Example 1: Minimum total MRR

On this eval, Opus discovers an interesting quirk of the data and presents it as the primary answer, despite it being obviously (to a human’s eyes) the caveat/footnote that should be attached to the correct answer: SMB. Fable correctly presents the primary finding up front and clearly, and adds some elegant “notes on scope” in which it explains how it defined its terms, points out that consumer data quirk, and notes and proactively presents an alternative definition. Opus wasn’t flat-out wrong, and it would even be tempting to mark it as passing, until you see it side-by-side with Fable’s work and realize what the better analytical behavior here is. This pattern plays out reliably across all our evals.

Example 2: Median Refund Request by channel

This next eval cannot be answered purely using our semantic layer, though there are helpful partial results available there. Here, Opus returns raw data without realizing there’s an obvious (to my eyes!) cents-for-dollars bug affecting this raw table. Instead of understanding the issue correctly or cross-checking against related semantic models, it assumes that these must be “partial/line-item refund requests” and presents the misleading data as-is.
Fable is able to start in the semantic layer, move out to raw SQL for transformations, and...
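To make the cross-check in Example 2 concrete, here is a minimal Python sketch of comparing a raw-table median against a semantic-layer reference and flagging a roughly 100x gap as a probable cents-for-dollars mismatch. This is an illustration only, not how Hex or Fable implements anything: the function name `check_against_semantic`, the 25% tolerance, and the sample numbers are all hypothetical.

```python
from statistics import median


def check_against_semantic(raw_values, semantic_reference, tolerance=0.25):
    """Compare a raw-table median with a semantic-layer reference value.

    Hypothetical helper for illustration. Returns (status, value), where
    status is "consistent", "likely_cents", or "mismatch".
    """
    if semantic_reference <= 0:
        raise ValueError("semantic_reference must be positive")

    raw_median = median(raw_values)
    ratio = raw_median / semantic_reference

    # Raw result agrees with the semantic layer within the tolerance.
    if abs(ratio - 1) <= tolerance:
        return "consistent", raw_median

    # Raw result is about 100x the reference: the raw table is probably
    # stored in cents while the semantic layer reports dollars.
    if abs(ratio / 100 - 1) <= tolerance:
        return "likely_cents", raw_median / 100

    # Anything else needs a human (or agent) to investigate further.
    return "mismatch", raw_median


# Made-up sample data: raw refund amounts for one channel, and an
# assumed semantic-layer figure of about 30 dollars for the same metric.
status, value = check_against_semantic([2500, 4000, 3150], 30.0)
print(status, value)
```

The point of the sketch is the order of operations: the raw number is never reported on its own, only after it has been compared with the closest semantic ground truth and any discrepancy has been explained.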
