Claude Fable 5 vs. GPT-5.5: Better Planning, Similar Execution



Kilo Blog



Darko and Job Rietbergen
Jun 13, 2026


Update: We wrote this post on June 11 and published it on June 13. Anthropic has since disabled access to Claude Fable 5 after a US government directive, which makes some of these results even more relevant. Fable 5 was a strong model, especially at planning, but our testing did not show the massive jump in coding ability that many people were pitching on social media. Once we had a detailed plan, GPT-5.5 performed similarly on execution.

The post:

Anthropic released Claude Fable 5, a Mythos-class model positioned for long-running agentic work and ambitious coding.

Instead of doing yet another end-to-end coding comparison against GPT-5.5, we split the work into two rounds. Both models planned the same service, we scored the plans against a rubric, and then both models implemented the winning plan from identical starting points in Kilo Code CLI.

TL;DR: Claude Fable 5 wrote the better plan (9.1 vs 8.3 on our rubric), but when both models implemented that same plan, both passed all 15 of our acceptance checks and produced identical rollout behavior, with GPT-5.5 spending $6.30 to Claude Fable 5’s $16.66. Planning with Claude Fable 5 and implementing with GPT-5.5 produced the same service for 59% less than using Claude Fable 5 for both phases.

Why Split Planning From Implementation

Most model comparisons run end-to-end, which makes it hard to tell whether a bad result came from a bad plan or bad execution. Separating the phases lets us measure three things with the same inputs. How do the models compare at planning? How do they compare when implementing the exact same plan? And does mixing them (one model plans, the other implements) actually work?

That last question matters for cost. The two models sit at meaningfully different price points:

Both of these are frontier models. GPT-5.5 is OpenAI’s newest flagship and a strong coding model in its own right, at a lower per-token price. The question is whether the most expensive model on the market needs to be in both phases of the workflow.

Our Test Setup

We asked both models to plan a feature flag service, an internal tool where you turn features on for a percentage of your users and ramp that percentage up over time.

We picked this task because it hides a real correctness trap. Percentage rollouts must be sticky (the same user always gets the same answer) and growing a rollout from 20% to 40% must keep the original 20% of users enabled, all without storing any per-user state. A plan that hand-waves this with “use a hash” leaves the hard decision to the implementer. A plan that specifies the exact bucketing math removes it.

Each model got the same prompt in a fresh Kilo Code CLI session, both at High reasoning:

I’m building a feature flag service using Bun, Hono, TypeScript, and better-sqlite3. It needs to support boolean flags and percentage-based rollouts, scoped per environment (dev, staging, production). Requirements:

- CRUD endpoints for managing flags and their per-environment configurations
- An evaluation endpoint that takes a flag key, environment, and user ID, and returns whether the flag is on for that user. Percentage rollouts must be sticky, meaning the same user ID always gets the same result for the same flag at the same rollout percentage, with no per-user state stored in the database
- Increasing a rollout from 20% to 40% must keep the original 20% of users enabled
- An in-memory cache for flag configs on the evaluation path, with invalidation when a flag changes
- An audit log recording every flag change (who, what, when, before/after values)
- API key authentication for the management endpoints, with keys stored hashed

Please write me a very detailed plan in plan.md that I can hand to a developer to build from.

Kilo Code CLI session running the planning prompt with Claude Fable 5.

Let’s see the results.

Round 1: Planning

Both planning runs finished in about two and a half minutes.

Both Fable 5 and GPT-5.5 got the hard requirement right, and they converged on the same core algorithm: hash the flag key and user ID into one of 10,000 buckets, then enable the user if their bucket falls below the rollout percentage. Raising the percentage only adds buckets, so the original users stay enabled. Both plans explained the math and specified tests to prove it.

The gap came from everything around the algorithm. We scored both plans against a weighted rubric covering rollout correctness, reliability design, security, decomposition, implementability, operational clarity, and communication. We defined the criteria when we designed the prompt, before either plan existed, since each requirement in the prompt maps to one of them.
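To make the bucketing scheme both plans converged on concrete, here is a minimal sketch in TypeScript. It is not taken from either model's plan: the SHA-256 hash, the `:` separator, and the names `bucketFor` and `isEnabled` are our assumptions for illustration. Only the 10,000-bucket count and the "bucket below the rollout threshold" rule come from the description above.

```typescript
import { createHash } from "node:crypto";

// 10,000 buckets gives 0.01% granularity for rollout percentages.
const BUCKET_COUNT = 10_000;

// Hypothetical helper: deterministically maps (flagKey, userId) to a bucket
// in [0, BUCKET_COUNT). SHA-256 and the ":" separator are assumptions; any
// stable, well-distributed hash would serve the same purpose.
function bucketFor(flagKey: string, userId: string): number {
  const digest = createHash("sha256").update(`${flagKey}:${userId}`).digest();
  // Use the first 4 bytes as an unsigned 32-bit integer, then reduce to a bucket.
  return digest.readUInt32BE(0) % BUCKET_COUNT;
}

// Hypothetical helper: a user is enabled when their bucket falls below the
// threshold implied by the rollout percentage (0 to 100).
function isEnabled(flagKey: string, userId: string, rolloutPercent: number): boolean {
  const threshold = Math.round(rolloutPercent * (BUCKET_COUNT / 100));
  return bucketFor(flagKey, userId) < threshold;
}
```

Because a user's bucket depends only on the flag key and user ID, the answer is sticky without any per-user state. Raising the rollout from 20% to 40% moves the threshold from 2,000 to 4,000, so every bucket that was enabled before is still enabled. Including the flag key in the hash input also means a user who lands in an early bucket for one flag does not land early for every flag.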

Two criteria drove the result.

Reliability design. Claude Fable 5’s plan caught failure modes that GPT-5.5’s never mentioned.

The...

