Migrating from Claude to DeepSeek without breaking everything
Firetiger runs a fleet of agents that investigate incidents, monitor deployments, and dig through telemetry on behalf of our customers. Every one of those agents is, at its core, a loop around an LLM, which means our single biggest cost of goods sold is inference. The models we choose to power our agents affect our whole business: the product experiences we can deliver, how much we need to charge customers, and how we handle increased load and scale.
Claudes of various flavors and vintages have been our models of choice for the past year, primarily served through AWS Bedrock. Recently, with adoption (and costs) growing, we investigated migrating to an open-source, lower-cost model more seriously. Improving token economics lets us build even more product experiences on top of the success of things like Change Monitors and Service Monitors.
Switching from Claude to DeepSeek in production, we realized a 62% reduction in real dollar spend for the first three agent types migrated, going from roughly $606K/yr on Anthropic to $231K/yr on DeepSeek today.
After initial explorations, we focused on the DeepSeek v4 model family as our cheap(er) date.
Just swap the model s/claude/deepseek/g
A naive plan for this change: point our API client at a different endpoint, save a bunch of money, go home.
In a simple chat product you might even get away with that plan! Alas. Our agents run long (both in time and scope), multi-step investigations with dozens of tool calls. Small behavioral differences between models compound over time, and would have a huge impact on overall agent trajectories and product quality.
We formed three hypotheses that the experiment needed to confirm before we could migrate:
DeepSeek models can handle our tasks without a heroic amount of effort.
The metric that matters: task completion accuracy.
DeepSeek models will be cheaper than Claude, measured on a cost-per-task basis.
DeepSeek-powered agents will accomplish their tasks and behave in ways similar to Claude-powered agents. The metrics that matter: steps to completion, time to completion, and error rate.
To start small, we scoped our initial experiment to three representative tasks: user-defined tasks centered on exploring the user's own product, plan generation for monitoring code changes ("given this PR, come up with a plan to monitor the deployment and make sure it does what it's supposed to and doesn't break anything"), and post-hoc root cause analysis.
Measuring accuracy required good evals. Ours come in two flavors. Static evals run locally against stored data and mostly target specific capabilities where we've seen models struggle, while living eval datasets update every week: interesting agent sessions get detected, analyzed, and promoted into the dataset as older cases expire.
With questions and metrics in place, we flipped our model string and API endpoint for a subset of workloads.
Step one: measuring and closing the quality gap
We started by swapping models and running task completion evals. DeepSeek 4.0 Pro scored 65% without reasoning and 80% with it, against 94% for our baseline, Sonnet 4.6. These scores were respectable for a drop-in replacement, but nowhere near shippable. Given the early data, we decided that running DeepSeek with reasoning enabled was non-negotiable.
From here, we ran a self-improvement loop that modified the prompt and tool descriptions. Reviewing each turn and each issue the loop identified showed us where DeepSeek struggles relative to Claude. DeepSeek is clearly a capable model, but a few patterns showed up.
First: when creating a plan to monitor PR changes going to production, it had trouble finding the secondary and tertiary effects the code would have. Call it "inferring non-local dependencies from a local artifact".
When our agents look at a diff, they see:
What the code does.
What the code calls.
What calls the code (with grep).
What they don't see, but need to reason about:
Who reads the data this code produces.
Who depends on the timing this code controls.
What invariants other systems assume about this code's behavior.
Claude tended to ideate about that second list unprompted, while DeepSeek needed to be told to think carefully about second-order effects. Here's the actual change the loop made to the planning prompt:
@@ "Understand the changes" section
Then:
- Read affected files to understand context
- Trace code paths to identify what services or components are affected
- Pay attention to the PR title, description, and user's request
for specific concerns
+ Before researching telemetry, complete this two-step trace:
+ 1. List what this change produces or alters.
+ 2. For each, name what depends on it that does NOT appear in the diff.
+ Those off-page dependents are the plan. If a candidate check
+ only references things the diff touches, you are monitoring
+ the producer, not the change.
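The last rule in that prompt change ("if a candidate check only references things the diff touches, you are monitoring the producer, not the change") can also be read as a simple set test. The sketch below is purely illustrative and is not how our agents implement it: `CandidateCheck`, `monitors_only_producer`, `split_checks`, and the component names are all hypothetical, and it assumes each candidate check can be reduced to the set of components it references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateCheck:
    """Hypothetical model of one proposed monitoring check."""
    name: str
    references: frozenset[str]  # components or data the check looks at


def monitors_only_producer(check: CandidateCheck, diff_touched: frozenset[str]) -> bool:
    """True when every reference is something the diff itself touches.

    A check with no references at all also counts as producer-only,
    because it names no off-page dependent.
    """
    return check.references <= diff_touched


def split_checks(checks, diff_touched):
    """Separate checks that reach an off-page dependent from those that do not."""
    off_page = [c for c in checks if not monitors_only_producer(c, diff_touched)]
    producer_only = [c for c in checks if monitors_only_producer(c, diff_touched)]
    return off_page, producer_only


# Hypothetical example: the diff touches a writer and the table it writes to.
diff_touched = frozenset({"orders_writer", "orders_table"})
checks = [
    CandidateCheck("orders_writer error rate", frozenset({"orders_writer"})),
    CandidateCheck("billing job row counts", frozenset({"orders_table", "billing_job"})),
]
off_page, producer_only = split_checks(checks, diff_touched)
print("off-page:", [c.name for c in off_page])
print("producer-only:", [c.name for c in producer_only])
```

Here the error-rate check only watches the code that changed, so it lands in the producer-only bucket, while the billing check reaches a reader that never appears in the diff. In practice the hard part is step 2 of the trace, naming those dependents in the first place, which is a reasoning task for the model rather than a set operation.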
Second pattern: DeepSeek would...