Agent checkpointing is far from production-grade resiliency | Restate
June 15, 2026
Giselle van Dongen
Agent frameworks now advertise "durable execution" and "resiliency," usually built on checkpointing. But the guarantees you get fall far short of what it takes to deploy long-running agents to production.
Productionizing agents is hard. Agents are long-running, expensive, and fragile. A production agent calls a model that costs money on every invocation, fires tools with side effects (emails, payments, deployments), and might wait days for a human to click approve. In the meantime, processes restart, connections drop, you hit rate limits, and new deploys roll out. So frameworks started adding resiliency features: checkpointing, resumable threads, "durable execution."
These features are useful, but they usually solve only a small part of the problem: recovering completed work. 80% of the hard work is still on the developer's plate.
What agent SDKs mean by "durable"
Strip away the naming differences and the pattern across agent frameworks is the same: periodically snapshot the agent's state to a database, and let the caller resume from the last snapshot. Recovery means loading that state object and starting over from the boundary. The process has no record of where inside your code execution was: it just knows what the state looked like at the last save point.
That same model is used across the ecosystem: LangGraph does node-level checkpointing, Google ADK has event persistence, Mastra does step-level snapshots. Each of these boils down to saving state at boundaries, restarting coarsely, and losing the work in between.
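To make the pattern concrete, here is a minimal sketch of boundary checkpointing. It is not the API of any of the frameworks named above: `CheckpointStore`, `run_graph`, and the `plan`/`act` nodes are hypothetical names, and an in-memory SQLite table stands in for the database. It shows that resuming after a crash inside a node re-runs that whole node, side effect included.

```python
import json
import sqlite3


class CheckpointStore:
    """Hypothetical checkpoint table; in-memory SQLite stands in for a real database."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE checkpoints (run_id TEXT PRIMARY KEY, next_node INTEGER, state TEXT)"
        )

    def save(self, run_id, next_node, state):
        self.db.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (run_id, next_node, json.dumps(state)),
        )
        self.db.commit()

    def load(self, run_id):
        row = self.db.execute(
            "SELECT next_node, state FROM checkpoints WHERE run_id = ?", (run_id,)
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (0, {})


def run_graph(store, run_id, nodes):
    # Resume from the last saved boundary; nothing inside a node is recorded.
    next_node, state = store.load(run_id)
    for i in range(next_node, len(nodes)):
        state = nodes[i](state)
        store.save(run_id, i + 1, state)  # snapshot only after the node completes
    return state


side_effects = []
crash_once = {"armed": True}


def plan(state):
    return {**state, "plan": "notify user"}


def act(state):
    side_effects.append("email sent")  # side effect happens inside the node
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("process died mid-node")
    return {**state, "done": True}


store = CheckpointStore()
try:
    run_graph(store, "run-1", [plan, act])
except RuntimeError:
    pass  # simulated crash

# Something external has to call this again with the same run ID.
final_state = run_graph(store, "run-1", [plan, act])
```

After the second call, `plan` is skipped because its boundary was saved, but `side_effects` contains the email twice: the checkpoint knew the state before `act`, not that the email had already gone out.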
The part of the iceberg below the checkpoint
Checkpoints help with recovery, but agents are distributed applications, and making them durable requires more than that.
Recovery should land on the failing step, not the nearest boundary.
A checkpoint can only take you back to a boundary: it doesn't recover execution, it restarts it. If a tool performs three calls (call an API, send an email, write to a database) and the process dies after the second, restarting from the checkpoint re-runs all three, because nothing ever recorded that the first two happened. A checkpoint describes state at a boundary, not how far your code got in between.
The same goes for parallel work. Imagine a planner agent that kicks off 10 parallel researcher agents, each doing an hour of work. Five have finished when the sixth fails. What gets recovered depends on the exact checkpointing implementation, but often some or all of the researchers redo their work.
What we actually want: real durable execution. Persist every step the code executes (each LLM call, tool call, sleep, RPC). On recovery, the code re-runs, but completed steps return their journaled results instead of executing again, so execution fast-forwards to the exact step that didn't finish and continues live from there. The three-call tool resumes at call three.
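The journaling idea can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of any particular durable execution engine: `Journal`, `Context`, and `ctx.run` are hypothetical names, a dict stands in for durable storage, and steps are assumed to execute in the same order on every attempt.

```python
class Journal:
    """Hypothetical durable journal; a dict stands in for persistent storage."""

    def __init__(self):
        self.entries = {}


class Context:
    """Created fresh for each attempt; replays against the shared journal."""

    def __init__(self, journal):
        self.journal = journal
        self.position = 0

    def run(self, name, action):
        # Steps are keyed by position and name, assuming a deterministic step order.
        key = (self.position, name)
        self.position += 1
        if key in self.journal.entries:
            return self.journal.entries[key]  # replay: return the recorded result
        result = action()  # live: execute, then record
        self.journal.entries[key] = result
        return result


executed = []


def record(label):
    executed.append(label)
    return f"{label}-ok"


def three_call_tool(ctx, crash_before_db_write=False):
    ctx.run("call_api", lambda: record("api"))
    ctx.run("send_email", lambda: record("email"))
    if crash_before_db_write:
        raise RuntimeError("process died after the second call")
    return ctx.run("write_db", lambda: record("db"))


journal = Journal()
try:
    three_call_tool(Context(journal), crash_before_db_write=True)
except RuntimeError:
    pass  # simulated crash

# The retry re-runs the code, but the first two steps are served from the journal.
result = three_call_tool(Context(journal))
```

Across both attempts, each of the three calls executes exactly once. A real system would also have to persist the journal atomically with each step and deal with a crash between executing a side effect and recording it, which this sketch ignores.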
Someone has to notice the failure.
A library cannot supervise itself. It dies together with the process. With a checkpoint SDK, "recovery" means something else must detect that a run was ongoing and needs to be retried, and it must know how to restart it.
With checkpointing, this is often left up to you. You must either have your upstream client detect failures and retry, or put a queue in front of your agent service. You need to generate a stable ID for each run and, on a retry, deduce the right ID and dispatch the agent.run(...) task again. This is complex and error-prone, it adds infrastructure to deploy and operate, and it contributes nothing to your use case.
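As a rough illustration of that plumbing, the sketch below derives a stable run ID from the request and scans for runs whose heartbeat has gone stale. Everything here is hypothetical: `stable_run_id`, `RunRegistry`, `supervise`, and the 30-second timeout are assumptions for the example, and a dict stands in for a durable table.

```python
import hashlib


def stable_run_id(user_id, request_body):
    """Same logical request -> same ID, so a retry resumes the run instead of forking it."""
    digest = hashlib.sha256(f"{user_id}:{request_body}".encode()).hexdigest()
    return f"run-{digest[:16]}"


class RunRegistry:
    """Hypothetical durable table of runs; a dict stands in for a database."""

    def __init__(self):
        self.runs = {}

    def heartbeat(self, run_id, now):
        self.runs[run_id] = {"status": "running", "last_seen": now}

    def complete(self, run_id):
        self.runs[run_id]["status"] = "done"

    def stalled(self, now, timeout_s):
        return [
            run_id
            for run_id, run in self.runs.items()
            if run["status"] == "running" and now - run["last_seen"] > timeout_s
        ]


def supervise(registry, dispatch, now, timeout_s=30.0):
    """Runs outside the agent process: find stalled runs and dispatch them again."""
    redispatched = registry.stalled(now, timeout_s)
    for run_id in redispatched:
        dispatch(run_id)  # e.g. enqueue a task that calls agent.run(...) with this ID
    return redispatched


registry = RunRegistry()
run_a = stable_run_id("user-1", "summarize report")
run_b = stable_run_id("user-2", "book flight")
registry.heartbeat(run_a, now=0.0)
registry.heartbeat(run_b, now=0.0)
registry.complete(run_b)  # run_b finished; run_a's process died silently

dispatched = []
supervise(registry, dispatched.append, now=60.0)
```

Only `run_a` is redispatched. Note what the sketch leaves out: the supervisor itself needs to be highly available, the heartbeat timeout puts a floor on detection latency, and two workers can briefly run the same ID at once.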
What we actually want: A durable orchestrator that detects failures immediately and automatically redispatches in milliseconds.
Retries have to survive the thing that's failing.
In-process retry state, like an attempt counter in a local variable or a backoff implemented as sleep(), dies with the process. Retry policies, attempt counts, and backoff schedules belong in the same durable substrate as the execution itself, including the end state.
What we actually want: Retry policies that are respected across process failures, restarts, and redeploys, that intelligently treat some errors as retryable and others as terminal, and that can be defined at a fine-grained level for the different actions and side effects in our system.
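A minimal sketch of retry state that lives outside the process, assuming a dict as a stand-in for durable storage. `TerminalError`, `RetryPolicy`, `DurableRetryState`, and `attempt_step` are hypothetical names for illustration. The attempt count and the next retry time are stored per step, so a restarted process picks up the backoff schedule where the previous one left off instead of starting from attempt one.

```python
class TerminalError(Exception):
    """Hypothetical marker for failures that must not be retried."""


class RetryPolicy:
    def __init__(self, max_attempts, base_delay_s, factor=2.0):
        self.max_attempts = max_attempts
        self.base_delay_s = base_delay_s
        self.factor = factor

    def delay_for(self, attempt):
        # Exponential backoff: base, base * factor, base * factor^2, ...
        return self.base_delay_s * self.factor ** (attempt - 1)


class DurableRetryState:
    """Hypothetical durable store of per-step retry records; a dict stands in for it."""

    def __init__(self):
        self.records = {}


def attempt_step(store, key, policy, action, now):
    record = store.records.setdefault(
        key, {"attempts": 0, "status": "pending", "retry_at": now}
    )
    if record["status"] in ("done", "failed"):
        return record  # outcome already decided by an earlier process
    if now < record["retry_at"]:
        return record  # backoff not elapsed; the deadline is stored, not slept on
    record["attempts"] += 1
    try:
        record["result"] = action()
        record["status"] = "done"
    except TerminalError:
        record["status"] = "failed"  # never retried
    except Exception:
        if record["attempts"] >= policy.max_attempts:
            record["status"] = "failed"
        else:
            record["retry_at"] = now + policy.delay_for(record["attempts"])
    return record


calls = {"n": 0}


def flaky_payment():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "charged"


def invalid_card():
    raise TerminalError("card declined")


store = DurableRetryState()
policy = RetryPolicy(max_attempts=5, base_delay_s=1.0)

# Each call could come from a different process; only `store` is shared.
for now in (0.0, 0.5, 1.0, 3.0):
    payment = attempt_step(store, ("run-1", "charge"), policy, flaky_payment, now)

terminal = attempt_step(store, ("run-1", "validate"), policy, invalid_card, now=0.0)
```

The call at `now=0.5` is skipped because the stored backoff deadline has not passed, and the terminal error stops after a single attempt. Keying records by run and step is what makes per-action policies possible; a different `RetryPolicy` could be passed for each key.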
The same durability guarantees should cover the entire workflow.
Most agents end up as deterministic workflows in which some steps are agentic and others are not (write to DB, send email, …).
To cover those non-agentic steps, the checkpointing solution needs to extend beyond agent.run(). This scenario is so common that many agent SDKs are turning into workflow orchestrators to fill this gap.
But they don’t give the same guarantees as the workflow orchestrators we have already been using for years,...