Postgres Is All You Need for Durable Execution

Postgres Is All You Need for Durable Workflow Execution

Peter Kraft

May 20, 2026

DBOS Architecture

Durable execution is a simple but powerful tool for building reliable programs. The idea is that as your program runs, you regularly checkpoint its progress to a database. That way, if your program ever crashes or fails, you can recover it from its last completed step. You can think of this like saving in a video game: you regularly “save” your program’s progress so that if it crashes, you can “reload” it from its last checkpoint. This is most valuable when restarting the workflow from the beginning after a failure would be costly (e.g., LLM token usage in AI workflows) or incorrect (e.g., placing duplicate orders in an e-commerce workflow).

Most commonly, durable execution is implemented via external orchestration. This is the pattern used by popular systems like Temporal, Apache Airflow, and AWS Step Functions. In this model, durable programs are written as workflows of steps whose execution is coordinated by a central orchestrator. When a client submits a workflow, the orchestrator creates a record for it in a data store, then dispatches it to a worker for execution. Each time a worker completes a step, it sends the step’s outcome back to the orchestrator. The orchestrator checkpoints the output in its data store, then dispatches the next step. If a worker crashes or fails, the orchestrator dispatches its workflows to another worker, starting them from their last checkpointed step.
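To make the checkpoint-and-reload idea concrete, here is a minimal sketch. It is not taken from any of the systems named above: the `checkpoints` table, `run_workflow`, and the step functions are hypothetical names, and Python's built-in `sqlite3` stands in for the database so the example runs without a server.

```python
import json
import sqlite3

def make_store():
    # In-memory SQLite stands in for the checkpoint database (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE checkpoints ("
        " workflow_id TEXT, step_index INTEGER, output TEXT,"
        " PRIMARY KEY (workflow_id, step_index))"
    )
    return conn

def run_workflow(conn, workflow_id, steps, value):
    """Run steps in order, skipping any step whose output is already saved."""
    executed = []
    for index, step in enumerate(steps):
        row = conn.execute(
            "SELECT output FROM checkpoints"
            " WHERE workflow_id = ? AND step_index = ?",
            (workflow_id, index),
        ).fetchone()
        if row is not None:
            value = json.loads(row[0])  # "reload": reuse the saved output
            continue
        value = step(value)
        with conn:  # "save": commit the step's output before moving on
            conn.execute(
                "INSERT INTO checkpoints VALUES (?, ?, ?)",
                (workflow_id, index, json.dumps(value)),
            )
        executed.append(step.__name__)
    return value, executed

# Hypothetical steps; the third one fails on its first attempt.
attempts = {"count": 0}

def add_one(x):
    return x + 1

def double(x):
    return x * 2

def flaky_square(x):
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated crash")
    return x * x

conn = make_store()
steps = [add_one, double, flaky_square]
try:
    run_workflow(conn, "wf-1", steps, 3)
except RuntimeError:
    pass  # the "crash": the first two steps are already checkpointed
result, executed = run_workflow(conn, "wf-1", steps, 3)
print(result, executed)  # prints: 64 ['flaky_square']
```

On the second run, only the failed step executes; the first two are replayed from their checkpoints rather than re-run.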

In this blog post, we’ll argue that external orchestration is fundamentally overcomplicated. The core idea of durable execution is to checkpoint program state in a database. But if durable execution is about databases, then there’s no reason to have a separate orchestrator server. Instead, it’s simpler and more efficient to use the database itself as an orchestrator.

To make this more concrete, we’ll focus specifically on building durable execution on Postgres, because its popularity, scalability, and rich ecosystem make it an ideal choice. In a Postgres-backed durable execution system, application servers communicate directly with Postgres to execute workflows instead of going through a central orchestrator. A client submits a workflow for execution by creating an entry for it in a Postgres workflows table. Application servers poll the table for workflows to dequeue and execute. As a server executes a workflow, it checkpoints the output of each step to Postgres. If a server executing workflows crashes or fails, another server can recover its workflows from their checkpoints.
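The submit, poll, and recover flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the DBOS implementation: the `workflows` schema, the status values, and the `submit`, `dequeue`, and `recover` helpers are all hypothetical, and `sqlite3` stands in for Postgres so the sketch is runnable. How a crashed server is detected is out of scope here.

```python
import sqlite3

def make_db():
    # SQLite stands in for Postgres; the schema is a hypothetical example.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflows ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " name TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'PENDING',"
        " executor TEXT)"
    )
    return conn

def submit(conn, name):
    """Client side: enqueue a workflow by inserting a row."""
    with conn:
        cur = conn.execute("INSERT INTO workflows (name) VALUES (?)", (name,))
    return cur.lastrowid

def dequeue(conn, executor):
    """Server side: claim the oldest pending workflow, or return None."""
    with conn:
        row = conn.execute(
            "SELECT id, name FROM workflows"
            " WHERE status = 'PENDING' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim conditional, so a row that is no
        # longer pending is not claimed twice. On Postgres, a locking clause
        # such as SELECT ... FOR UPDATE SKIP LOCKED is a common choice here.
        claimed = conn.execute(
            "UPDATE workflows SET status = 'RUNNING', executor = ?"
            " WHERE id = ? AND status = 'PENDING'",
            (executor, row[0]),
        )
        return row if claimed.rowcount == 1 else None

def recover(conn, dead_executor):
    """Return a failed server's running workflows to the queue."""
    with conn:
        cur = conn.execute(
            "UPDATE workflows SET status = 'PENDING', executor = NULL"
            " WHERE status = 'RUNNING' AND executor = ?",
            (dead_executor,),
        )
    return cur.rowcount

conn = make_db()
submit(conn, "process_order")
first = dequeue(conn, "server-a")       # server-a claims the workflow
second = dequeue(conn, "server-b")      # nothing left to claim
recovered = recover(conn, "server-a")   # server-a is presumed dead
third = dequeue(conn, "server-b")       # server-b picks the workflow up
print(first, second, recovered, third)
```

Every operation is an ordinary query against one table, which is the point of the design: there is no separate dispatch service to run or scale.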

This design renders a central orchestrator unnecessary because application servers can coordinate through Postgres. Instead of relying on a central orchestrator to dispatch workflows to workers, servers cooperatively dequeue workflows from a Postgres table, using mechanisms such as locking clauses to ensure each workflow is dequeued by exactly one worker. Instead of relying on an orchestrator to checkpoint step outputs, workers checkpoint steps to Postgres themselves. If multiple workers try to execute the same workflow simultaneously, Postgres integrity constraints let them detect the duplicate work on checkpoint and back off.

Replacing a central orchestrator with Postgres (or another database) makes durable execution fundamentally simpler. In particular, it means hard problems such as scalability, availability, observability, and security can be addressed using well-understood Postgres-native solutions. In the rest of this post, we’ll talk about how.

Scalability and Availability

The scalability and availability of a database-backed durable execution system are fundamentally determined by the underlying database. The system can scale horizontally by adding more worker servers, so its maximum capacity is determined by how quickly the database can process workflows. Similarly, workers are fungible and can freely recover each other’s state, so the system is available as long as the underlying database is available.

When using Postgres specifically, this is beneficial because Postgres scalability and availability are well-studied problems with robust solutions. For scalability, a single Postgres server can vertically scale to handle tens of thousands of workflows per second, and further scaling can be achieved by using sharded Postgres or a distributed Postgres-compatible database (e.g., CockroachDB). For availability, Postgres supports streaming replication with automatic failover, and managed offerings provide multi-AZ deployments with high-availability SLAs out of the box.
As a result, the decades of engineering work and research that have gone into...
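
The constraint-based duplicate detection described earlier in this post (workers detect duplicate work on checkpoint and back off) can be sketched as follows. This is only an illustration: the `step_outputs` table and `checkpoint_step` helper are hypothetical, and `sqlite3` stands in for Postgres, where the same pattern would surface as a unique-violation error on insert.

```python
import sqlite3

def make_db():
    # SQLite stands in for Postgres; the schema is a hypothetical example.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE step_outputs ("
        " workflow_id TEXT, step_index INTEGER,"
        " output TEXT, executor TEXT,"
        " PRIMARY KEY (workflow_id, step_index))"
    )
    return conn

def checkpoint_step(conn, workflow_id, step_index, output, executor):
    """Record a step output. Return False if another worker got there first."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO step_outputs VALUES (?, ?, ?, ?)",
                (workflow_id, step_index, output, executor),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a second checkpoint for the same step,
        # so this worker is doing duplicate work and should back off.
        return False

conn = make_db()
won = checkpoint_step(conn, "wf-1", 0, "charged card", "server-a")
lost = checkpoint_step(conn, "wf-1", 0, "charged card", "server-b")
print(won, lost)  # prints: True False
```

The database, not the workers, decides which checkpoint wins, so no extra coordination protocol between workers is needed in this sketch.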
