Can Postgres Queues Scale?

Making Postgres Queues Scale | DBOS

Qian Li

Peter Kraft

June 2, 2026

How To

The hardest part of building a Postgres-backed durable workflows library is scaling Postgres-backed task queues. At scale, thousands of workers may be polling the same database table at the same time, exposing subtle performance issues deep inside the database. Many of these bottlenecks remain invisible until you hit truly large workloads.

In this post, we'll share the lessons we learned from scaling Postgres-backed durable queues for users running tens of billions of workflows per month across thousands of servers.

Lesson 1: (Re-)Discovering SKIP LOCKED

To make Postgres-backed queues work at all, the first problem we had to solve was contention between multiple workers dequeueing the same workflows. At a high level, Postgres-backed queues work like this: clients enqueue workflows by adding them to a queues table, and workers dequeue and process the oldest enqueued workflows (assuming a FIFO queue). Naively, each worker runs a plain SELECT, ordered by enqueue time and limited to N rows, to find the N oldest enqueued workflows, then dequeues them.

As soon as multiple workers run this query concurrently, contention arises. Every worker sees the same oldest queued workflows and attempts to dequeue them at the same time. But each workflow can only be dequeued by a single worker, so most workers will fail to find new work and have to try again. At a large scale, this contention creates a bottleneck in the system, limiting how rapidly tasks can be dequeued.
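The contention described above can be reproduced in a small sketch. The table and column names (`workflow_queue`, `status`, `worker_id`, `created_at`) are assumptions made for illustration, not the library's actual schema, and the sketch uses Python's built-in `sqlite3` with a single connection only so that it runs without a server. The shape of the naive queries is what matters: an unlocked SELECT followed by a conditional UPDATE.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE workflow_queue (
    workflow_id INTEGER PRIMARY KEY,
    status      TEXT NOT NULL DEFAULT 'ENQUEUED',
    worker_id   TEXT,
    created_at  INTEGER NOT NULL
)
"""

# Naive dequeue, step 1: find the N oldest enqueued workflows (no locking).
SELECT_OLDEST = """
SELECT workflow_id FROM workflow_queue
WHERE status = 'ENQUEUED'
ORDER BY created_at
LIMIT ?
"""

# Naive dequeue, step 2: claim a candidate only if it is still enqueued.
CLAIM = """
UPDATE workflow_queue
SET status = 'PENDING', worker_id = ?
WHERE workflow_id = ? AND status = 'ENQUEUED'
"""


def find_candidates(conn, n):
    """Return the ids of the n oldest enqueued workflows."""
    return [row[0] for row in conn.execute(SELECT_OLDEST, (n,))]


def claim(conn, worker_id, candidates):
    """Try to claim each candidate; return the ids this worker actually got."""
    claimed = []
    for workflow_id in candidates:
        cur = conn.execute(CLAIM, (worker_id, workflow_id))
        if cur.rowcount == 1:  # 0 means another worker got there first
            claimed.append(workflow_id)
    return claimed


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO workflow_queue (workflow_id, created_at) VALUES (?, ?)",
    [(i, i) for i in range(1, 7)],
)

# Simulated interleaving: both workers read before either one claims,
# so they see the same candidate rows.
seen_by_a = find_candidates(conn, 3)
seen_by_b = find_candidates(conn, 3)
claimed_by_a = claim(conn, "worker-a", seen_by_a)
claimed_by_b = claim(conn, "worker-b", seen_by_b)
```

Here both workers see workflows 1, 2, and 3, but only worker A claims them. Worker B comes away with nothing and has to poll again, even though workflows 4 through 6 are still waiting.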

Fortunately, Postgres provides the primitive required to solve this problem: locking clauses. The one that matters here is FOR UPDATE SKIP LOCKED, appended to the SELECT that finds the oldest enqueued workflows.

Selecting rows with FOR UPDATE SKIP LOCKED does two things. First, it locks the rows so that other workers cannot also select them. Second, it skips rows that are already locked, selecting not the N oldest enqueued workflows, but the N oldest enqueued workflows that are not already locked by another worker. That way, many workers can concurrently pull new workflows without contention. One worker selects the oldest N workflows and locks them, the second worker selects the next oldest N workflows and locks those, and so on.
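As a rough illustration of the two behaviors described above, the sketch below pairs one possible Postgres dequeue query with a toy in-memory model of its semantics. The table and column names are hypothetical, the `%(name)s` placeholders assume the psycopg parameter style, and `select_skip_locked` is a simplified model written for this example, not how Postgres implements row locks.

```python
# One possible Postgres dequeue query using a locking clause.
# Table and column names are hypothetical.
DEQUEUE_SQL = """
UPDATE workflow_queue
SET status = 'PENDING', worker_id = %(worker_id)s
WHERE workflow_id IN (
    SELECT workflow_id
    FROM workflow_queue
    WHERE status = 'ENQUEUED'
    ORDER BY created_at
    LIMIT %(n)s
    FOR UPDATE SKIP LOCKED
)
RETURNING workflow_id
"""


def select_skip_locked(queue, locked, n):
    """Toy model of SKIP LOCKED: take the n oldest ids nobody has locked.

    `queue` is a list of workflow ids ordered oldest first; `locked` is the
    set of ids currently row-locked by in-flight transactions.
    """
    picked = [wid for wid in queue if wid not in locked][:n]
    locked.update(picked)  # the selecting transaction now holds these locks
    return picked


queue = [1, 2, 3, 4, 5, 6]
locked = set()
batch_a = select_skip_locked(queue, locked, 2)  # worker A
batch_b = select_skip_locked(queue, locked, 2)  # worker B, while A is in flight
```

In this model worker A receives workflows 1 and 2, and worker B skips past them to receive 3 and 4 instead of waiting or failing.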

Locking clauses make Postgres-backed queues possible. SKIP LOCKED is one of those old Postgres tricks that keeps getting rediscovered. Without locking clauses, contention between workers prevents scaling beyond ~100 workflows per second. With them, Postgres can scale much further, but getting there requires more optimizations.

Lesson 2: Mind the Transaction Isolation Levels

While locking clauses improved performance dramatically, we soon reached another contention-related bottleneck: at scale, dequeue operations would frequently fail with Postgres “Serialization Failure” exceptions and need to be retried. When processing more than ~1000 workflows per second, the majority of dequeue operations encountered serialization failures, creating a performance bottleneck.

The culprit turned out to be Postgres transaction isolation levels. The dequeue transaction originally ran at REPEATABLE READ so we could support global queue limits like "run at most N workflows concurrently across all workers." Enforcing those global limits requires workers to share a globally consistent view of queue state. In Postgres, REPEATABLE READ guarantees that a transaction operates on a fixed “snapshot” of the database as it was when the transaction started, and does not “see” the effects of concurrent transactions that complete while it is running.

The problem is that REPEATABLE READ becomes expensive at high concurrency. If multiple workers concurrently modify the same rows, Postgres lets the first one commit and aborts the others with a serialization failure. At scale, workers spent more time retrying transactions than processing workflows.

The key realization was that the largest queues almost never used global flow control. At very large scale, users typically rely on local limits instead, such as "run at most 10 workflows per worker", which do not require cross-worker coordination.

So we made the isolation level conditional on whether the queue uses global flow control.
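A minimal sketch of that conditional choice follows. The `QueueConfig` class and its field names are hypothetical, introduced only for this example, and the code is an illustration rather than the library's actual implementation. It relies only on the standard Postgres `BEGIN ISOLATION LEVEL ...` syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueConfig:
    # Hypothetical configuration; field names are illustrative assumptions.
    name: str
    global_concurrency: Optional[int] = None  # limit across all workers
    worker_concurrency: Optional[int] = None  # limit per worker


def isolation_level_for(queue: QueueConfig) -> str:
    """Pick the isolation level for the dequeue transaction.

    Only queues with a global limit need a consistent snapshot of queue
    state, so only they pay for REPEATABLE READ.
    """
    if queue.global_concurrency is not None:
        return "REPEATABLE READ"
    return "READ COMMITTED"


def begin_statement(queue: QueueConfig) -> str:
    """Build the statement that opens the dequeue transaction."""
    return f"BEGIN ISOLATION LEVEL {isolation_level_for(queue)}"


limited = QueueConfig("billing", global_concurrency=100)
local_only = QueueConfig("ingest", worker_concurrency=10)
```

With these example queues, `begin_statement(limited)` yields a REPEATABLE READ transaction, while `begin_statement(local_only)` yields a READ COMMITTED one.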

Queues with global flow control continue using REPEATABLE READ, while queues without it use READ COMMITTED, which eliminates serialization failures entirely and dramatically improves throughput.

Lesson 3: Indexes Aren't Free

With both locking clauses and a lower isolation level, contention across workers virtually disappeared, even with thousands of workers. However, when running more than ~8000 workflows per second, we saw a new bottleneck: high CPU usage. The CPU burn came from two seemingly unrelated places: the dequeue query itself and Postgres auto-vacuum. Eventually, we found both...
