Scaling Redis Pub/Sub to Millions of Channels and Hundreds of Subscriber Nodes | Centrifugo
Redis Pub/Sub is a popular choice for passing messages between nodes in real-time messaging systems. It lets a system run many nodes, each holding many real-time client connections, and deliver each message to the nodes that have interested subscribers.
At small scale things are quite simple. But at millions of channels and hundreds of subscriber nodes — the scale some Centrifugo users run at — Pub/Sub stops being a simple part of the design: a single Redis instance is limited by one CPU core, and switching to Redis Cluster can make throughput worse. This post walks through those gotchas, and the techniques that get from a single instance's ceiling to millions of messages per second — across isolated Redis shards and specifically in a Redis Cluster.
Jump to the end for TLDR
Applies to Valkey too: this post talks about Redis, but everything here applies equally to Valkey.
For a general-purpose real-time messaging server, every deployment looks different, so the design has to handle many cases. Two facts about real-time messaging shaped the decisions in this post:
The system can have a lot of active channels — millions of them. There might be one per user, per document, or per game session, each created and thrown away all the time, so the server is constantly subscribing and unsubscribing. Usually each channel carries fairly light traffic (up to 60-100 messages per second); the load is spread across many channels rather than concentrated in a few.
There can be a lot of subscriber nodes too. Any node might care about any channel at any moment, so each one subscribes to whatever its clients need and has to receive everything published there. Once a deployment grows to hundreds of nodes, anything that costs something per node adds up very fast. In Centrifugo's case, one node usually serves up to 100k-200k connections — so setups which aim to have millions of real-time connections end up with hundreds of connection nodes, each subscribed to the channels its users care about.
Keeping up with Redis
Redis Pub/Sub is intentionally simple: a client subscribes to a named channel, and a publish to that channel reaches whoever is subscribed at that exact moment. Nothing is stored and nothing is retried, so a subscriber that wasn't connected when a message went out never sees it — delivery is at most once.
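The at-most-once behaviour described above can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not Redis code: the `broker` type and its methods are hypothetical names, and the buffer size of 16 is an arbitrary assumption. Like the Redis PUBLISH command, `publish` returns the number of subscribers that received the message.

```go
package main

import (
	"fmt"
	"sync"
)

// broker is a toy stand-in for Pub/Sub: it stores nothing and retries nothing.
type broker struct {
	mu   sync.Mutex
	subs map[string][]chan string
}

func newBroker() *broker {
	return &broker{subs: make(map[string][]chan string)}
}

// subscribe registers a new subscriber for a channel name.
func (b *broker) subscribe(channel string) <-chan string {
	ch := make(chan string, 16) // assumed buffer size, for illustration only
	b.mu.Lock()
	b.subs[channel] = append(b.subs[channel], ch)
	b.mu.Unlock()
	return ch
}

// publish delivers msg to whoever is subscribed right now and reports
// how many subscribers received it.
func (b *broker) publish(channel, msg string) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	delivered := 0
	for _, ch := range b.subs[channel] {
		select {
		case ch <- msg:
			delivered++
		default:
			// Toy behaviour: a full buffer loses the message for this subscriber.
		}
	}
	return delivered
}

func main() {
	b := newBroker()
	fmt.Println(b.publish("news", "early")) // nobody is subscribed yet
	sub := b.subscribe("news")
	fmt.Println(b.publish("news", "on time"))
	fmt.Println(<-sub) // the "early" message is never seen
}
```

The subscriber that joins after the first publish only ever receives the second message, which is the property the rest of the design has to live with.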
[Figure: Baseline. Publishers send PUBLISH to Redis Pub/Sub, which fans each message out to node 1, node 2, and node 3.]
Redis runs every command (in this case PUBLISH and SUBSCRIBE) on a single thread sequentially. Modern Redis and Valkey can spread network I/O across extra threads, which lifts throughput. Also, an application may issue subscriptions to Redis replicas to spread the load a bit. However, command execution stays on the Redis master and remains serialized, so one instance still has a performance ceiling you can hit.
Whatever Redis setup you run, your app has to keep up with Redis first — otherwise it's the first bottleneck, before Redis itself is even reached. With Redis Pub/Sub, each application node — also called a subscriber node here — does two things: it publishes messages and subscribes to receive them.
Efficient publishing
Centrifugo publishes over a pipelined connection instead of the connection-pool approach, where each command takes its own network round trip. The rueidis client gathers commands issued close together and writes them as a single batch, with a small flush delay (MaxFlushDelay, about 100µs) marking a batch boundary. One round trip then carries many publishes, so the publisher feeds Redis instead of stalling on the wire.
Pipelining lifts overall throughput and, perhaps surprisingly, cuts CPU usage on both the client and Redis sides because it reduces READ/WRITE syscalls. These gains were shown in detail in the earlier post "Improving Centrifugo Redis Engine throughput and allocation efficiency with Rueidis Go library".
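The batching idea behind a pipelined connection can be sketched with the standard library alone. This is not how rueidis is implemented internally. It is a minimal illustration, assuming a hypothetical `collectBatches` helper, a hypothetical `maxBatch` cap, and a `flush` callback that stands in for a single network write. The 100µs delay mirrors the MaxFlushDelay value mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// collectBatches drains cmds and groups commands that arrive close together.
// A batch ends when maxFlushDelay has passed since its first command, when
// maxBatch commands have been gathered, or when cmds is closed.
func collectBatches(cmds <-chan string, maxFlushDelay time.Duration, maxBatch int, flush func([]string)) {
	for first := range cmds {
		batch := []string{first}
		timer := time.NewTimer(maxFlushDelay)
	gather:
		for len(batch) < maxBatch {
			select {
			case cmd, ok := <-cmds:
				if !ok {
					break gather // input closed: flush what we have
				}
				batch = append(batch, cmd)
			case <-timer.C:
				break gather // flush delay elapsed: batch boundary
			}
		}
		timer.Stop()
		flush(batch) // stands in for one write carrying the whole batch
	}
}

func main() {
	cmds := make(chan string, 8)
	for i := 0; i < 8; i++ {
		cmds <- fmt.Sprintf("PUBLISH chan:%d hello", i)
	}
	close(cmds)
	collectBatches(cmds, 100*time.Microsecond, 4, func(batch []string) {
		fmt.Printf("flushing %d commands in one write\n", len(batch))
	})
}
```

The trade-off is a small amount of added latency (bounded by the flush delay) in exchange for far fewer writes per published message.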
Efficient subscribing
The SUBSCRIBE command supports subscribing to many channels at once. When many clients reconnect at once or application nodes resubscribe after a network glitch, a few large commands with batched channels are far more efficient than one command per channel, across the client, protocol, and Redis layers.
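To make the batching concrete, here is a small sketch of splitting a list of channels into groups and rendering each group as one SUBSCRIBE command in the RESP wire format. A real client library does the encoding itself. The helper names `chunkChannels` and `encodeSubscribe` and the chunk size of 2 are assumptions made for illustration, and a production batch size would be much larger.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chunkChannels splits channel names into groups of at most size names.
func chunkChannels(channels []string, size int) [][]string {
	var chunks [][]string
	for size > 0 && len(channels) > 0 {
		n := size
		if len(channels) < n {
			n = len(channels)
		}
		chunks = append(chunks, channels[:n])
		channels = channels[n:]
	}
	return chunks
}

// encodeSubscribe renders one SUBSCRIBE command as a RESP array of bulk
// strings: the command name followed by every channel in the batch.
func encodeSubscribe(channels []string) string {
	var sb strings.Builder
	sb.WriteString("*" + strconv.Itoa(len(channels)+1) + "\r\n")
	sb.WriteString("$9\r\nSUBSCRIBE\r\n")
	for _, ch := range channels {
		sb.WriteString("$" + strconv.Itoa(len(ch)) + "\r\n" + ch + "\r\n")
	}
	return sb.String()
}

func main() {
	channels := []string{"user:1", "user:2", "user:3", "doc:42", "game:7"}
	for _, chunk := range chunkChannels(channels, 2) {
		fmt.Printf("%q\n", encodeSubscribe(chunk))
	}
}
```

Five channels become three commands here instead of five, and the saving grows with the batch size.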
Redis keeps a buffer for each Pub/Sub connection, and if the reader doesn't drain it fast enough, Redis drops the connection. So the application has to drain the socket as fast as it can and delegate processing to a dedicated pool of workers. Because Pub/Sub delivery is at most once, the workers have some freedom in how they handle overload, including dropping messages at the application level.
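One possible shape for this hand-off is sketched below. It is not Centrifugo's actual implementation. The `dispatcher` type, the queue sizes, and the choice to route by channel hash (so that messages of one channel stay in order on one worker) are all assumptions made for illustration. The key property is that `offer`, called from the socket-reading loop, never blocks: when a worker queue is full, the message is dropped and counted.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"sync/atomic"
)

type message struct {
	channel string
	payload string
}

// dispatcher hands messages from the reader loop to a fixed pool of workers.
type dispatcher struct {
	dropped int64 // accessed atomically
	queues  []chan message
	wg      sync.WaitGroup
}

func newDispatcher(workers, queueSize int) *dispatcher {
	d := &dispatcher{queues: make([]chan message, workers)}
	for i := range d.queues {
		d.queues[i] = make(chan message, queueSize)
	}
	return d
}

// offer never blocks, so the socket reader can keep draining the connection.
// It returns false when the target queue is full and the message is dropped.
func (d *dispatcher) offer(m message) bool {
	h := fnv.New32a()
	h.Write([]byte(m.channel))
	q := d.queues[h.Sum32()%uint32(len(d.queues))]
	select {
	case q <- m:
		return true
	default:
		atomic.AddInt64(&d.dropped, 1)
		return false
	}
}

// start launches one worker goroutine per queue.
func (d *dispatcher) start(handle func(message)) {
	for _, q := range d.queues {
		d.wg.Add(1)
		go func(q chan message) {
			defer d.wg.Done()
			for m := range q {
				handle(m)
			}
		}(q)
	}
}

// stop closes the queues and waits for workers to finish queued messages.
func (d *dispatcher) stop() {
	for _, q := range d.queues {
		close(q)
	}
	d.wg.Wait()
}

func (d *dispatcher) droppedCount() int64 {
	return atomic.LoadInt64(&d.dropped)
}

func main() {
	d := newDispatcher(4, 128)
	d.start(func(m message) { fmt.Println(m.channel, m.payload) })
	// A real reader loop would call offer for every message read from the socket.
	d.offer(message{channel: "news", payload: "hello"})
	d.stop()
	fmt.Println("dropped:", d.droppedCount())
}
```

Exposing the dropped counter as a metric makes overload visible instead of silent.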
It's worth tracking an application metric for the end-to-end lag of a message sent through Pub/Sub — record the publication timestamp on the publisher side and subtract it from the time the message is received on the subscriber side. In most cases, workers...