You've Got (Too Much) Mail: Behind the Scenes of the 3/25/26 Voice Outage



Engineering & Developers


Discord Engineering

April 29, 2026

This article is a collaborative piece by Bo Ingram, Senior Staff Engineer on Realtime Infrastructure, and Stephen Birarda, Senior Staff Engineer on Audio/Video infrastructure.

The sad trombone haunts our dreams. Womp, womp, womp, woOoOoOoOmp. Womp, womp, womp, woOoOoOoOmp. Womp, womp, womp, woOoOoOoOmp. A cascading outage manifests in many ways: processes crashing, users reconnecting. As an on-call engineer, you often see it firsthand when the alert notifications reach your phone.

On March 25th, voice and video on Discord suffered major degradation from 12:13 PDT to 15:30 PDT. During this time, users were mostly unable to start or join calls and saw an “Awaiting Endpoint” message in their call status.

As part of a routine infrastructure change, a configuration update accidentally caused a large portion of Discord’s session management servers to shut down simultaneously. Sessions are the heartbeat of Discord’s real-time infrastructure: every connected device maintains one, and they coordinate nearly everything you see and hear in the app. Losing 17% of them at once sent a cascade of impacts through several downstream systems, ultimately overwhelming a service responsible for routing voice and video calls to the right servers around the world.

Since the incident, we’ve taken time to analyze our systems, understand why they degraded under the cascading load from our session outage, and determine how we can use what we learned to level up our infrastructure. In a distributed system, sudden load is a dangerous proposition. It hurtles through old bottlenecks and seeks out new ones. In this post, we’ll peek behind the curtain and see how one seemingly innocuous change overwhelmed a system multiple hops away, and how our not-fun afternoon helped us improve Discord.

The Buildup

Our Realtime Infrastructure team is in the midst of a Kubernetes migration for our Elixir services. It’s the blessed path for deploying services at Discord, and we’ve been gradually shifting our existing services to fit this model and join in the compounding leverage of all we’ve built for this platform.

The stateful Elixir systems at Discord power much of our backend. Each host runs thousands of in-memory stateful processes that drive critical features like servers (from this point, we’ll refer to servers by their internal name: “guilds”), presence, and calls. When we need to take a host offline, we must ensure that all processes running on that host have handed off their data to another node to avoid any interruption. To verify this property, our deployments monitor an entity count on each server and only terminate the pod or stop the server when this monitor reaches zero.
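The drain gate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Discord's actual deployment tooling: `wait_for_drain`, `FakeHost`, and every parameter name are hypothetical, and the real entity count would come from a monitoring system rather than an in-process callback.

```python
import time
from typing import Callable


def wait_for_drain(
    entity_count: Callable[[], int],
    timeout_s: float,
    poll_interval_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once entity_count() reaches zero, False if timeout_s elapses first."""
    deadline = clock() + timeout_s
    while True:
        # Only a host with zero remaining entities is safe to stop.
        if entity_count() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


class FakeHost:
    """Hypothetical host that hands off a fixed number of processes between polls."""

    def __init__(self, entities: int, handoff_per_poll: int) -> None:
        self.entities = entities
        self.handoff_per_poll = handoff_per_poll

    def entity_count(self) -> int:
        # Report the current count, then simulate handoff progress.
        current = self.entities
        self.entities = max(0, self.entities - self.handoff_per_poll)
        return current


if __name__ == "__main__":
    host = FakeHost(entities=10, handoff_per_poll=5)
    if wait_for_drain(host.entity_count, timeout_s=1.0):
        print("drained: safe to terminate")
    else:
        print("timed out: do not terminate")
```

The important property is the `False` branch: a caller that terminates the host anyway after a timeout turns a graceful handoff into an ungraceful stop.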

Before the incident, we were wrapping up the migration of our session management service. Our sessions service manages users’ sessions (we picked a good name); each device you’re connected on gets a session process in our cluster. If you’re connected to Discord on both web and desktop, you’ll have two sessions. If you somehow manage to get Discord running on your smart fridge, you’ll have a session for that too. All chat messages, all presence updates, anything we push to the client over the websocket, goes through your session. It’s very important!

We’d noticed that CPU utilization was running hotter than we’d like over the weekends, so we planned to tune the cluster to lower this metric. We decided to vertically scale our pods by increasing their CPU and memory while lowering the overall pod count proportionally. That way, we could test whether the higher scheduler utilization was a fixed overhead per pod or something that scaled with the number of sessions.

We prepared a PR to change the resources and pod counts and began deploying it.
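To see why this experiment can distinguish the two hypotheses, here is a rough sketch using a deliberately simple linear cost model. The model, the function names, and every number below are illustrative assumptions rather than measurements from Discord's cluster.

```python
import math


def rescale(pods: int, cpu_per_pod: float, factor: float) -> tuple[int, float]:
    """Grow each pod by `factor` and shrink the pod count by the same factor."""
    return math.ceil(pods / factor), cpu_per_pod * factor


def cluster_cpu_demand(
    pods: int,
    sessions: int,
    fixed_overhead_per_pod: float,
    cost_per_session: float,
) -> float:
    """Assumed model: demand = per-pod overhead + per-session cost."""
    return pods * fixed_overhead_per_pod + sessions * cost_per_session


if __name__ == "__main__":
    sessions = 1_000_000  # hypothetical session count
    before_pods, cpu = 100, 4.0  # hypothetical starting shape
    after_pods, _ = rescale(before_pods, cpu, factor=2.0)

    # Hypothesis A: cost is dominated by a fixed overhead per pod.
    a_before = cluster_cpu_demand(before_pods, sessions, 0.5, 0.0)
    a_after = cluster_cpu_demand(after_pods, sessions, 0.5, 0.0)

    # Hypothesis B: cost scales only with the number of sessions.
    b_before = cluster_cpu_demand(before_pods, sessions, 0.0, 0.0001)
    b_after = cluster_cpu_demand(after_pods, sessions, 0.0, 0.0001)

    print("fixed overhead: demand drops ->", a_before, a_after)
    print("per-session:    demand flat  ->", b_before, b_after)
```

Under this model, total demand falls after the resize only if a fixed per-pod overhead exists, which is the signal the resize was meant to surface.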

Dropping Sessions

These changes were deployed to our first zone at 12:13 PDT. As Kubernetes applied our changes, it terminated 50% of the pods in that zone due to our decreased replica count. As a backstop, the service attempts to hand off its processes upon receiving a signal from Kubernetes, but a safety check designed to wait for other events in progress to complete meant the termination grace period in Kubernetes elapsed before handoffs could begin. Since the sessions service runs in three equally balanced zones, 17% of sessions across Discord were ungracefully stopped.

Our Elixir systems are powered by GenServer processes, a generic server process in Elixir. Across all of our services, there are millions of these processes...
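The 17% figure and the grace-period race described at the start of this section can both be checked with a short sketch. It assumes sessions are spread evenly across pods within a zone, and the timing values are invented for illustration; only the three zones and the 50% pod reduction come from the incident description. `grace_period_s` stands in for the pod's termination grace period.

```python
from fractions import Fraction


def fraction_of_sessions_lost(zones: int, pods_terminated_in_zone: Fraction) -> Fraction:
    """Share of all sessions lost when one of `zones` equal zones loses some of its pods."""
    return pods_terminated_in_zone / zones


def handoff_outcome(safety_wait_s: float, handoff_s: float, grace_period_s: float) -> str:
    """Classify a shutdown where handoff only starts after a safety wait finishes."""
    if safety_wait_s >= grace_period_s:
        # The grace period runs out while the pod is still waiting to begin.
        return "killed before handoff began"
    if safety_wait_s + handoff_s > grace_period_s:
        return "killed mid-handoff"
    return "handed off"


if __name__ == "__main__":
    lost = fraction_of_sessions_lost(zones=3, pods_terminated_in_zone=Fraction(1, 2))
    print(f"{lost} of sessions, about {float(lost):.0%}")

    # Hypothetical timings: the safety wait alone exceeds the grace period.
    print(handoff_outcome(safety_wait_s=40, handoff_s=10, grace_period_s=30))
```

Half of one zone out of three is one sixth of all sessions, which rounds to the 17% reported above.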
