Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage - InfoQ
Jun 16, 2026
by Craig Risi
Coinbase has published a detailed postmortem of its May 7, 2026, outage, revealing how a localized cooling failure inside an AWS data center escalated into a multi-hour disruption that halted nearly all trading activity across the cryptocurrency exchange. While the initial incident originated from an AWS thermal event in a single availability zone, Coinbase's investigation found that architectural dependencies within its own systems, including a matching engine tightly coupled to the affected zone and cascading messaging infrastructure failures, significantly prolonged recovery efforts.
The outage began when multiple cooling units failed simultaneously in an AWS data hall within the us-east-1 region, forcing thermal shutdowns of affected racks and taking EC2 instances and EBS volumes offline. Coinbase customers were unable to buy, sell, deposit, withdraw, or transfer assets for several hours, while institutional clients experienced widespread disruption to order routing and exchange services. Full recovery took much of the following day, with trading restored incrementally through cancel-only and auction modes before normal operations resumed.
According to Coinbase, the most significant factor delaying recovery was the design of its exchange matching engine. To achieve the ultra-low latency required for high-frequency trading, the system operates as a Raft-based cluster within a single AWS Cluster Placement Group. This architecture intentionally co-locates nodes to minimize network latency between consensus members. However, when the AWS outage took down three of the cluster's five nodes, only two remained, short of the three-node majority Raft requires, so the system lost quorum and could no longer process trades.
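The quorum arithmetic behind that failure can be sketched in a few lines. This is a minimal illustration of Raft's majority rule, not Coinbase's code: the function names and the zone layouts in the usage note are hypothetical, and the sketch ignores the latency cost that led Coinbase to co-locate its nodes in the first place.

```python
from typing import Mapping


def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of a cluster with `cluster_size` voting members."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if the healthy nodes still form a majority of the full cluster."""
    return healthy_nodes >= quorum_size(cluster_size)


def survives_any_single_zone_loss(nodes_per_zone: Mapping[str, int]) -> bool:
    """True if losing any one zone entirely still leaves a majority.

    `nodes_per_zone` maps a (hypothetical) zone name to the number of
    voting members placed there.
    """
    total = sum(nodes_per_zone.values())
    return all(has_quorum(total, total - lost) for lost in nodes_per_zone.values())
```

With five voting members, `quorum_size(5)` is 3, so `has_quorum(5, 2)` is false, which matches the situation described above once three nodes were gone. Under the same rule, a hypothetical 2/2/1 spread across three zones would keep a majority after losing any one zone, whereas a 3/2 split or a single-zone layout would not.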
The company acknowledged that while the architecture optimized performance, it lacked an automated mechanism for failover to another availability zone. Recovery required emergency code changes, manual cluster reconstruction, and careful restoration of quorum before trading could safely resume. The incident exposed a classic engineering trade-off: optimizing for latency and performance can sometimes come at the expense of resilience during rare infrastructure failures.
Coinbase's postmortem also identified a separate issue involving its event-streaming infrastructure. Kafka workloads responsible for distributing operational data became stranded in the impaired availability zone, creating significant backlogs and delaying service restoration even as core trading systems were recovering. Engineers ultimately had to manually migrate partitions and rebalance workloads to restore normal data flow across the platform.
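One way to reason about this kind of exposure is to check, ahead of time, which partitions would have no replica left if a given zone disappeared. The sketch below is an assumption-laden illustration in plain Python: the article does not describe Coinbase's actual replica layout, and the function name, partition names, broker IDs, and zone names are all hypothetical. It uses no Kafka client library, only dictionaries that stand in for a replica assignment and a broker-to-zone mapping.

```python
from typing import List, Mapping, Sequence


def partitions_stranded_by_zone(
    replicas: Mapping[str, Sequence[int]],
    broker_zone: Mapping[int, str],
    failed_zone: str,
) -> List[str]:
    """Return partitions whose replicas all live in `failed_zone`.

    `replicas` maps a partition name to the broker IDs hosting its replicas;
    `broker_zone` maps each broker ID to its availability zone.
    Both are hypothetical inputs, not output from a real Kafka tool.
    """
    stranded = []
    for partition, brokers in replicas.items():
        # A partition survives if at least one replica sits outside the failed zone.
        surviving = [b for b in brokers if broker_zone[b] != failed_zone]
        if not surviving:
            stranded.append(partition)
    return sorted(stranded)


# Hypothetical layout: brokers 1 and 2 share a zone, broker 3 is elsewhere.
example_broker_zone = {1: "zone-a", 2: "zone-a", 3: "zone-b"}
example_replicas = {
    "orders-0": [1, 2],   # both replicas in zone-a
    "orders-1": [1, 3],   # spread across two zones
    "ledger-0": [2],      # single replica in zone-a
}
stranded = partitions_stranded_by_zone(example_replicas, example_broker_zone, "zone-a")
```

In this made-up layout, `stranded` contains `ledger-0` and `orders-0`, the two partitions with no replica outside `zone-a`. A check like this only flags placement risk; it says nothing about consumer backlog or how long a manual partition migration would take.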
The combination of the matching-engine failure and messaging backlog transformed what began as a localized cloud infrastructure issue into a platform-wide outage. Coinbase noted that either issue independently would have been manageable, but together they created a recovery process far more complex than anticipated.
The outage has reignited discussion around cloud concentration risk and the operational realities of building critical financial services on hyperscale infrastructure. Although AWS regions are designed around multiple availability zones, the Coinbase incident demonstrates how applications can still develop hidden dependencies on specific locations, particularly when performance...