Coinbase outage postmortem: AWS cooling failure caused cascading breakdown

Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage - InfoQ

InfoQ Software Architects' Newsletter

A monthly overview of things you need to know as an architect or aspiring architect.

View an example

Enter your e-mail address

Select your country

Select a country

I consent to InfoQ.com handling my data as explained in this Privacy Notice.

We protect your privacy.

Helpful links

About InfoQ

InfoQ Editors

Write for InfoQ

About C4Media

Diversity

Choose your language

中文

日本

July25,2026

Online InfoQ AI Engineering Certification

Production AI calls on retrieval, agents, evals, and infrastructure, checked with peers. Register Now.

Aug13,2026

Online InfoQ Architect Certification

Distributed systems, decentralized decisions, platform engineering, and AI architecture. Register Now.

Nov16-20,2026

QCon San Francisco

What's working across AI, architecture, and leadership, from the teams doing it. Register. Early bird ends July 14.

Apr13-16,2027

QCon London

What early-adopter teams have proven in production, across 15 engineering tracks. Register. Early bird ends July 14.

InfoQ Homepage

News

Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage

DevOps

Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage

Jun 16, 2026

min read

Craig Risi

Write for InfoQ

Feed your curiosity. Help 550k+ global senior developers each month stay ahead.Get in touch

Listen to this article - 0:00

Audio ready to play

Your browser does not support the audio element.

0:00

Normal1.25x1.5x

Reading list

Coinbase has published a detailed postmortem of its May 7, 2026, outage, revealing how a localized cooling failure inside an AWS data center escalated into a multi-hour disruption that halted nearly all trading activity across the cryptocurrency exchange. While the initial incident originated from an AWS thermal event in a single availability zone, Coinbase's investigation found that architectural dependencies within its own systems, including a matching engine tightly coupled to the affected zone and cascading messaging infrastructure failures, significantly prolonged recovery efforts.

The outage began when multiple cooling units failed simultaneously in an AWS data hall within the US-East-1 region, forcing thermal shutdowns of affected racks and taking EC2 instances and EBS volumes offline. Coinbase customers were unable to buy, sell, deposit, withdraw, or transfer assets for several hours, while institutional clients experienced widespread disruption to order routing and exchange services. Full recovery took much of the following day, with trading restored incrementally through cancel-only and auction modes before normal operations resumed.

According to Coinbase, the most significant factor delaying recovery was the design of its exchange matching engine. To achieve the ultra-low latency required for high-frequency trading, the system operates as a Raft-based cluster within a single AWS Cluster Placement Group. This architecture intentionally collocates nodes to minimize network latency between consensus members. However, when the AWS outage took down three of the cluster's five nodes, the system lost quorum and could no longer process trades.

The company acknowledged that while the architecture optimized performance, it lacked an automated mechanism for failover to another availability zone. Recovery required emergency code changes, manual cluster reconstruction, and careful restoration of quorum before trading could safely resume. The incident exposed a classic engineering trade-off: optimizing for latency and performance can sometimes come at the expense of resilience during rare infrastructure failures.

Coinbase's postmortem also identified a separate issue involving its event-streaming infrastructure. Kafka workloads responsible for distributing operational data became stranded in the impaired availability zone, creating significant backlogs and delaying service restoration even after core trading systems were recovering. Engineers ultimately had to manually migrate partitions and rebalance workloads to restore normal data flow across the platform.

The combination of the matching-engine failure and messaging backlog transformed what began as a localized cloud infrastructure issue into a platform-wide outage. Coinbase noted that either issue independently would have been manageable, but together they created a recovery process far more complex than anticipated.

The outage has reignited discussion around cloud concentration risk and the operational realities of building critical financial services on hyperscale infrastructure. Although AWS regions are designed around multiple availability zones, the Coinbase incident demonstrates how applications can still develop hidden dependencies on specific locations, particularly when performance...

Coinbase outage postmortem: AWS cooling failure caused cascading breakdown

Related Articles

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

German ruling declares Google liable for false answers in AI Overviews

Britain Became as Poor as Mississippi