The principles of extreme fault tolerance

The principles of extreme fault tolerance — PlanetScaleIntroducing Database Traffic Control™: resource budgets for your Postgres query traffic.Learn more

Navigation Blog|Engineering Table of contents «Close »Table of contents Principles Architecture Processes Failure modesNon-query-path failures Cloud provider failures PlanetScale-induced failures

PlanetScale Postgres is the fastest way to run Postgres in the cloud. Plans start at just $5 per month. Learn more

Get the RSS feed

The principles of extreme fault tolerance Max Englander | July 3, 2025 PlanetScale is fast and reliable. Our speed is the best in the cloud due to our shared nothing architecture that enables us to utilize local storage instead of network-attached storage. Our fault tolerance is built on top of principles, processes, and architectures that are easy to understand, but require painstaking work to do well. We have talked about our speed a lot. Let's talk about why we are reliable. Principles Our principles are neither new nor radical. You may find them obvious. Even so, they are foundational for our fault tolerance. Every capability we add, and every optimization we make, is either bound by or born from these principles. Isolation Systems are made from parts that are as physically and logically independent as possible. Failures in one part do not cascade into failures in an independent part. Parts in the critical path have as few dependencies as possible. Redundancy Each part is copied multiple times, so if one part fails, its copies continue doing its work. Copies of each part are themselves isolated from each other. Static stability When something fails, continue operating with the last known good state. Overprovision so a failing part's work can be absorbed by its copies. Architecture Our architecture emerges from the principles above. Control plane Provides database management functionality. Database creation, billing, etc. Composed of parts which are redundant, spread across multiple cloud availability zones. Less critical than the data plane, and so has more dependencies. E.g. uses a PlanetScale database to store customer and database metadata. Data plane Stores database data and serves customer application queries. Composed of a query routing layer and database clusters. Each of these parts are both regionally and zonally redundant and isolated. The most critical plane, with fewer dependencies than the control plane. Does not depend on the control plane. Database clusters Composed of a primary instance and a minimum of two replicas. Each instance is composed of a VM and storage residing in the data plane. Instances evenly distributed across three availability zones. Automatic failovers from primaries to healthy replicas in response to failures. Customers may optionally run copies in read-only regions. Enterprise customers may optionally promote read-only regions to primary. Extremely critical. Extremely few dependences. Processes Within this architecture, we apply processes that reinforce our systems' overall fault tolerance. Always be Failing Over Very mature ability to fail over from a failing database primary to a healthy replica. Exercise this ability every week on every customer database as we ship changes. In the event of failing hardware or a network failure - fairly common in a big system running on the cloud - we automatically and aggressively fail over. Query buffering minimizes or eliminates disruption during failovers. Synchronous replication MySQL semi-sync replication, Postgres synchronous commits. Commits stored durably on at least one replica before primary sends acknowledgment to the client. Enables us to treat replicas as potential primaries, and fail over to them immediately as needed. Progressive delivery Data plane changes are shipped gradually to progressively critical environments. Database cluster config and binary changes are shipped database by database using feature flags Release channels allow us to ship changes to dev branches first, and to wait a week or more before shipping those same changes to production branches Minimizes the impact of our own mistakes on our customers. Failure modes How adherence to the principles, architecture, and processes above enable us to tolerate a variety of failure modes. Non-query-path failures Because our query path has extremely few dependencies, failures outside of the query path do not impact our customers' application queries. As an example, a hypothetical failure in one of our cloud providers' Docker registry services might impact our ability to create new database instances, but will not impact existing instances' ability to serve queries or store data. Likewise, failures, even total failure, of our control plane would impact our customer's ability to change their database cluster's settings, but would not impact that cluster's query service. Cloud provider failures We run on AWS and GCP, which can and do fail in many different ways. Instance If a failure...

The principles of extreme fault tolerance

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits