Taming production database migrationsTaming production database migrations
Disclosure: Neon paid me to write this article. The argument and opinions are mine.
Production migrations are scary. But you can do it safely, without downtime or crashing clients.
Here’s how.
The two boundaries
Three things are in play.
Database. Changed by migrations. A migration can change the schema, data, or both.
Server. Changed by code deploys. It consumes the schema, exposes the API, and owns the mapping.
Clients. Changed by user updates, reloads, installs, or not at all. They consume the API on their own schedule.
That gives you two boundaries.
Schema boundary. Database <> server. The shared surface is the schema.
API boundary. Server <> client. The shared surface is the API.
Atomicity
The dream is to avoid coordinating schema and code at all. Change the schema, deploy the code, and have the whole thing land as one all-or-nothing operation.
You can get close at the schema boundary because you control both moving parts. You run the migration. You deploy the server. If you can afford to stop traffic for the release, or if you trust your rollback path, that can be a perfectly reasonable tradeoff. Plenty of smaller apps live there and do fine.
You cannot get that at the API boundary because you do not control clients in the wild. A breaking API change breaks every mobile app, desktop app, web app, and CLI until they update. Those clients move when they move.
That is the limit. Atomicity can buy simplicity where you control both sides. Expand-and-contract works everywhere because it does not need atomicity.
Expand-and-contract
This is how you keep shipping with zero downtime, no broken clients, and no permanent compatibility cruft.
Expand-and-contract is a discipline where you change an interface in steps, so the provider and the consumer do not have to move at the same moment.
Instead of replacing the old shape with the new shape in one move, you keep the old shape working, add the new shape beside it, move consumers over, then remove the old shape.
At every step, every live consumer has a shape it understands.
From principle to example
Consider a production system with a database, a server exposing a REST API, and clients that move on different schedules. The web app reloads quickly. The mobile app updates slowly. The desktop app and CLI may sit on old versions for a long time.
Now imagine you need to rename users.bio to users.description without breaking API clients that still use bio.
Current state.
Database has users.bio
API exposes bio
Clients read and write bio
Target state.
Database has users.description
API exposes description
Clients read and write description
The objective is to land the rename with zero downtime, no maintenance window, no crashed clients, and no permanent compatibility cruft. The old shape should be gone at the end, not carried forever as a tax on every future change.
The same rename crosses two boundaries. The schema side moves on your schedule. The API side moves on the clients’ schedule.
At the schema boundary
At the schema boundary, the consumer is your own server. That is why this part can finish quickly.
Schema and data migrations
A migration can change the schema, the data, or both. Schema changes alter the shape the server sees. Data changes move rows into that shape.
The objective is to keep the schema non-breaking at all times.
Forward-only migrations
Prefer forward-only migrations. A down migration is often impossible and rarely tested well enough to trust under pressure. Instead, fix mistakes with another forward migration.
The rename at the schema boundary takes six moves. Each move is one logical change, a schema migration, a data migration, or a server deploy.
Current state
This is the baseline. Everything is simple and aligned.
state<br>API shape { bio: string }<br>DB schema { bio: string }
server mapping<br>read api.bio ← db.bio<br>write api.bio → db.bio<br>Move 1, schema expands
Run a schema migration that adds description as nullable.
state<br>API shape { bio: string } ← unchanged<br>DB schema { bio: string, description: string|null } ← sparse
server mapping<br>read api.bio ← db.bio<br>write api.bio → db.bio<br>Old server code still reads and writes bio, and that remains legal. The new column exists, but nothing depends on it yet.
Move 2, server follows
Deploy server code that starts dual-writing.
state<br>API shape { bio: string } ← unchanged<br>DB schema { bio: string, description: string|null } ← sparse
server mapping<br>read api.bio ← db.bio<br>write api.bio → db.bio, db.description<br>The server still reads from bio, so old rows are fine. New writes fill both columns.
Move 3, data catches up
Run a data migration that backfills description = bio where description is null.
state<br>API shape { bio: string } ← unchanged<br>DB schema { bio: string, description: string|null } ← backfilled, in sync
server mapping<br>read api.bio ← db.bio<br>write api.bio → db.bio, db.description<br>The...