What I learned using AI to build a Kubernetes Operator for Supabase's Multigres


Writing a Kubernetes Operator in the Age of AI - Numtide

June 15, 2026, by Fernando Villalba

TL;DR

Treat the user-facing spec as the one thing that can’t drift. Everything else is cheap to refactor; the contract isn’t.

Don’t install AI frameworks. Read them, steal the ideas, and write your own skills instead.

Run the mechanical work (reviews, audits, commit messages, changelogs, doc checks) as a factory of fresh-context agents, each with one narrow job, orchestrated by processes you control. Share the agents with the team so development stays consistent.

When a skill lets something through, fix the skill. Bad outputs are defects in the line, not one-off noise.

Bug audits need design context loaded up front and a second agent to filter hallucinations, or you drown in false positives.

Tests and code from the same AI source share the same blind spots. Verify against real runtime behavior instead of obsessing over 100% code coverage — this is especially true on greenfield projects.

AI won’t tell you a bad idea is a bad idea. It’ll just build a polished version of it. Human judgment still owns every design call.

Six months into building a production Kubernetes operator for Multigres with AI, the code stopped being the hard part. Most of the effort went into design, hygiene, continuous improvement, and review. What I learned in 1,303 commits is that the best way to use AI is to build a factory floor of specialists you orchestrate yourself.

Understanding the domain

We were contracted to work closely with the Multigres team, led by Sugu Sougoumarane, the creator of Vitess, building a new distributed Postgres system. The operator’s job was to act as a provisioner of Multigres on Kubernetes, enabling users to use it simply and declaratively.

For a reader who hasn’t built a Kubernetes operator before, “an operator for a distributed database” is much more involved than it sounds. The operator provisions cells, shards, and connection pools, registers each piece with a topology service, and keeps the running cluster converging toward whatever the user declared in a YAML manifest.

The hard part is the state transitions. Scaling down can’t just delete a pod. If that pod is the primary Postgres, deleting it loses data. The operator has to coordinate with the replication state machine, demote the primary, wait for standbys to catch up, and unregister from the topology before Kubernetes can safely garbage-collect. Scale-up has its own dance that StatefulSets couldn’t carry, so we built our own pod management.

All of this happens through Kubernetes’ declarative model, where any piece can fail at any time and the operator’s job is to keep nudging the world toward the desired shape.
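To make the "keep nudging" idea concrete, here is a minimal sketch of a level-triggered reconcile loop in plain Go. It is not the operator's code and does not use controller-runtime. `ClusterState`, `World`, `Reconcile`, `Converge`, and `flakyWorld` are hypothetical names invented for illustration, and the cluster is reduced to a pod count.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterState is a hypothetical, heavily simplified view of a cluster:
// just the number of running database pods.
type ClusterState struct {
	Pods int
}

// World is a hypothetical interface over the running system. A real operator
// would talk to the Kubernetes API here.
type World interface {
	Observe() (ClusterState, error)
	AddPod() error
	RemovePod() error
}

// Reconcile compares desired and observed state and takes at most one step
// toward the desired state. It reports done=true once nothing is left to do.
// Any error is returned so the caller can simply try again later.
func Reconcile(w World, desired ClusterState) (done bool, err error) {
	observed, err := w.Observe()
	if err != nil {
		return false, err
	}
	switch {
	case observed.Pods < desired.Pods:
		return false, w.AddPod()
	case observed.Pods > desired.Pods:
		return false, w.RemovePod()
	default:
		return true, nil
	}
}

// flakyWorld is an in-memory World whose AddPod fails a fixed number of
// times, to show that a failed step is just retried on the next pass.
type flakyWorld struct {
	pods     int
	failures int
}

func (f *flakyWorld) Observe() (ClusterState, error) {
	return ClusterState{Pods: f.pods}, nil
}

func (f *flakyWorld) AddPod() error {
	if f.failures > 0 {
		f.failures--
		return errors.New("transient failure")
	}
	f.pods++
	return nil
}

func (f *flakyWorld) RemovePod() error {
	f.pods--
	return nil
}

// Converge calls Reconcile until it reports done or maxPasses is used up.
// It returns the number of passes taken and whether the state converged.
func Converge(w World, desired ClusterState, maxPasses int) (passes int, done bool) {
	for passes = 1; passes <= maxPasses; passes++ {
		d, err := Reconcile(w, desired)
		if err != nil {
			continue // a real controller would requeue with backoff
		}
		if d {
			return passes, true
		}
	}
	return maxPasses, false
}

func main() {
	w := &flakyWorld{pods: 1, failures: 1}
	passes, done := Converge(w, ClusterState{Pods: 3}, 10)
	fmt.Println(passes, done, w.pods) // prints: 4 true 3
}
```

Each pass re-observes the world instead of remembering what the previous pass did, so a failure at any point costs one retry rather than leaving the loop in an inconsistent state.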

Three things made the job harder than an average operator build:

Multigres was still in very early stages of development, which made it a moving target for us to design and implement the operator against.

It required relatively complex templating logic across multiple CRDs and component types.

Multigres, like Vitess, is middleware over an unsharded engine, which pushes a lot of database-layer orchestration into the operator itself rather than into the database.

On that last point: where does placement and rebalancing actually live?

In databases like CockroachDB, TiDB, and YugabyteDB, the answer is inside the database. Add a node and it automatically starts taking on data. Decommission one and the cluster moves the data off before you can safely remove it. The operator’s job on scale events is mostly to signal intent and wait.

Vitess and Multigres work differently, by design. They keep the engine (MySQL or Postgres) unmodified — stock, unsharded, with no concept of cluster membership. You get SQL compatibility, mature tooling, and shard isolation out of the box, which is the whole point. The trade-off is that resharding and decommissioning can’t live inside the database, so they have to be orchestrated externally. That pushes responsibility and complexity into the operator and forces a richer YAML spec — the price of a heavier control plane for a well-understood storage layer.

In practice this meant our operator had to drive Postgres-layer orchestration directly: checking whether a pod was a primary, coordinating demotion, waiting for synchronous standbys to catch up, unregistering from the topology service, and only then letting Kubernetes delete the pod.
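The ordering described above can be sketched as a guarded sequence in which deletion is the last step and is only reached once every earlier step has succeeded. This is an illustration under stated assumptions, not the operator's implementation: `PodDrainer`, its methods, `SafeScaleDown`, `ErrNotReady`, and `fakeDrainer` are all hypothetical names, and none of them come from Multigres or controller-runtime.

```go
package main

import (
	"errors"
	"fmt"
)

// PodDrainer is a hypothetical interface. Each method stands in for one of
// the steps described in the text; the names are invented for this sketch.
type PodDrainer interface {
	IsPrimary(pod string) (bool, error)
	DemotePrimary(pod string) error
	StandbysCaughtUp() (bool, error)
	UnregisterFromTopology(pod string) error
	DeletePod(pod string) error
}

// ErrNotReady means a precondition is not met yet. Nothing has been deleted,
// and the caller should try again later.
var ErrNotReady = errors.New("standbys not caught up yet")

// SafeScaleDown runs the steps in order and deletes the pod only after all
// earlier steps have succeeded. It is safe to call repeatedly.
func SafeScaleDown(d PodDrainer, pod string) error {
	primary, err := d.IsPrimary(pod)
	if err != nil {
		return fmt.Errorf("check role: %w", err)
	}
	if primary {
		if err := d.DemotePrimary(pod); err != nil {
			return fmt.Errorf("demote: %w", err)
		}
	}
	caughtUp, err := d.StandbysCaughtUp()
	if err != nil {
		return fmt.Errorf("check standbys: %w", err)
	}
	if !caughtUp {
		return ErrNotReady
	}
	if err := d.UnregisterFromTopology(pod); err != nil {
		return fmt.Errorf("unregister: %w", err)
	}
	return d.DeletePod(pod)
}

// fakeDrainer records the calls it receives so the ordering can be inspected.
type fakeDrainer struct {
	primary  bool
	caughtUp bool
	calls    []string
}

func (f *fakeDrainer) IsPrimary(string) (bool, error) {
	f.calls = append(f.calls, "IsPrimary")
	return f.primary, nil
}

func (f *fakeDrainer) DemotePrimary(string) error {
	f.calls = append(f.calls, "DemotePrimary")
	f.primary = false
	return nil
}

func (f *fakeDrainer) StandbysCaughtUp() (bool, error) {
	f.calls = append(f.calls, "StandbysCaughtUp")
	return f.caughtUp, nil
}

func (f *fakeDrainer) UnregisterFromTopology(string) error {
	f.calls = append(f.calls, "UnregisterFromTopology")
	return nil
}

func (f *fakeDrainer) DeletePod(string) error {
	f.calls = append(f.calls, "DeletePod")
	return nil
}

func main() {
	d := &fakeDrainer{primary: true, caughtUp: false}
	fmt.Println(SafeScaleDown(d, "pod-0")) // standbys lag: stops before delete
	d.caughtUp = true
	fmt.Println(SafeScaleDown(d, "pod-0")) // retry: completes and deletes
	fmt.Println(d.calls)
}
```

Because the function re-checks the pod's role on every call, a retry after a partial run skips the demotion that already happened and picks up at the standby check.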

Understanding the domain took a chunk of our time, and it was well spent. It involved reviewing the Multigres codebase and design documents as well as reviewing all the operator and controller-runtime features to see what we would need and how we would design the work.

Having AI as a companion at this stage was also very helpful because we could feed it the entire codebase and documentation and ask questions, no matter how dumb, about everything. It really made the research period much quicker and easier and it took a big load off the Multigres team...
