Mantissa, a distributed workload orchestration system

artursapek1 pts0 comments

Init<br>Ten years ago, after the release of Docker Swarmkit (on which I worked on the Raft backed distributed datastore), I published a post about what could be next for container orchestration1. At the time, I described a vision for orchestration that moved away from centralized decision making and embraced a more distributed model. Since then, most container orchestration systems have kept relying on consensus algorithms to manage state and coordinate decisions across the cluster; going through a central control plane.

This choice was justified. The workloads these systems are designed to run are mission critical, and the control plane must be reliable. It cannot afford to make incorrect decisions based on stale values, accidentally move replicas around, or schedule workloads where they do not belong. Strong consistency solves a lot of problems, even if it comes at a cost.

The cluster contains nodes with different roles (Manager / Workers) and has to maintain a quorum to ensure consistency. This quorum, if lost, blocks any progress and could require manual intervention to recover. It could also become a bottleneck in terms of scalability since all cluster actions have to be acknowledged by the quorum before they are applied. Additionally, most orchestrators rely on third party key value stores that are embedding Paxos or Raft, making maintainance and upgrades all the more difficult. The amount of moving pieces involved in maintaining a consistent cluster state can be overwhelming.

Other approaches were proposed, such as the Omega Scheduler from Google2, intended to address some of the scaling limitations of a centralized control plane. Some of its ideas eventually found their way into Borg (according to the paper), but Omega itself never became the dominant orchestration model. This makes sense since it introduces a different set of trade-offs. For example, the flexibility gained through optimistic concurrency control and compare-and-swap semantics comes with a significantly more complex control plane logic. Handling conflicts, retries and concurrent schedulers is harder than programming against a centralized scheduler operating on strongly consistent/linearizable state.

Still, I liked the idea of a completely distributed control plane and wanted to experience how it could work and feel on a real cluster with real workload.

After leaving Docker in 2016, I started building a prototype. Instead of relying on a centralized data source on which schedulers would compete for resource slots, which was the premise of Omega, I explored a design based on Conflict-Free Replicated Data Types, or CRDTs3. Each node would own their state, replicated and merged by other nodes on submitted events. No dependencies on an external key value store or consensus algorithm. No centralized control plane and data source. This came with its own set of trade-offs: for example, replicating the whole cluster state on all nodes. Back then, I assumed this was an acceptable trade-off to simplify cluster management.

The choice of CRDTs was to remove the need for a consensus algorithm, and allow nodes to independently make scheduling decisions with little coordination. Consensus and CRDTs solve different problems however: Consensus provides agreement while CRDTs provide convergence. A CRDT-based system can tolerate a large number of failures and continue making progress without requiring a quorum, but it must also accept that different nodes may temporarily observe different versions of reality.

Eventually consistent systems come with their own set of challenges. Reconciliation is notoriously difficult to implement correctly. Preserving causality, handling concurrent updates and deciding when values are safe to expose to higher layers of the system all require careful engineering4. These are not impossible problems, but they are easy to underestimate.

The original prototype was fairly limited. It consisted of a slot scheduling algorithm and topology management plane. It was written in Rust, used Cap’n Proto RPC, and relied on CRDTs backed by a simple key-value store. The project never reached a stage where it could be considered usable. Then my priorities shifted, and I stopped working on it. In retrospect, I do not think the timing was right.

Kubernetes was already becoming the dominant orchestration platform. The ecosystem was growing rapidly and attracting contributions from some of the largest companies in the industry. If Docker SwarmKit, with a talented team and significant resources behind it, could not seriously challenge Kubernetes, there was little chance that a side project maintained by a single engineer would make much of a dent. If anything, there was, from my end, the constant regret that Swarmkit didn’t embrace that model, making container orchestration shockingly easy (it obviously did, to some extent) and capturing a segment of the market that Kubernetes sometimes struggles to satisfy.

Over the last few years, I...

cluster control plane orchestration centralized consensus

Related Articles