Goodbye, Leaf-and-Spine Networks?

Goodbye, Leaf-and-Spine Networks? « ipSpace.net blog

A friend of mine sent me links to a new paper published by AWS engineers, and an associated LinkedIn post which claims:

We got lean, resilient, massive aggregation fabrics that provide 33% better throughput with 69% fewer routers, savings 27% of costs, cutting power usage by 40%, and reducing CO2 emissions.

The obvious question one should ask after reading the hyperventilated Radical Network Redesign blog post is thus: is this the end of leaf-and-spine networks? Of course not. Let’s go into the details.

What exactly did they do? They rediscovered the way Plexxi tried to build data center fabrics. Instead of spine switches, Plexxi tried to connect leaf switches directly, first with CWDM (they were dreaming about dynamic leaf-to-leaf bandwidth), later with a prewired middlebox (what AWS engineers call ShuffleBox).

Obviously, you’d waste a lot of bandwidth that way, as there are always some leaf switches that do not exchange traffic even though they have a direct link. Plexxi solved that with unequal-cost multipathing (the traffic also uses longer paths, not just direct links); the AWS blog post calls that Routing through Randomness.
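To make the idea concrete, here is a minimal sketch of unequal-cost splitting. It assumes traffic is shared in inverse proportion to path cost (measured in hops); that weighting rule, the `ucmp_shares` name, and the path names are illustrative assumptions, not something Plexxi or AWS documented.

```python
def ucmp_shares(path_costs):
    """Split traffic across paths in inverse proportion to their cost.

    path_costs maps a path name to its cost (here: hop count).
    Returns a dict of path name -> fraction of traffic (fractions sum to 1).
    """
    inverse = {path: 1 / cost for path, cost in path_costs.items()}
    total = sum(inverse.values())
    return {path: weight / total for path, weight in inverse.items()}


# Hypothetical leaf pair: one direct link plus two detours via relay leaves.
shares = ucmp_shares({"direct": 1, "via_leaf_7": 2, "via_leaf_12": 2})
```

In this toy case the direct link carries half of the traffic and each two-hop detour carries a quarter, so the longer paths soak up load that equal-cost multipathing would never send their way.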

As anyone who has tried to understand LFA knows, unequal-cost multipathing only gets you so far. If you want further increases in link utilization, you need “proper” traffic engineering, which requires virtual circuits (and thus an extra layer of encapsulation). Whether you use MAC frames[1], MPLS, SRv6, or pigeons for that extra layer does not matter.

How could a prewired ShuffleBox be random? Yeah, that was the first major trigger of my bullshit meter. First, I thought they were using optical switches (which might turn out to be as expensive as traditional spine switches due to lower production volumes), but after reading the article, I got the impression they split the switch uplinks into individual lanes (for example, there are four 100GE lanes in a 400GE uplink port), and prewired the lane-to-lane matrix in the ShuffleBox, which makes it as random as the XKCD random number generator. It’s worth noting that Plexxi did exactly the same thing to get rid of CWDM costs, and that lane splitting is an ancient method we used more than a decade ago to build larger leaf-and-spine fabrics, making our lives miserable in the process (some details).
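The lane-splitting arithmetic is easy to sketch. The eight-uplink leaf below is a hypothetical example; only the four-100GE-lanes-per-400GE-port split comes from the text, and the `breakout` helper is my own naming.

```python
def breakout(uplink_ports, port_gbps, lanes_per_port):
    """Return (number of lane-level links, speed of each link in Gbps)."""
    return uplink_ports * lanes_per_port, port_gbps / lanes_per_port


# Hypothetical leaf switch with eight 400GE uplinks, each split into four lanes.
links, lane_gbps = breakout(uplink_ports=8, port_gbps=400, lanes_per_port=4)

# Total uplink bandwidth is unchanged; it is just spread over more, thinner links.
total_gbps = links * lane_gbps
```

Splitting the ports gives the leaf 32 direct neighbors instead of 8, but each neighbor sits behind a 100GE pipe rather than a 400GE one, and the wiring is as fixed as any other patch panel.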

They claim they used optimization methods to find the best partial mesh between N switches having D uplinks. The result is probably optimal (under some constraints) and might look random to a casual observer, but there’s nothing random in it. The arXiv paper correctly calls it a Quasi-Random Graph; that nuance is lost, for obvious reasons[2], in the blog posts and similar promotional material.
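As a rough illustration of a fixed partial mesh between N switches with D uplinks each, the sketch below wires a circulant graph and measures the average hop count with a breadth-first search. The circulant wiring is only a stand-in that is obviously deterministic; it is not the optimization method from the paper, and all names and numbers are assumptions.

```python
from collections import deque


def circulant_mesh(n, strides):
    """Wire switch i to switches i+s and i-s (mod n) for every stride s."""
    return {
        i: sorted({(i + s) % n for s in strides} | {(i - s) % n for s in strides})
        for i in range(n)
    }


def mean_hops(graph):
    """Average shortest-path length (in links) over all ordered switch pairs."""
    total = pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain breadth-first search from this switch
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


mesh = circulant_mesh(n=16, strides=(1, 4))  # N = 16 switches, D = 4 uplinks each
average = mean_hops(mesh)
```

For this toy wiring the average comes out at two hops, meaning most switch pairs have no direct link and their traffic has to be relayed through other leaves.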

Could they get better throughput than leaf-and-spine fabrics? In an apples-to-apples comparison, of course not. I explained that ages ago, but of course nobody reads old stuff, so let’s do another simple thought experiment:

You build a leaf-and-spine fabric with N:1 oversubscription on leaf switches – the total bandwidth of edge ports is N times higher than the total bandwidth of uplinks. N is usually set to three.

The spine layer (or the superspine fabric) has no oversubscription. The only congested resources are the leaf switch uplinks.

The traffic from any endpoint to any other endpoint in the leaf-and-spine fabric thus has to traverse exactly two leaf switch uplinks plus a non-oversubscribed fabric.

The traffic in the Plexxi or AWS solution might have to traverse more than two leaf switch uplinks (when they use other leaves as relay nodes).

In an environment with many small flows (to make load balancing work well), it’s thus IMPOSSIBLE to get better total throughput in a partial mesh than in a leaf-and-spine fabric with no core oversubscription, and it DOES NOT MATTER what the traffic profile is as long as the leaf switch uplinks are the congestion points. The details are left as an exercise for the curious reader.
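One way to work through that exercise is a back-of-the-envelope capacity count. The sketch below assumes perfect load balancing, full-duplex uplinks, and leaf uplinks as the only bottleneck; the function name, the 32-leaf fabric, and the 50/50 direct/relayed split are made-up illustration values.

```python
def throughput_bound_gbps(leaves, uplinks_per_leaf, uplink_gbps, avg_uplink_traversals):
    """Upper bound on inter-leaf throughput when leaf uplinks are the only bottleneck.

    Every full-duplex uplink offers uplink_gbps in each direction, so the fabric has
    2 * leaves * uplinks_per_leaf * uplink_gbps of directional uplink capacity, and
    each Gbps of traffic consumes avg_uplink_traversals Gbps of that capacity.
    """
    directional_capacity = 2 * leaves * uplinks_per_leaf * uplink_gbps
    return directional_capacity / avg_uplink_traversals


# Leaf-and-spine: every flow crosses exactly two leaf uplinks (source and destination).
leaf_spine = throughput_bound_gbps(32, 8, 100, avg_uplink_traversals=2)

# Partial mesh: half of the traffic uses a direct link (two uplink traversals),
# the other half is relayed through one extra leaf (four uplink traversals).
mesh = throughput_bound_gbps(32, 8, 100, avg_uplink_traversals=0.5 * 2 + 0.5 * 4)
```

With these numbers the leaf-and-spine bound is 25.6 Tbps, while the mesh bound drops to roughly 17 Tbps because relayed traffic consumes leaf uplink capacity twice. The mesh can only match the leaf-and-spine figure when every flow finds a direct link.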

But they claim they got better throughput in the arXiv paper! Yeah, I tried to figure that out, but the paper is a bit vague on the details. It looks like they used a simulation to generate the throughput graphs, but the source code is not available, so we can’t know exactly what they did[3]. Also, they compare their solution to fat trees without defining the parameters of the fat trees they’re using.

I could think of several relatively simple explanations for their results:

The spine layer (or the core fabric) in their fabric is oversubscribed[4].

The load balancing in their leaf-and-spine fabric is suboptimal (some uplinks are congested while the others are idle). There are multiple ways to solve this challenge before moving to packet spraying; Cisco ACI supposedly uses one of them, and I wrote several blog posts on the topic in case you’re interested in the details.

They use load balancing across virtual paths in their...
