AI Datacenters Were Built for GPUs. What Happens When You Remove the GPUs?

AI Datacenters Were Built for GPUs — Almartis

The old model

For the past few decades, building a datacenter has been a well-understood, predictable exercise in utility engineering. You provisioned compute servers, attached storage arrays, and built a network to stitch them together. The objective was straightforward: maximize utilization while minimizing cost.

The dominant traffic pattern was fundamentally north-south (clients sending requests to servers, and servers responding, often after querying a database), with a smaller share of east-west traffic from servers to storage. The networks were built to handle bursty traffic, and if a packet dropped, standard TCP/IP would retransmit it. In web hosting or cloud services, a minor delay meant an image loaded slightly slower or a request completed a few milliseconds later. It was tolerable.

AI training changed that model completely.

The AI shift

In modern AI clusters, the network is no longer just infrastructure sitting beneath compute. It does more than transport data between machines: it directly determines accelerator utilization.

If you are training large deep learning models, you aren't dealing with independent servers. You are dealing with a massive, distributed supercomputer in which thousands of GPUs must continuously swap parameters. The dominant traffic pattern shifts almost entirely to east-west communication inside the cluster (server-to-server, GPU-to-GPU, and rack-to-rack). Instead of localized, bursty spikes, AI workloads run collective communication patterns such as all-to-all and all-reduce.
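To make all-reduce concrete, here is a minimal sketch of one common way to implement it, the ring algorithm, simulated in plain Python with lists standing in for GPU buffers. The function name `ring_all_reduce` and the toy restriction that the vector length divides evenly by the worker count are assumptions made for illustration. Real collective libraries are considerably more involved and may choose other algorithms.

```python
def ring_all_reduce(buffers):
    """Sum equal-length vectors held by n simulated workers arranged in a ring.

    Returns (per-worker results, number of elements each worker sent).
    Toy assumption: the vector length is divisible by the number of workers.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "toy version: vector length must divide evenly"
    chunk = size // n
    data = [list(b) for b in buffers]  # copy so the inputs are not modified
    sent = [0] * n

    def span(c):
        # Index range covered by chunk number c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n - 1 steps, worker r holds the
    # complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot what every worker sends so the step is "simultaneous".
        outgoing = []
        for r in range(n):
            c = (r - step) % n
            outgoing.append((c, [data[r][i] for i in span(c)]))
        for r in range(n):
            c, payload = outgoing[r]
            dst = (r + 1) % n
            for offset, i in enumerate(span(c)):
                data[dst][i] += payload[offset]
            sent[r] += chunk

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = []
        for r in range(n):
            c = (r + 1 - step) % n
            outgoing.append((c, [data[r][i] for i in span(c)]))
        for r in range(n):
            c, payload = outgoing[r]
            dst = (r + 1) % n
            for offset, i in enumerate(span(c)):
                data[dst][i] = payload[offset]
            sent[r] += chunk

    return data, sent
```

In this sketch every worker sends 2 × (n − 1) chunks, so each one pushes nearly twice the size of its buffer onto the network on every synchronization, and all workers do so at the same moment. That is the east-west pressure described above.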

Instead of millions of small independent flows, the network must carry a small number of extremely large elephant flows. During gradient synchronization phases, thousands of GPUs may simultaneously exchange data across the fabric, creating severe network incast and rapidly saturating switch buffers.
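A rough back-of-the-envelope calculation shows how quickly incast can fill a buffer. The sketch below assumes a deliberately simplified model: several senders transmit at full line rate toward a single egress port of the same speed, and the switch offers one shared buffer. The function name and every number passed to it are illustrative assumptions, not figures from any particular switch.

```python
def incast_fill_time_us(senders, link_gbps, buffer_mb):
    """Microseconds until a shared buffer fills when `senders` ports, each at
    line rate, all target one egress port running at the same line rate."""
    arrival_gbps = senders * link_gbps
    excess_gbps = arrival_gbps - link_gbps  # the egress port drains at line rate
    if excess_gbps <= 0:
        return float("inf")  # no oversubscription, the buffer never fills
    buffer_bits = buffer_mb * 8e6
    seconds = buffer_bits / (excess_gbps * 1e9)
    return seconds * 1e6


# Hypothetical inputs: 8 senders, 800 Gb/s links, 64 MB of shared buffer.
print(f"{incast_fill_time_us(8, 800, 64):.1f} microseconds until the buffer is full")
```

Under these assumed inputs the buffer fills in well under a millisecond, which is why buffering alone cannot absorb synchronized elephant flows.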

This shift broke many of the assumptions standard networking was built on. When a modern accelerator can consume and generate data at 800 Gb/s, the critical metric flips from average latency to Job Completion Time (JCT) and tail latency.

In deep learning training, workloads execute in tightly synchronized steps, meaning the entire workload progresses at the speed of the slowest participant.

One delayed packet can stall thousands of GPUs.
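A short sketch makes the arithmetic behind this visible. It assumes each worker is delayed independently with the same small probability per step, which is a simplification, and the names `step_time_ms` and `prob_step_delayed` are hypothetical.

```python
import random


def step_time_ms(n_workers, base_ms, p_delay, delay_ms, rng):
    """A synchronous step finishes only when the slowest worker finishes."""
    times = [
        base_ms + (delay_ms if rng.random() < p_delay else 0.0)
        for _ in range(n_workers)
    ]
    return max(times)


def prob_step_delayed(n_workers, p_delay):
    """Chance that at least one of n independent workers is delayed."""
    return 1.0 - (1.0 - p_delay) ** n_workers


# Assumed numbers: each worker has a 0.1% chance of a delay in any given step.
for n in (1, 64, 4096):
    print(n, round(prob_step_delayed(n, 0.001), 3))

rng = random.Random(0)
print(step_time_ms(4096, base_ms=100.0, p_delay=0.001, delay_ms=50.0, rng=rng))
```

With a single worker the delay is rare. With thousands of workers, some worker is delayed in almost every step, so the tail of the latency distribution effectively becomes the step time.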

Figure 1: Synchronized elephant flows causing switch buffer saturation.

RDMA & the PFC trap

Solving packet loss created a new problem: head-of-line blocking.

The sensitivity to packet delay is amplified by the transport layer AI clusters rely on. Modern distributed training heavily uses RDMA through RoCEv2 (RDMA over Converged Ethernet), allowing GPUs to bypass the CPU and operating system entirely for low-latency direct memory access across GPUs. But while RoCEv2 dramatically reduces overhead, it is also highly sensitive to packet loss. A single dropped packet can trigger retransmissions, timeout cascades, and synchronization delays across the cluster.

To avoid packet loss, standard RoCEv2 networks rely on Priority Flow Control (PFC). Conceptually, PFC acts like a pause mechanism: when switch buffers begin filling, the switch instructs upstream devices to temporarily stop transmitting traffic.

But this creates another problem: head-of-line blocking.

PFC solves packet loss by propagating congestion backward through the network. Under sustained load, this creates head-of-line blocking, where unrelated traffic becomes trapped behind congested flows. Congestion spreads across the fabric, queue depths increase, and entire sections of the network can become effectively synchronized around the slowest traffic path.
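Head-of-line blocking is easy to see in a toy model. The sketch below compares a single shared FIFO with one queue per destination, sending at most one packet per tick in both cases. It is a simplification under stated assumptions: the pause is modelled per destination, whereas real PFC pauses an entire priority class on a link, which is exactly why unrelated flows sharing that class get held back. The function names and the `paused` callback are hypothetical.

```python
from collections import deque


def drain_fifo(packets, paused, ticks):
    """Single shared FIFO: only the head packet may leave on each tick."""
    queue = deque(packets)
    delivered = []
    for t in range(ticks):
        if queue and not paused(queue[0], t):
            delivered.append(queue.popleft())
    return delivered


def drain_per_destination(packets, paused, ticks):
    """One queue per destination: a paused destination blocks only itself.
    At most one packet still leaves per tick, as in the FIFO case."""
    queues = {}
    for dst in packets:
        queues.setdefault(dst, deque()).append(dst)
    delivered = []
    for t in range(ticks):
        for dst, q in queues.items():
            if q and not paused(dst, t):
                delivered.append(q.popleft())
                break  # one packet per tick
    return delivered


# Packets are labelled by destination. "A" is congested and paused for 5 ticks.
packets = ["A", "B", "B", "B"]
paused = lambda dst, t: dst == "A" and t < 5

print(drain_fifo(packets, paused, ticks=4))             # []
print(drain_per_destination(packets, paused, ticks=4))  # ['B', 'B', 'B']
```

In the shared FIFO, the three packets for the uncongested destination never move because the paused packet sits at the head of the queue. With separate queues they drain immediately.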

In distributed training environments, this is expensive. The compute cluster cannot advance until every synchronization operation completes. GPUs remain idle while waiting for retransmitted packets or congested flows to clear.

InfiniBand & rail optimization

The incumbent answer: InfiniBand and Rail Optimization

To maximize GPU utilization, the industry's immediate answer was to throw hardware at the problem. NVIDIA capitalized on this by dominating the AI datacenter landscape with InfiniBand — a native lossless fabric designed specifically for high-throughput, low-latency clustering. Unlike conventional Ethernet deployments, InfiniBand was built around deterministic transport behavior, hardware congestion management, adaptive routing, and tightly controlled latency characteristics.

To scale these clusters, engineering teams have had to navigate three distinct network vectors:

Scale Up: Maximizing the high-speed interconnectivity within a single chassis or node (e.g., stitching 8 GPUs together using NVLink).

Scale Out: Expanding horizontally by connecting these multi-GPU nodes across an entire data hall using a dedicated backend network fabric.

Scale Across / DCI (Datacenter Interconnect): Linking entire clusters together...
