AWS: Randomness instead of hierarchy in the data center | heise online
Data center networks to date have mostly been built hierarchically, as Clos or fat-tree architectures with leaf and spine layers or, in larger installations, an additional super-spine layer. This yields deterministic shortest paths, but it requires significantly more active components in the upper layers than would be needed if source and destination communicated directly. At the same time, load balancing should include an element of randomness so that large data flows do not overload individual links. Researchers at AWS have developed a new network architecture at the optical routing level, called Resilient Network Graph (RNG), to address these drawbacks.
Background to the Clos architecture
In Clos architectures, a packet travels from its switch/router up the hierarchy until it reaches a level that knows a path to the destination, and then back down the hierarchy to the destination. A Clos architecture therefore needs far more routers than just the leaves that connect the servers. The components in the upper levels (spines and super spines) are also more often oversubscribed than in a flat architecture in which the switches/routers serving the end devices are directly interconnected with one another. However, such direct interconnection has so far been considered unrealistic.
AWS goes further and describes randomly interconnecting the routers, with ad-hoc paths between them, as the optimum for distributing data across the network, a result researchers had already calculated in the 1990s. In practice, however, the approach failed because it was too computationally intensive and the cabling too complex. In such a flat network every router would be an equal peer, and an individual failure would affect only a small share of the communication relationships instead of creating hotspots in the upper hierarchy levels.
Hierarchical network (left) versus flat network with arbitrary interconnections (right).
(Image: Amazon)
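The idea of a flat, randomly wired network can be illustrated with a small sketch. It is not taken from the AWS paper: the function names, the pairing-and-retry construction, and the sizes (16 routers with 3 links each) are assumptions chosen only for illustration. The code wires every router to the same number of randomly chosen peers and then measures the average hop count between routers.

```python
import random
from collections import deque


def hop_counts(links, start):
    """Breadth-first search: hop distance from `start` to every reachable router."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def random_regular_graph(n, degree, rng):
    """Hypothetical helper: wire n routers so each has `degree` random peers.

    Uses simple random pairing of port "stubs" and retries whenever a
    self-loop, a duplicate link, or a disconnected graph comes out.
    """
    if (n * degree) % 2 or degree >= n:
        raise ValueError("n * degree must be even and degree < n")
    while True:
        stubs = [r for r in range(n) for _ in range(degree)]
        rng.shuffle(stubs)
        links = {r: set() for r in range(n)}
        valid = True
        for a, b in zip(stubs[::2], stubs[1::2]):
            if a == b or b in links[a]:
                valid = False
                break
            links[a].add(b)
            links[b].add(a)
        if valid and len(hop_counts(links, 0)) == n:
            return links


def mean_hops(links):
    """Average shortest-path length over all ordered router pairs."""
    n = len(links)
    total = sum(sum(hop_counts(links, s).values()) for s in links)
    return total / (n * (n - 1))


if __name__ == "__main__":
    fabric = random_regular_graph(16, 3, random.Random(1))
    print("links per router:", {len(peers) for peers in fabric.values()})
    print("mean hops:", round(mean_hops(fabric), 2))
```

Every router in this sketch is an equal peer with the same number of links, which mirrors the flat structure described above. Real deployments would additionally have to account for port counts, cabling, and failure handling.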
Solution Approach: Quasi-Randomness
As a solution, Amazon proposes a flat network design called Resilient Network Graph (RNG), which is based on passive optical elements called Shuffleboxes to achieve “quasi-randomness.” Each router is simply connected to any R-port (Router Port) of the Shufflebox. According to Amazon, the optical insertion loss of the Shuffleboxes is manageable with current optics.
Three server rooms (dashed squares), each containing two Shuffleboxes (trapezoids). On one side, each Shufflebox is connected to servers (yellow circles); on the other side, the Shuffleboxes are only connected to each other.
(Image: Amazon)
The design was first deployed in Dublin at the end of 2024 and has been the standard architecture in most new AWS data centers worldwide since April 2026. According to the research findings, it yields several savings: Amazon cites 69 percent fewer routers, up to 33 percent better throughput, and 40 percent lower energy consumption for network components. Overall, this results in total cost savings of between 9 and 45 percent, depending on the oversubscription ratio.
Spraying instead of deterministic routing
To enable multipath routing in this network, AWS has designed Spraypoint, a distributed link-state routing protocol. The source router “sprays” its data traffic randomly across all its neighbors. Each destination router has specific “waypoints” through which traffic is routed to it. The basic principle: each data packet leaving the source router goes to a randomly selected neighbor, the classic shortest-path algorithm then routes it to a waypoint, and the waypoint finally forwards it to the destination. The initial spraying is intended to prevent overload on individual paths. The complete paper and the blog post by the Amazon researchers are publicly available. Amazon did not provide information on the router manufacturers used.
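The three steps described above (random first hop, shortest path to a waypoint, final hop to the destination) can be sketched in a few lines. This is a toy model and not the Spraypoint protocol itself, which has not been published as code. The topology `GRAPH`, the table `WAYPOINTS`, the function names, and the choice of using the destination's direct neighbors as waypoints and picking the nearest one are all assumptions made for illustration.

```python
import random
from collections import deque

# Hypothetical six-router flat topology; every router has three links.
GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "E"},
    "C": {"A", "B", "F"},
    "D": {"A", "E", "F"},
    "E": {"B", "D", "F"},
    "F": {"C", "D", "E"},
}

# Assumption: the waypoints of a destination are its direct neighbors.
WAYPOINTS = {"F": ["C", "D", "E"]}


def shortest_path(graph, start, goal):
    """Breadth-first search returning one hop-minimal path as a list of routers."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph[node]):  # sorted for reproducible results
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    raise ValueError(f"no path from {start} to {goal}")


def spray_route(graph, src, dst, waypoints, rng):
    """Route one packet: random first hop, shortest path to a waypoint, then dst."""
    first_hop = rng.choice(sorted(graph[src]))
    if first_hop == dst:
        return [src, dst]
    # Assumption: the packet heads for the waypoint nearest to its first hop.
    candidates = [shortest_path(graph, first_hop, w) for w in waypoints[dst]]
    to_waypoint = min(candidates, key=len)
    last_leg = shortest_path(graph, to_waypoint[-1], dst)
    return [src] + to_waypoint + last_leg[1:]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(" -> ".join(spray_route(GRAPH, "A", "F", WAYPOINTS, rng)))
```

Because the first hop is drawn at random for each packet, consecutive packets of the same flow leave the source over different links, which is the load-spreading effect the article describes.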
Conclusion
Amazon offers an insightful look into its data center architecture. However, because the routing protocol is not yet publicly available and details of the Shufflebox construction are missing, RNG will for now remain an isolated solution within Amazon's own data centers. It will be interesting to see whether the approach also suits AI backend networks, which the paper already names as the next step for testing. It is also to be hoped that Amazon will publish the routing protocol.
(vbr)
This article was originally published in German.
It was translated with technical assistance and...