Elijah Glover - What I Learned Building 8 Tbps of CDN
Spain won UEFA Euro 2024 at 7am on a Monday in Australia. A national audience watching on phones mid-commute, on transit networks, with nowhere else to go if the stream broke. For five years I ran content delivery at Optus Sport. Two engineers in the core team, 8 Tbps of capacity, four CDNs behaving as one, and a load test of 1.5 million requests per minute that we ran on ourselves on purpose. When the stream is the only way to watch, the stream is the product.<br>Two engineers, four CDNs, 8 Tbps<br>I focused on making four CDNs behave like one, not picking a winner. Australia has a structural capacity shortage for streaming delivery, and on a World Cup night the answer should never be one CDN doing everything. The instinct is to find the “best” CDN; at scale that question stops being the right one.<br>I built central traffic routing and blocklist systems, normalised across our in-house build (EPYC CDN, named after the AMD EPYC silicon it ran on, with HAProxy and Varnish under the hood), Fastly, CloudFront and Akamai. One config surface, consistent behaviour, capacity drawn from wherever it sat. The audience was Australia-only: every byte landed on an Australian eyeball. EPYC CDN carried the bulk of traffic. The commercial CDNs gave us three independent ways to deliver the same stream to the same country when we needed them.<br>The lesson: at scale you stop picking a CDN and start running a content delivery portfolio.<br>CAPEX beats OPEX when you own the pipe<br>Finance made the CAPEX-over-OPEX call. The build I shipped under it earned a CTO commendation for cost efficiency. Spend on hardware once, sit it on a network we already ran, and a recurring CDN bill turns into capacity we own. Optus is a telco. They own the network. As a Tier 1 carrier with presence at every major interconnect and exchange where traffic flows in Australia, Optus peers inside the country, and that changes what’s economic to build in-country.
The target was quality of service first, and owning the input let us tune for it directly.<br>Quality of service (QoS) is what viewers feel, reported as stream starts, rebuffering, and bitrate held. Mux is how we measured it. EPYC CDN gave Optus Sport headroom on the numbers that matter at scale: faster starts, fewer rebuffers, steadier bitrate at the moments the entire audience is locked on the same frame. Sam Kerr’s goal in the World Cup semi. Italy winning the EURO 2020 final on penalties. Those are the seconds the platform is judged on.<br>What you build at that scale becomes brand-defining infrastructure. The FIFA Women’s World Cup 2023 was the catalyst for a new 400G Metro Core that the Networks team stood up behind the in-house EPYC CDN. The tournament drew a 1.2 million peak concurrent audience across the Seven and Optus co-broadcast and 11 million viewers for the semi across streaming and broadcast. The platform holding through every one of those moments is part of what those results rest on. The same network now sits under every workload that runs on top of it.<br>The lesson: return on investment isn’t only the balance sheet. Stream quality on the night and innovation across the org are returns still paying off today.<br>Building one and operating one are different disciplines<br>We were better at running Fastly, CloudFront, and Akamai because we built EPYC. Operating and building are different disciplines, and doing one sharpens the other. Designing the cache yourself (memory and NVMe tiers, sharding across nodes, and a mid-tier shield in front of origin) means you understand why architectures behave the way they do under load. You can debug by looking under the hood, not just at the surface.<br>Building also means understanding NICs, IRQ affinity, TCP tuning and NUMA domains.
The substrate decides what the cache can actually do under load, and once you have tuned it yourself the behaviour of any CDN stops being a black box.<br>Operating teaches breadth; building teaches depth. Knowing why a cache behaves the way it does under load is what lets you tell when a vendor is the right tool, when it isn’t, and how to hold them to it.<br>The lesson: build one and you stop taking the others on faith.<br>DDoS’d ourselves at 1.5M req/min on purpose<br>1.5 million requests a minute of sequential video segments, sustained against a single EPYC node: the kind of load that, to a normal engineer, reads as a DDoS in progress. I built the load test. We ran it on ourselves, on purpose. A platform that requires everything to be working in order to work is not a platform.<br>Multi-CDN, redundant capacity, origin shields, fallback paths. None of it is glamorous and none of it shows up in a feature list. It’s the difference between a night where people talk about the football and a night where they talk about the stream.<br>The network under the CDN was never static. Peering shifted, paths changed, capacity moved. We ran...