Frontier Inference Clusters

Marius771 pts0 comments

EtchedSkip to content

Etched

Frontier Inference Clusters

We're building a new category of AI hardware: frontier inference clusters.<br>We co-design chips, racks, software, and manufacturing methods so frontier models can run with best-in-class throughput, latency, cost, and power efficiency for both prefill and decode workloads.<br>Earlier this year our A0 silicon came back from TSMC N4P, and today we are busy validating our first rack-scale product with customers to fulfill $1B in demand.<br>We're a team of 400+ engineers from NVIDIA, Google TPUs, Broadcom, SK Hynix, TSMC, & more. We’ve raised $800M across four unannounced financings, including a strategic investment from VentureTech Alliance. We're excited to deepen our partnership with the world's leading semiconductor manufacturer.

Designing a New Pareto Frontier<br>Our inference systems are built to push the entire pareto curve on frontier models, including many-trillion parameter MoEs, long context, and agentic workloads. This required intense co-design, from new chips, packages, PCBs, cold plates, interconnects, & more. Today, we're sharing two breakthroughs to make this happen:

Low Voltage Inference (LVI) for high-throughput workloads<br>Today, AI chips can't scale FLOPs without thermal throttling. As FLOPs utilization increases, AI chips draw more power and downregulate clock speed. This often results in sustained inference throughput under half of Peak FLOPs.<br>We’ve designed a new architecture to run our chip’s math blocks at under half the voltage of most AI chips. This enables multiple times the FLOPs density of AI chips today. We can run trillion parameter sparse MoEs at 80%+ Peak FLOPs without thermal throttling.<br>Running LVI requires co-designing the entire cluster from the transistor to the token: new splittable math arrays, circuit techniques, novel tiling and scheduling algorithms, power delivery networks, VRM architectures, advanced packaging, cold plate designs, and more.

Cluster Scale Memory (CSM) for low latency workloads<br>Today's AI chips using HBM can’t achieve SRAM-level decode speeds due to memory subsystem and interconnect bottlenecks. SRAM-only chips have lower FLOPs density and memory capacity, sacrificing throughput.<br>We created a much lower-latency shared memory pool across our scale-up domain. We use a proprietary ultra-low-latency, high-bandwidth interconnect to enable dramatically faster memory access across chips.<br>Our HBM/SRAM hybrid design solves both memory capacity and mem2mem latency, enabling high throughput and interactivity simultaneously. CSM improves latency and avoids today's cost, reliability, yield, thermal, and compute tradeoffs of SRAM-only chips, 3D DRAM chips, or optics.

We’ve made co-design decisions hand-in-hand with leading AI companies, cloud providers, and hyperscalers. We’ve tested racks in representative data center deployments, run terabytes of production traffic patterns through our simulator, and had dozens of engineers live overseas for months to co-design deeply with our supply-chain partners. If this sounds exciting, you should join us.

Getting to Gigawatt Scale<br>Early customer tests show us achieving SOTA throughput, latency, and power efficiency on inference workloads. We’ll be sharing more updates on our performance and roadmap this summer.<br>Our first racks ship this summer, and we’ve kicked off production to fulfill over $1B in customer contracts. To enable 24/7 engineering cycles, we’ve opened a Taiwan factory and built a data center, test house, and NPI prototyping lab in our San Jose office.<br>We are vertically integrated to get to Gigawatt scale as quickly as possible. Math block designers sit next to inference engineers, thermal experts next to GSMs.

Get Access→

Gavin Uberti

Co-Founder & CEO<br>Harvard Thiel Fellow. World math champion, Math 55 alumnus, expert in AI compilers, developed the Cortex-M backend for TVM.

Robert Wachen

Co-Founder & President<br>Harvard Thiel Fellow. Co-founded Prod ($100B+ cohort valuation) and Mentor Labs (acq. by Crimson Education).

Mark Ross

CTO<br>Ex-CTO of Cypress (acquired for $9.4B). Shipped 5 systems generating >$1B in revenue, all on A0 silicon.

Brian Loiler

VP of Platform<br>Ex-NVIDIA for 22 years. Led platform engineering teams across NVIDIA. Built the HGX and DGX systems from scratch.

Wayne Cao

VP of Production<br>Led 0-1 production & supply chain ramps for 24 products including the original iPhone, MacBook Air, Pixel, and Chromebook.

Saptadeep Pal

VP of ASIC & Architecture<br>Co-founded Auradine. NVIDIA H100/A100/V100 architecture team. Qualcomm award for research on waferscale SRAM & DRAM stacking.

David Munday

VP of Software<br>Built the TPU software team (TPU v1-v5) and led research for Project Astra at Deepmind.

Tim Perevozchikov

VP of Finance<br>Ex-VP Quant Trading & Chief of Staff to CEO at Two Sigma Securities. Built multiple new trading desks from scratch.

Ajat Hukkoo

Distinguished Engineer at Broadcom & VP of Intel's Custom Silicon Group. Shipped 300...

chips inference latency from frontier throughput

Related Articles