Task Failed Successfully: Saturating NIC and Disk Bandwidth | MrCroxx's Blog
#System #TLB #RDMA #Io_uring
0. “Task Failed Successfully”
The AI era has arrived faster than most of us expected. Agentic coding has completely changed the way I work day to day. To be honest, I haven’t written a single line of code at work in quite a while. Yes, it is true. NOT A SINGLE LINE!! And yet, that hasn’t stopped the code from running at peak performance across clusters of hundreds of HPC servers.
Of course, not writing code (or even not fully reviewing it) does not mean we are just poking around at random, like a monkey at a typewriter. We still need to analyze requirements, refine the design with the agent, build demos, run mock experiments, study the results from small-scale tests, iterate on the problems we find, and maintain a complete, solid testing process, blah blah blah.
Monkey Typing
However, with AI and agentic coding, everything has become faster. Sometimes code is churned out faster than we can fully understand it. And sometimes it is churned out faster than even the AI can understand it. Yes, you read that right. This post comes from one such example.
After I prompted my agent to optimize the performance of my system, the AI quickly took it from roughly half of the achievable throughput to full saturation. But its explanation of why the change worked was completely wrong. It was a classic case of “task failed successfully”.
Task Failed Successfully
This post doesn’t talk about why the AI “failed successfully”. It is a walkthrough of the analysis and debugging process behind this performance optimization.
1. Optimize a Demo with 1 NIC and 8 Disks
Let’s reduce the system to a simple abstraction so we can focus on the performance optimization rather than the complex business logic:
A single thread issues 1 MiB random direct I/O reads across 8 NVMe drives, then sends the data to a remote host via RDMA WRITE. Now, saturate the NIC bandwidth.
More specifically, each drive can deliver up to 7 GiB/s of read throughput, and the NIC provides 400 Gb/s of network bandwidth. All devices are attached to the same NUMA node. The worker thread is pinned to a core other than CPU 0. The host runs with the IOMMU in passthrough mode, so DMA from the I/O devices involved is not translated by the IOMMU.
For the implementation, I (actually it was my AI agent) built a very simple event loop: the client sends read requests to the server; the server polls the RDMA CQ for incoming requests, submits the reads through io_uring, polls for the resulting CQEs, and then sends the data back via RDMA WRITE.
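To make the control flow of that loop concrete, here is a toy model in Python. It is not the demo's code: the RDMA completion queue and io_uring are replaced by in-memory queues, and every name in it (MockServer, poll_rdma_cq, submit_reads, poll_cqes, rdma_write) is hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MockServer:
    """Single-threaded poll loop with in-memory stand-ins for the RDMA CQ and io_uring."""
    rdma_cq: deque = field(default_factory=deque)         # read requests from the client
    uring_inflight: deque = field(default_factory=deque)  # reads submitted, not yet completed
    sent: list = field(default_factory=list)              # requests answered via "RDMA WRITE"

    def poll_rdma_cq(self):
        """Drain every request that has arrived since the last turn."""
        reqs = list(self.rdma_cq)
        self.rdma_cq.clear()
        return reqs

    def submit_reads(self, reqs):
        """Stand-in for filling SQEs and submitting them to the kernel."""
        self.uring_inflight.extend(reqs)

    def poll_cqes(self, max_completions):
        """Stand-in for reaping CQEs: at most max_completions reads finish per turn."""
        done = []
        while self.uring_inflight and len(done) < max_completions:
            done.append(self.uring_inflight.popleft())
        return done

    def rdma_write(self, req):
        """Stand-in for posting the RDMA WRITE that returns the data."""
        self.sent.append(req)

    def run_once(self, max_completions=2):
        """One turn of the loop: poll requests, submit reads, reap, send."""
        self.submit_reads(self.poll_rdma_cq())
        for req in self.poll_cqes(max_completions):
            self.rdma_write(req)


server = MockServer()
server.rdma_cq.extend(range(5))   # five pending read requests
turns = 0
while len(server.sent) < 5:
    server.run_once()
    turns += 1
print(server.sent, turns)
```

The point of the model is only that one thread does every step in sequence, so any CPU time spent in the submit step directly delays polling and sending.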
Simple Demo Topology
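As a rough sanity check, the hardware figures above can be converted into a common unit. This is only a back-of-the-envelope sketch: it assumes the 400 Gb/s figure means decimal gigabits with no protocol overhead, and the variable names are illustrative.

```python
# Back-of-the-envelope budget for the demo: 1 NIC, 8 NVMe drives, 1 MiB reads.
GIB = 2**30
MIB = 2**20

nic_bits_per_s = 400e9      # 400 Gb/s line rate (decimal bits, overhead ignored)
drives = 8
drive_peak_gib_s = 7.0      # per-drive read peak quoted in the text
io_size = 1 * MIB

nic_gib_s = nic_bits_per_s / 8 / GIB      # bits/s -> bytes/s -> GiB/s
per_drive_gib_s = nic_gib_s / drives      # load each drive must carry at line rate
iops = nic_gib_s * GIB / io_size          # 1 MiB reads per second at line rate

print(f"NIC ceiling:     {nic_gib_s:.1f} GiB/s")
print(f"per-drive load:  {per_drive_gib_s:.2f} GiB/s (peak {drive_peak_gib_s} GiB/s)")
print(f"disk aggregate:  {drives * drive_peak_gib_s:.0f} GiB/s")
print(f"IOPS at ceiling: {iops:,.0f}")
```

Under these assumptions the NIC ceiling is about 46.6 GiB/s, each drive has to supply roughly 5.8 GiB/s, and the whole pipeline needs fewer than 48,000 reads per second.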
This demo setup is very straightforward and rules out almost all sources of interference. Other than the NIC we are trying to saturate, every component has plenty of headroom: at line rate the NIC tops out at a theoretical 46.6 GiB/s, which works out to less than 6 GiB/s of reads per drive on average and fewer than 50,000 IOPS in total, and the CPU has more than enough capacity as well.
Now that everything is in place, let’s look at the results.

| inflight | GiB/s | avg µs | p50 µs | p90 µs | p99 µs | p99.9 µs |
|---|---|---|---|---|---|---|
| 4 | 8.87 | 440 | 430 | 519 | 632 | 759 |
| 8 | 15.56 | 501 | 486 | 616 | 784 | 955 |
| 16 | 22.69 | 688 | 670 | 850 | 1118 | 1436 |
| 32 | 22.53 | 1386 | 1384 | 1696 | 1934 | 2272 |
| 64 | 22.15 | 2821 | 2819 | 3133 | 3366 | 3691 |

Surprisingly, the system already hits a bottleneck at an I/O depth of just 16, with aggregate throughput reaching only about half of the NIC’s bandwidth. At that point CPU utilization had already reached 100%.
Clearly something was wrong, so I profiled the system with perf at an I/O depth of 16. Here is the flamegraph.
Simple Demo Flamegraph (iodepth=16)
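The measurements in the results table can be cross-checked with Little's law, which says throughput is roughly the number of bytes in flight divided by the average latency. The sketch below copies the table values in by hand; the function name is illustrative.

```python
# Little's law: throughput = (requests in flight * request size) / average latency.
MIB = 2**20
GIB = 2**30

# (inflight, measured GiB/s, average latency in µs), copied from the results table.
rows = [
    (4, 8.87, 440),
    (8, 15.56, 501),
    (16, 22.69, 688),
    (32, 22.53, 1386),
    (64, 22.15, 2821),
]


def predicted_gib_s(inflight, avg_us, io_size=MIB):
    """Throughput implied by Little's law for a fixed request size."""
    return inflight * io_size / (avg_us * 1e-6) / GIB


for inflight, measured, avg_us in rows:
    pred = predicted_gib_s(inflight, avg_us)
    print(f"inflight={inflight:>2}  measured={measured:5.2f}  predicted={pred:5.2f} GiB/s")
```

The predicted and measured values agree closely. From 16 in flight onward, doubling the depth roughly doubles the average latency while throughput stays flat, which is the signature of a saturated resource: extra requests only wait in a queue.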
As the flamegraph shows, most of the CPU time is spent in io_submit_sqes, which accounts for 81.62% of the total CPU cost. Because the demo uses Direct I/O, every I/O submission requires the kernel to construct DMA metadata from the user-space buffer for the block device to consume. The most costly parts of this path are:
- __bio_iov_iter_get_pages: Turns the iov into bio pages.
- pin_user_pages_fast: Translates a user-space virtual address range into an array of struct page pointers, and pins those pages so they cannot be reclaimed, migrated, or swapped out while the device is performing DMA.
- bio_set_pages_dirty: Marks the buffer pages dirty. With Direct I/O, the NVMe device DMA-writes data directly into the pages backing the user-space buffer. Those pages must then be marked dirty so that the kernel's virtual memory (VM) subsystem does not treat them as clean pages.
- folio_*: Update the VM state associated with the folio, including its reference count, dirty state, mapping, locking, and reclaim-related state. In the Linux VM subsystem, a folio is a unified abstraction for a physically contiguous set of pages.
In short, the wide io_submit_sqes frame represents the cumulative cost of preparing user memory for Direct I/O DMA. Each SQE contains only a user-space pointer and length. The kernel must walk the page tables, find and pin the...