Zero-Copy Data Movement from NIC to GPU at 100s of Gbps


DAQIRI: Data Acquisition for Integrated Real-time Instruments


nvidia/daqiri


NVIDIA Open Source Data Acquisition

DAQIRI Connects Sensor Data to the NVIDIA Compute Ecosystem

DAQIRI (Data Acquisition for Integrated Real-time Instruments) moves high-bandwidth data between external sensors and GPU, CPU, or storage devices. Streams can arrive from PCIe devices such as FPGAs or from network-capable sensors over Raw Ethernet (UDP/TCP) or RoCE/RDMA, giving applications one zero-copy path for ingest and egress. Beyond accelerating data movement and storage at the instrument, DAQIRI can also connect sensor data to HPC and cloud systems.


Sensor paths: PCIe + Ethernet
Data direction: ingest + egress
CPU/GPU memory: zero-copy
Protocols: Raw Ethernet, RoCE
Application API: C++ / Python

Why DAQIRI

Closing the Gap Between Sensor and GPU

Scientific and industrial instruments generate data that is richest at the source, before it is filtered, decimated, or summarized. DAQIRI places NVIDIA GPU hardware directly in that data path, forging a tight bond between upstream sensors, their data converters, and the NVIDIA compute ecosystem. The result is a new foundation for developers: the ability to work with instrument data in its rawest form, at wire speed, and to build a new class of autonomous experiments where AI can observe phenomena directly at the source, augment human analysis, and steer experiments in real time. Stream data into and out of GPUs efficiently while leveraging common tensor-compute libraries.

Scalable, High Throughput

Hundreds of gigabits per second with proper hardware and CPU/NUMA tuning. Direct access to NIC ring buffers keeps latency at PCIe transit time only.
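To put these line rates in perspective, the per-packet time budget follows from simple arithmetic. The sketch below is not part of DAQIRI; the function names are hypothetical, and it assumes standard Ethernet wire overhead of 20 bytes per frame (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), with the frame size including the 4-byte FCS.

```cpp
#include <cstdint>

// Per-frame wire overhead outside the frame itself:
// 7-byte preamble + 1-byte SFD + 12-byte inter-frame gap.
constexpr std::uint64_t kWireOverheadBytes = 20;

// Packets per second a link can carry at the given line rate.
// frame_bytes is the full Ethernet frame (headers + payload + FCS).
double packets_per_second(double line_rate_gbps, std::uint64_t frame_bytes) {
    const double bits_per_frame =
        8.0 * static_cast<double>(frame_bytes + kWireOverheadBytes);
    return line_rate_gbps * 1e9 / bits_per_frame;
}

// Average time available to handle one packet, in nanoseconds.
double nanoseconds_per_packet(double line_rate_gbps, std::uint64_t frame_bytes) {
    return 1e9 / packets_per_second(line_rate_gbps, frame_bytes);
}
```

At 100 Gbps with minimum-size 64-byte frames, this works out to roughly 148.8 million packets per second, or about 6.7 ns per packet, which is why per-packet software work has to be amortized over bursts.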


GPUDirect Zero-Copy

Two GPU receive modes: Header-data split (headers to CPU, payload to GPU; recommended) and Batched GPU (entire packets to GPU for maximum bandwidth).
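Conceptually, header-data split places a boundary inside each packet: everything before it goes to CPU memory, everything after it to GPU memory. The sketch below only illustrates where that boundary falls for one assumed packet layout (untagged Ethernet II, IPv4, UDP). It is a software model with hypothetical names, not DAQIRI's API; with a supporting NIC the split is performed by hardware into separate memory segments.

```cpp
#include <cstddef>
#include <cstdint>

// A contiguous byte range inside a received frame.
struct Segment {
    const std::uint8_t* data;
    std::size_t len;
};

// Result of locating the header/payload boundary.
struct SplitPacket {
    Segment header;   // would land in CPU memory
    Segment payload;  // would land in GPU memory
    bool ok;          // false if the frame is not untagged IPv4/UDP
};

// Hypothetical helper: find the header/payload boundary of an
// untagged Ethernet II / IPv4 / UDP frame.
SplitPacket split_header_payload(const std::uint8_t* frame, std::size_t len) {
    constexpr std::size_t kEthLen = 14;    // dst MAC + src MAC + EtherType
    constexpr std::size_t kMinIpLen = 20;  // IPv4 header without options
    constexpr std::size_t kUdpLen = 8;
    const SplitPacket bad{{nullptr, 0}, {nullptr, 0}, false};

    if (frame == nullptr || len < kEthLen + kMinIpLen + kUdpLen) return bad;
    // EtherType 0x0800 identifies IPv4.
    if (frame[12] != 0x08 || frame[13] != 0x00) return bad;

    const std::uint8_t ver_ihl = frame[kEthLen];
    if ((ver_ihl >> 4) != 4) return bad;
    // IHL counts 32-bit words, so the header length is IHL * 4 bytes.
    const std::size_t ip_len = (ver_ihl & 0x0F) * 4u;
    if (ip_len < kMinIpLen) return bad;
    // IPv4 protocol field (offset 9) must be 17 for UDP.
    if (frame[kEthLen + 9] != 17) return bad;

    const std::size_t hdr_len = kEthLen + ip_len + kUdpLen;
    if (len < hdr_len) return bad;
    return {{frame, hdr_len}, {frame + hdr_len, len - hdr_len}, true};
}
```

For an IPv4 header without options the boundary sits at byte 42 (14 + 20 + 8), so the CPU-side segment stays small and fixed-size while the GPU-side segment holds only sample data.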


Hardware Flow Steering

Route packets based on header matching to steer different streams to different GPUs or CPUs, entirely in NIC silicon, before any software runs.
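The idea of header-matched steering can be modeled in a few lines of software. The sketch below is only an analogy with hypothetical types: it assumes rules that match on UDP destination port and map to a receive queue, whereas the real matching happens in NIC hardware and the fields DAQIRI's flow rules can match are not specified here.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical rule: packets with this UDP destination port
// are delivered to this receive queue.
struct FlowRule {
    std::uint16_t udp_dst_port;
    std::uint16_t queue_id;
};

// Software model of a flow table; the first matching rule wins.
class FlowTable {
public:
    void add_rule(FlowRule rule) { rules_.push_back(rule); }

    // Returns the queue for a packet, or nullopt if no rule matches.
    std::optional<std::uint16_t> steer(std::uint16_t udp_dst_port) const {
        for (const FlowRule& rule : rules_) {
            if (rule.udp_dst_port == udp_dst_port) return rule.queue_id;
        }
        return std::nullopt;
    }

private:
    std::vector<FlowRule> rules_;
};
```

If each queue is backed by memory on a different GPU or CPU, a table like this is what lets two sensor streams on the same link end up on different devices without any per-packet software decision.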


RDMA over Converged Ethernet

Run RDMA READ, WRITE, and SEND over standard Ethernet via RoCE, with no specialized InfiniBand fabric required. The same libibverbs API also supports InfiniBand for environments where it is available.


YAML-Driven Configuration

Define memory regions, NIC interfaces, TX/RX queues, and flow rules in a single YAML file, or build the same config in C++ code. Switch stream types, memory kinds, and buffer sizes without recompiling.


Containerized Deployment

A ready-to-run container bundles all userspace dependencies, including a dmabuf-patched DPDK, so there is no host-side dependency setup and no peermem kernel module. From docker pull to running benchmarks in minutes.

Quick Start

Build & Run in Minutes

Runs on Linux (kernel 5.4+) with the CUDA Toolkit 12.2+. The kernel-bypass and GPUDirect paths additionally require an NVIDIA ConnectX-6 Dx (or newer) NIC.


Install Prerequisites

Install the CUDA Toolkit (12.2 or newer).

For the Raw Ethernet / GPUDirect / RoCE path, you also need an NVIDIA ConnectX-6 Dx (or newer) NIC. The default Ubuntu kernel drivers are sufficient; we recommend additionally installing doca-ofed for the diagnostic utilities (ibstat, ibv_devinfo, mlxconfig, mlnx_perf, …).

Build from Source

Select optional engines with DAQIRI_ENGINE. Valid values: dpdk, ibverbs. Linux sockets are always built in.

# Configure, build, install
cmake -S . -B build \
  -DBUILD_SHARED_LIBS=ON \
  -DDAQIRI_BUILD_PYTHON=OFF \
  -DDAQIRI_ENGINE="dpdk ibverbs"
cmake --build build -j
cmake --install build --prefix /opt/daqiri

Or Build the Container

The Dockerfile builds DPDK from source with dmabuf patches, so no peermem is needed inside the container. Set BASE_IMAGE=torch to build on top of NGC PyTorch for Torch / TensorRT inference workflows.

BASE_TARGET=dpdk \
DAQIRI_ENGINE="dpdk ibverbs" \
scripts/build-container.sh

Tune the System

Run the diagnostic script to surface common networking bottlenecks (CPU governor, hugepages, MRRS, NUMA, GPU clocks, MTU, BAR1, PCIe topology):

sudo python3 python/tune_system.py --check all

Run a Benchmark

Edit the YAML to match your hardware (PCIe BDF, CPU cores, IPs), then:

./build/examples/daqiri_bench_raw_gpudirect \
  examples/daqiri_bench_raw_tx_rx.yaml \
  --seconds 10

C++: Initialize & Receive Packets

#include

// Init from YAML config
daqiri::daqiri_init("config.yaml");

// Non-blocking burst receive
daqiri::BurstParams *burst;
auto s = daqiri::get_rx_burst(&burst, port_id, queue_id);

if (s == daqiri::Status::SUCCESS) {
  int n = daqiri::get_num_packets(burst);
  for (int i = 0; i < n; i++) {
    void* p = daqiri::get_packet_ptr(burst, i);
    // process p ...
  }
  // Release every packet in the burst, then the burst itself
  daqiri::free_all_packets_and_burst_rx(burst);
}

C++: Header-Data Split (GPU payload)

// Seg 0 = headers (CPU)
// Seg 1 = payload...
