Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks | Aarush Gupta
07 Jun 2026
This post is a high-level explainer for my Master’s thesis, which involves designing hardware architectures for ultrafast inference and online learning using the Kolmogorov-Arnold Network (KAN) architecture. I’ll assume familiarity with standard machine learning concepts, as well as some understanding of hardware and digital circuits; read my previous post here for the latter.
Please read the two papers below for more information, particularly for details on benchmarks and notable results.
[FPGA 2026 Best Paper]
Duc Hoang*, Aarush Gupta*, and Philip C. Harris. “KANELÉ: Kolmogorov–Arnold Networks for Efficient LUT-based Evaluation.” Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2026. https://dx.doi.org/10.1145/3748173.3779202
[ICML 2026]
Duc Hoang*, Aarush Gupta*, and Philip Harris. “Ultrafast on-FPGA Online Learning via Spline Locality in Kolmogorov-Arnold Networks.” arXiv preprint arXiv:2602.02056, 2026. https://arxiv.org/abs/2602.02056
*equal contribution
The case for machine learning on FPGAs
Most modern machine learning workloads, whether training or inference, run on graphics processing units (GPUs). Through hardware architectures that support a highly parallel execution model, GPUs can perform simple operations on large amounts of data with extremely high throughput. This makes them ideal for machine learning problems involving large architectures or batch-style training and inference.
However, complex GPU architectures cannot meet the demands of applications that require ultralow latency (e.g. sub-microsecond) and high hardware efficiency. General-purpose processors such as CPUs and GPUs incur significant overhead from scheduling and optimizing instructions, dynamically accessing memory, and so on. Highly specialized workloads with latency budgets on the order of nanoseconds and strict efficiency requirements are instead better served by custom hardware accelerators.
Field-programmable gate arrays, or FPGAs, are reconfigurable digital logic devices that are extremely well-suited for this style of custom hardware acceleration. FPGAs contain lookup tables (LUTs), which represent digital functions by enumerating the output value for every combination of binary inputs; flip-flops (FFs), which store state; and other memory and computation primitives. These components and the connections between them are reconfigured to design a custom digital circuit, allowing for low-level hardware architecture and algorithm co-design that enables ultrafast machine learning. Importantly, neural networks are implemented directly as digital logic, rather than as instructions that are sequentially executed on a processor.
Background
Fixed-point quantization
FPGAs and other digital devices fundamentally operate on bits rather than continuous values. However, we often think about arithmetic operations in neural networks (e.g. $\times, +$) as happening over the real numbers $\mathbb R$. We thus need to encode real numbers as bitstrings (sequences of bits), a process known as quantization. Operations like addition and multiplication then become binary functions.
One method for doing this is fixed-point quantization.
Fixed-point quantization represents numbers in base 2, where some bits (the fractional bits) come after the binary point. To illustrate, suppose we use 8 bits total with 4 fractional bits, so that a signed 8-bit integer $n$ stands for the value $n / 2^4$. We can then represent $2^8$ values from $(-2^7) / 2^4 = -8$ to $(2^7 - 1) / 2^4 = 7.9375$, spaced evenly in increments of $1/2^4 = 0.0625$. We will assume here that the representable range is (approximately) symmetric about zero.
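As a rough illustration of the 8-bit, 4-fractional-bit example above, here is a minimal Python sketch of fixed-point encoding. The function names (`fixed_point_range`, `quantize`, `dequantize`, `to_bits`) are hypothetical, and the sketch assumes a signed two's-complement layout with round-to-nearest and saturation at the range edges. Other rounding and overflow conventions are equally possible.

```python
def fixed_point_range(total_bits: int, frac_bits: int):
    """Return (min, max, step) of a signed fixed-point format."""
    scale = 1 << frac_bits                      # 2**frac_bits
    lo = -(1 << (total_bits - 1)) / scale       # most negative value
    hi = ((1 << (total_bits - 1)) - 1) / scale  # most positive value
    return lo, hi, 1 / scale


def quantize(x: float, total_bits: int, frac_bits: int) -> int:
    """Map a real number to its integer code.

    Assumption: round to nearest (Python's round() breaks exact ties
    toward the even integer) and saturate values outside the range.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    code = round(x * scale)
    return max(q_min, min(q_max, code))


def dequantize(code: int, frac_bits: int) -> float:
    """Recover the real value that an integer code stands for."""
    return code / (1 << frac_bits)


def to_bits(code: int, total_bits: int) -> str:
    """Two's-complement bitstring of an integer code."""
    return format(code & ((1 << total_bits) - 1), f"0{total_bits}b")


lo, hi, step = fixed_point_range(8, 4)
code = quantize(1.3, 8, 4)
print(lo, hi, step)
print(code, to_bits(code, 8), dequantize(code, 4))
```

Here 1.3 is not representable, so it is stored as the nearest multiple of 0.0625. The gap between the two is the quantization error discussed next.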
In a fixed-point quantization scheme, we can only represent a discrete set of values in some fixed range, which will lead to approximation error when trying to represent real values. One focus of resource-efficient machine learning is minimizing this approximation error, or quantization error, to enable stable training and inference.
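To get a feel for the size of this error, the sketch below sweeps the representable range and records the worst-case gap between a real value and its quantized version. It assumes round-to-nearest with saturation, and the helper names are hypothetical. Under that assumption, the error for in-range inputs is at most half a step, i.e. $2^{-(f+1)}$ for $f$ fractional bits.

```python
def quantize_value(x: float, total_bits: int, frac_bits: int) -> float:
    """Round x to the nearest representable fixed-point value (saturating)."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    code = max(q_min, min(q_max, round(x * scale)))
    return code / scale


def max_quantization_error(total_bits: int, frac_bits: int,
                           samples: int = 10_001) -> float:
    """Worst observed |x - Q(x)| over evenly spaced in-range inputs."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1)) / scale
    hi = ((1 << (total_bits - 1)) - 1) / scale
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        worst = max(worst, abs(x - quantize_value(x, total_bits, frac_bits)))
    return worst


# Fixed 8-bit budget, varying how many bits sit after the binary point.
for frac_bits in (2, 4, 6):
    observed = max_quantization_error(8, frac_bits)
    bound = 2.0 ** -(frac_bits + 1)
    print(f"frac_bits={frac_bits}: observed={observed:.6f}, bound={bound:.6f}")
```

With a fixed total bit width, each extra fractional bit halves the worst-case rounding error but also halves the representable range, so precision and range have to be traded off against each other.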
Lookup-table neural networks (LUT-NNs)
FPGAs implement digital logic primarily through lookup tables (LUTs), which are small components that represent arbitrary binary functions by storing their output for each combination of binary inputs. For example, $\text{AND} : \{0, 1\}^2 \to \{0, 1\}$ is represented with the lookup table

| Input ($x,y$) | $x\text{ AND }y$ |
|---|---|
| 00 | 0 |
| 01 | 0 |
| 10 | 0 |
| 11 | 1 |
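In software terms, a lookup table is just an array indexed by the input bits. The sketch below is a software model for illustration, not how an FPGA is actually programmed. It stores the AND table as a four-entry list and evaluates it by turning the input bits into an index. The names `lut_eval` and `build_lut` are hypothetical.

```python
from itertools import product

# One stored output bit per input combination: 2**2 = 4 entries.
# Index 0b00 -> 0, 0b01 -> 0, 0b10 -> 0, 0b11 -> 1.
AND_LUT = [0, 0, 0, 1]


def lut_eval(table, bits):
    """Evaluate a LUT: pack the input bits (first bit most significant)
    into an integer index and read the stored output."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return table[index]


def build_lut(fn, num_inputs):
    """Enumerate every input combination of a binary function to get its table."""
    return [fn(*bits) for bits in product((0, 1), repeat=num_inputs)]


for x, y in product((0, 1), repeat=2):
    print(x, y, lut_eval(AND_LUT, (x, y)))

# Any other 2-input function fits in the same structure with different contents.
XOR_LUT = build_lut(lambda x, y: x ^ y, 2)
```

The table size grows as $2^k$ for $k$ input bits, which is why each individual LUT has to stay small.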
It then makes sense to learn these binary functions, represented as lookup tables, as core primitives of a neural network: such a network is known as a lookup-table neural network (LUT-NN). However, learning lookup tables through gradient descent or similar approaches is difficult.
To address this issue, recall that we can learn real-valued functions $f: \mathbb R \to \mathbb R$ through gradient descent. If we perform fixed-point quantization with $b_i$ input bits and $b_o$ output bits,...