FlashLib: Bringing Flash Magic to Classical Machine Learning Operators


Shuo Yang¹, Haocheng Xi¹, Yilong Zhao¹, Qiuyang Mang¹, Zhe Wang², Shanlin Sun², Kurt Keutzer¹, Joseph E. Gonzalez¹, Song Han³, Chenfeng Xu⁴,*, Ion Stoica¹,*

¹UC Berkeley · ²UC Irvine · ³MIT · ⁴UT Austin · *Co-advising

Code: github.com/FlashML-org/flashlib

Headline speedups: 26× KMeans · 19× KNN · 208× TruncatedSVD · 47× PCA · 7× UMAP · 40× HDBSCAN · 147× t-SNE (exact) · 49× MultinomialNB

Introducing FlashLib, a GPU library for classical machine learning operators on modern hardware, rebuilt for today's ML workloads and emerging agentic AI systems. Here are a few headline results from the first release:

Significant wins over cuML on Hopper GPUs: up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB.

Flash informative API: Predict runtime, memory footprint, and overhead for any workload in ~5 µs on pure CPU, with no GPU profiling required.

Fast cold start, built to scale: FlashLib uses heuristic kernel selection to avoid long autotune loops, and already supports multi-GPU execution for large workloads.

Toward optimal hardware utilization: FlashLib drives kernels much closer to the limits of modern GPUs, with Flash-KMeans reaching up to 61% of peak FLOPs and Flash-KNN reaching up to 85.2% of peak HBM bandwidth on H200.
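To make the "predict on CPU" and "percent of peak" ideas concrete, here is a minimal roofline-style sketch in Python. It is not FlashLib's API or its actual cost model, which this post does not show. The names `Device`, `kmeans_assign_cost`, `roofline_seconds`, and `utilization` are hypothetical, the FLOP and byte counts assume a simple dot-product formulation of one KMeans assignment pass, and the device numbers are round placeholders rather than datasheet values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Peak rates for a device (placeholders; take real values from a datasheet)."""
    peak_flops: float      # FLOP/s for the dtype in use
    peak_bandwidth: float  # bytes/s of HBM bandwidth


def kmeans_assign_cost(n, k, d, bytes_per_elem=4):
    """Approximate FLOPs and minimum HBM traffic for one assignment pass.

    Assumption: distances come from a dot product, so about 2*n*k*d FLOPs;
    points and centroids are each read once and one int32 label is written
    per point.
    """
    flops = 2 * n * k * d
    bytes_moved = (n * d + k * d) * bytes_per_elem + n * 4
    return flops, bytes_moved


def roofline_seconds(flops, bytes_moved, dev):
    """Lower-bound runtime: the slower of the compute and memory limits."""
    return max(flops / dev.peak_flops, bytes_moved / dev.peak_bandwidth)


def utilization(work, seconds, peak_rate):
    """Fraction of peak achieved: (work / time) divided by the peak rate."""
    return work / seconds / peak_rate


if __name__ == "__main__":
    dev = Device(peak_flops=1.0e15, peak_bandwidth=4.0e12)  # placeholder numbers
    flops, nbytes = kmeans_assign_cost(n=1_000_000, k=1024, d=128)
    bound = roofline_seconds(flops, nbytes, dev)
    print(f"roofline lower bound: {bound * 1e6:.1f} us")
    # A measured time twice the bound corresponds to 50% of the limiting peak.
    print(f"FLOP utilization at 2x bound: {utilization(flops, 2 * bound, dev.peak_flops):.0%}")
```

Because the estimate is only a handful of arithmetic operations, it runs on the CPU without touching the GPU. The same ratio, achieved rate divided by peak rate, is what figures such as "61% of peak FLOPs" or "85.2% of peak HBM bandwidth" express.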

The next frontier of AI efficiency is not just faster LLM inference. It is faster intelligence assembly. For the past few years, MLSys work has largely followed a model-centric view of intelligence. As LLMs became stronger through better reasoning, larger-scale test-time compute, and more capable inference, the systems community focused on making the transformer core faster: FlashAttention, FlashDecoding, KV-cache management, LLM serving systems, and so on.

But the rise of agentic AI is changing the bottleneck. Modern intelligence is increasingly built around the base model through tools, harnesses, retrieval, verification, search, and orchestration. The LLM is no longer merely a standalone reasoner; it becomes a controller over a broader computational system. As a result, the performance bottleneck is no longer confined to transformer inference. It extends to the entire computational substrate surrounding the model.

For example, in agentic AI for science, LLM agents may generate hypotheses or candidate solutions, but the surrounding loop often depends on search, clustering, nearest-neighbor retrieval, PCA, SVD, and other classical ML operators for verification and feedback. In multimodal generation and physical AI, models must increasingly process, compress, retrieve, and reorganize streaming features on the fly before they enter the model.

These examples point to a broader shift: classical ML operators are becoming core primitives around the LLM. We envision future agentic workflows where clustering, retrieval, dimensionality reduction, verification, and linear algebra are no longer offline utilities, but online primitives in the critical path of intelligence assembly. Figure 1 illustrates this shift.

[Figure 1: five classical ML operators (K-means, k-NN, TruncatedSVD, PCA, HDBSCAN) migrating from the batch latency tier into the millisecond serving tier. Vertical axis: latency tier on a log scale (1 ms, 10 ms, 100 ms, 1 s, 1 min, 1 hr). Horizontal axis: year the operator entered this latency tier (2015 to 2027). Labeled workloads: user segmentation, feature reduction, topic modeling, doc clustering, item-item recsys; pipeline PCA, embedding compression, topic discovery, RAG retrieval, semantic cache; PCA-based compression, SVD-based compression, streaming video generation, KV cache clustering, agent tool routing.]

Figure 1. The latency budget for classical ML operators (KMeans, k-NN, TruncatedSVD, PCA, HDBSCAN) has been falling steadily over the past decade, on a log scale. The same primitives that used to run offline at the minute-to-hour tier (user segmentation, topic modeling, batch feature reduction) are now being called inside online serving paths (RAG retrieval, semantic cache, KV-cache clustering, agent tool routing) where the budget is measured in milliseconds. As this trend continues, the systems community needs implementations of these operators that are fast, hardware-efficient, reliable, and numerically faithful enough to sit in the critical path.

However, the underlying implementations of these classical operators have not kept pace with this shift. Their core design assumptions still come from the pre-FlashAttention, pre-Hopper, pre-agent era, which creates...
