You can't train in what the model already knows: the case against "ImageNet for C++" - HFT University
Published: June 16, 2026
I Benchmarked the AI's "Fast" C++. It Wasn't Faster.
A few days ago I showed that adding "make it as fast as possible" to a C++ prompt roughly doubles the memory-safety violations in what four frontier models hand back. The latency sentence makes the model drop std::span and walk a raw pointer by hand, exactly the construct the C++29 bounds profile exists to ban.
The obvious objection landed in my inbox within the hour, in several flavors of the same sentence: fine, but the fast version is faster. That is the trade. You buy speed with safety, and on the hot path you take that trade every time.
It is a reasonable thing to assume. It is also wrong, and I can show you the cycle counts. The raw pointers did not buy the speed. Something else did, and that something is fully available with the bounds intact. So the trade everyone thinks they are making does not exist: the unsafe version is not a faster version, it is just an unsafe version.
The one task where this is not obvious
Summing a contiguous block of doubles is the right test, because it is the one case where the naive safe loop genuinely is slow, and for a real reason. Floating-point addition is not associative. (a + b) + c is not bit-identical to a + (b + c), so a compiler is not allowed to reassociate a plain sequential sum without your permission. That means this:
double sum(std::span<const double> d) {
    double total = 0.0;
    for (double x : d) total += x; // serialized on FP-add latency
    return total;
}
compiles, at -O3, to a single scalar accumulator chained through a 3-to-4 cycle floating-point add. Every iteration waits for the previous one to finish. It is bounds-safe, it is readable, and on a Zen 2 core it runs at about 0.97 ns per element no matter what the cache is doing. It is slow, and the models are not wrong to avoid it.
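The non-associativity is easy to see for yourself. Here is a minimal sketch, assuming IEEE 754 doubles on a 64-bit target and no -ffast-math; the two function names are mine, chosen for illustration:

```cpp
// Hypothetical helpers: the same three values, grouped two different ways.
// Without -ffast-math the compiler must preserve the grouping as written.
double grouped_left(double a, double b, double c)  { return (a + b) + c; }
double grouped_right(double a, double b, double c) { return a + (b + c); }

// With a = 1e16, b = -1e16, c = 1.0:
//   grouped_left  cancels the two large values first, then adds 1.0.
//   grouped_right adds 1.0 to -1e16 first; at that magnitude adjacent
//   doubles are 2 apart, so the 1.0 is rounded away before a is added.
```

Called with (1e16, -1e16, 1.0), the first grouping returns 1.0 and the second returns 0.0. A compiler that silently regrouped the sum could change your answer, which is why it will not do so on its own.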
When you ask Claude or Gemini or GPT for the fast version, they all reach for the same idea: break the dependency chain with multiple accumulators so the out-of-order engine can keep several adds in flight. Here is what Gemini's latency answer does, lightly trimmed:
double fast_sum(const double* data, std::size_t size) { // raw pointer
    double s0=0,s1=0,s2=0,s3=0,s4=0,s5=0,s6=0,s7=0;
    std::size_t i = 0, lim = size & ~std::size_t(7);
    while (i
Eight accumulators, raw pointer indexing, no bounds anywhere. It is fast. The question is whether the raw pointer is doing any of the work.
It is not. Here is the same thing, bounds-safe
Take that exact algorithm and write it over a std::span. Same eight accumulators, same unrolling, but the data is carried with its length and nothing indexes past it:
double safe_fast8(std::span<const double> d) {
    double a[8] = {0,0,0,0,0,0,0,0};
    std::size_t n = d.size(), i = 0, lim = n & ~std::size_t(7);
    for (; i
I put both of those, plus Claude's four-accumulator version, GPT's hand-written AVX2, the naive loop, and std::reduce, into one benchmark and ran it on hz2, a Ryzen 9 3900 with the test core isolated at boot (isolcpus + nohz_full, taskset -c 6, performance governor). g++-13, -O3 -march=native, median of 31 trials, every implementation verified to produce the same sum before any timing started. Array sizes from 512 doubles (lives in L1) up to 32 million (lives in DRAM).
Read the green line and the red line. safe_fast8, the bounds-safe span, and Gemini's raw-pointer fast_sum are not close, they are identical to the digit at every single size: 0.11 and 0.11 in L1, 0.12 and 0.12 in L2, 0.31 and 0.31 in DRAM. Same algorithm, span versus pointer, same machine code, same cycles. The bounds safety costs exactly nothing. The naive loop sits up at 0.97 the whole way, roughly three times slower in DRAM and nearly nine times slower in L1, and Claude's unsafe four-accumulator version is actually worse than the safe eight-accumulator one because four accumulators do not fill the pipeline.
The speed never lived in the raw pointer. It lived in the eight accumulators, which is an algorithm choice, not a safety choice. You can have the dependency-breaking and keep the span. The models bundled the two together because in their training data "fast loop" and "raw pointer" co-occur, not because one needs the other.
GPT's hand-written AVX2 is the one thing that beats safe_fast8, by about 2x, but only while the data is L1-resident, which, for the buffer sizes this code targets, it rarely is. By L3 it has converged with the safe version, and by DRAM everything bandwidth-bound sits together at 0.31. And even that L1 win is reproducible safely: a span with sixteen accumulators autovectorizes to the same AVX2. GPT hand-rolled an intrinsic that the compiler would have written for it, and the only thing the hand-rolling bought was a reinterpret_cast and 125 lines of macros.
One more line worth your attention. Rebuild the naive safe loop with...