Why my SIMD code was silently running as scalar, and what debugging it taught me about production environment assumptions
Christopher
Jun 04, 2026
Author’s Note: This is the first of many articles I will write detailing the creation of Metis, a trading engine I developed in hopes of answering this question: “Can understanding how energy flows allow us to better understand the many influencers that make up the trade price?”
When you optimize code for speed, you assume the optimization actually compiles.
I learned this the hard way. You don’t just ship the code; you ship the code + system configuration + hardware profile + OS defaults.
THE ANOMALY
A few weeks ago, I was benchmarking a trading system core written in Rust. The system uses vectorized operations: SIMD intrinsics that process the 1024 floats in each calculation in parallel. On paper, the math is solid: AVX2 should process 8 floats per instruction with fused multiply-add, giving me roughly 7-8x speedup over scalar code.
What I got instead: 0.34x speedup. The scalar version was faster.
SIMD result: 3.199677
Scalar result: 3.199682
SIMD time (10000 iterations): 13.6771ms
Scalar time (10000 iterations): 4.6213ms
SIMD speedup: 0.34x
The results agreed to within floating-point error, so correctness wasn’t the issue. The SIMD code was just... slower. This shouldn’t be possible.
THE LAYERS OF WRONG
The obvious suspects came first:
Is the compiler optimizing it away? No, black_box() prevents that.
Is the algorithm wrong? No, the horizontal sum is standard and the FMA is correct.
Is the CPU too old? No, the Intel Core Ultra 7 155U has AVX2.
Is the data too small? No, 1024 floats is plenty.
Then I checked the binary itself.
objdump -C -d target/release/metis_core.dll | grep -E "vhaddps|vfmadd"
Nothing. Zero AVX2 instructions in the compiled binary. The intrinsics weren’t there.
THE DEBUG CASCADE
This is where it gets interesting. The code explicitly uses unsafe blocks with _mm256_fmadd_ps, _mm256_hadd_ps, and other intrinsics. Rust compiled it without error.
But the binary had no AVX2 instructions.
This meant one thing: the compiler was silently generating non-AVX2 code, even though the intrinsics were wrapped in unsafe blocks.
The code was syntactically valid SIMD. It just wasn’t actually executing SIMD.
This is the kind of bug that doesn’t fail loudly, and it can stay hidden indefinitely. It compiles. It runs. It’s just wrong. The code doesn’t tell you and the tests don’t tell you; only measurement tells you. The code would just be 3x slower than the scalar version, and I would have thought the algorithm was to blame.
THE ROOT CAUSE
The fix required understanding Rust’s CPU feature model.
Rust doesn’t assume your CPU supports AVX2. When you write unsafe { _mm256_fmadd_ps(...) }, you’re telling Rust “trust me, this CPU has AVX2.” But if you don’t also tell the compiler to assume AVX2 is available, rustc builds the calling code for a baseline CPU without AVX2, where the intrinsics cannot be inlined and fused into a vector loop. In practice, that “safe fallback” behaves like scalar code.
The fix: .cargo/config.toml
[build]
rustflags = ["-C", "target-feature=+avx2,+fma", "-C", "target-cpu=native"]
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
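The feature model described above can also be expressed per function rather than per build. The following is a minimal sketch, not the Metis code: the names dot_scalar, dot_avx2, and dot are hypothetical, and the dot-product kernel is an assumption chosen only for illustration. It marks one function with #[target_feature], so the intrinsics can inline there regardless of the global flags, and it checks the CPU at runtime before calling that function.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Portable scalar fallback (hypothetical helper, not from Metis).
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX2 + FMA kernel (hypothetical). The attribute enables both features
/// for this one function, independent of the build-wide rustflags.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // Unaligned loads of 8 floats from each input.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        // acc = va * vb + acc, one fused operation per 8 lanes.
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Simple horizontal sum: spill the 8 lanes and add them up.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut total: f32 = lanes.iter().sum();
    // Handle the tail that does not fill a full 8-lane chunk.
    for i in chunks * 8..n {
        total += a[i] * b[i];
    }
    total
}

/// Dispatcher: use the AVX2 path only when the running CPU reports support.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both required features were just detected at runtime.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

With this layout, a binary built without any special flags still contains the vector path, and it falls back to scalar code on CPUs that lack the features. The configuration file above takes the other route: it enables AVX2 and FMA for every function in the build.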
These flags tell the compiler: “Assume this CPU has AVX2 and FMA. Compile accordingly.”
After adding this file and rebuilding:
Scalar time (10000 iterations): 11.6166ms (1162ns per iteration)
SIMD time (10000 iterations): 1.6043ms (160ns per iteration)
SIMD speedup: 7.24x
The benchmark didn’t change. The code didn’t change. Only the compiler flags changed. But now the intrinsics were actually being compiled to AVX2 instructions.
WHAT C WOULD HAVE TOLD ME
In C, targeting AVX2 without -mavx2 either fails to compile or produces a clear warning, depending on how you’ve written it. The compiler has a closer relationship with hardware targets. Rust’s safety model abstracts this in ways that are usually beneficial but occasionally produce exactly this failure mode: syntactically valid, semantically correct, silently wrong at the systems level. The unsafe block signals “I know what I’m doing” to the borrow checker, but it doesn’t signal anything to the code generation backend about what hardware features to assume.
I’m someone who always defaults to stricter languages when possible. I prefer TypeScript over any frontend framework, and coming from a C/C++ background, I am pretty used to very loud failures from the terminal. This was an interesting case, though, because I do enjoy Rust’s panics and checks compared to just compiling C with fingers crossed. Rust’s unsafe doesn’t mean “throw away all guarantees.” It means “I’m handling memory safety manually.” The hardware feature assumption is a different layer entirely, and that layer stays silent. I learned that every language makes a set of assumptions silently, and those assumptions become production surprises.
CROSS-PLATFORM IMPLICATIONS
With the fix...