My Rust SIMD code was silently running as scalar Part 2

Why My Windows Benchmarks Were Lying — CPU Pinning, Power Caps, and What Variance Actually Tells You

Christopher


Part 2 of building Metis: When optimizing the code isn't enough, you have to fight the operating system

Christopher Jun 08, 2026

In the first part of this series, we looked at a silent failure mode in Rust: calling SIMD intrinsics without explicitly enabling the hardware target flags causes the compiler to silently fall back to scalar execution. Fixing the .cargo/config.toml yielded the expected 7x speedup.

But fixing the code only revealed a deeper problem with the system. Even with the AVX2 instructions firing perfectly, running the same benchmark in two different environments on the same laptop produced a glaring discrepancy: WSL2 was running the math 35% to 92% faster than native Windows.

Worse than the speed difference was the variance. In WSL, execution times varied by about ±15%. On native Windows, the scalar baseline varied by a massive ±125%.

Ordinary measurement noise is around 5%. When your execution time swings by 125% in a tight loop, you aren’t looking at measurement error. You are looking at a system actively fighting your code.

THE INTERVENTION: THREAD PINNING

Most engineers see a 125% variance in a benchmark and try to smooth it out. They run more iterations, drop the outliers, and average the rest to find the “true” number. But variance is a diagnostic signal. A ±125% swing means the physical execution environment is changing mid-flight.

The suspect was the hardware architecture. The machine running these tests uses an Intel Core Ultra 7 155U (Meteor Lake), a chip with a hybrid topology of Performance cores (P-cores), Efficient cores (E-cores), and Low-Power Efficient cores (LP-E cores). When you run a tight, CPU-bound loop on Windows, the scheduler, guided by Intel Thread Director, sees a thread maxing out a core without doing any I/O. To manage thermals and battery life, it classifies your benchmark as a “background task” and moves your thread from a high-frequency P-core to a low-frequency E-core mid-execution.

To test this, I introduced a programmatic intervention using the core_affinity crate. I ran the benchmark three ways: first with no thread lock, then with the benchmark thread locked to core 0 (a P-core on this machine), and finally locked to the last core.

```rust
// Pin the current (benchmark) thread to the first core the OS reports.
if let Some(core_ids) = core_affinity::get_core_ids() {
    core_affinity::set_for_current(core_ids[0]);
}
```
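To treat variance as a signal rather than something to average away, it helps to report the spread of each configuration next to its typical time. The following is a minimal sketch using only the standard library. The function names `time_runs` and `median_and_spread` are hypothetical, and defining "±X%" as the largest deviation from the median is an assumption, since the exact metric behind the figures above is not stated.

```rust
use std::time::Instant;

/// Times `work` once per run and returns each run's duration in nanoseconds.
/// (Hypothetical helper; apply any thread pinning before calling it.)
fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> Vec<f64> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed().as_nanos() as f64
        })
        .collect()
}

/// Returns `(median, spread_percent)`, where `spread_percent` is the largest
/// deviation from the median expressed as a percentage of the median.
/// ASSUMPTION: this is one possible reading of "±X%", not a stated metric.
fn median_and_spread(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let mid = sorted.len() / 2;
    let median = if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    };
    if median == 0.0 {
        return None; // a relative spread is undefined for a zero median
    }

    let max_dev = sorted
        .iter()
        .map(|s| (s - median).abs())
        .fold(0.0, f64::max);
    Some((median, 100.0 * max_dev / median))
}
```

One way to use this is to call `time_runs` once per configuration (unpinned, pinned to the first core, pinned to the last core) and compare the spreads rather than only the medians. A spread that collapses under pinning while the median stays put would point at scheduling rather than at the code.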

The result was immediate. The ±125% variance vanished. The execution times flatlined into near-perfect consistency. But the anomaly survived: Windows was still drastically slower than WSL.

THE PROCESSOR FINDING: POWER POLICY VS SCHEDULING

This is where isolating the variables pays off. Thread pinning solved the variance, but it didn’t close the performance gap. This showed that the two symptoms had completely different root causes. The variance was a scheduling artifact. The performance gap came from a hardware power policy.

When I pinned the thread to the P-core, I forced Windows to hold the workload there. But Windows power management has strict rules. When it sees a sustained 100% load on a P-core, it enforces a hard thermal and power cap, throttling the clock speed to a flat, unyielding 1700 MHz.

WSL2, on the other hand, runs on a Linux kernel inside a lightweight Hyper-V virtual machine. The hypervisor abstracts away the aggressive Windows user-space power profiles. When the WSL kernel asks for compute, the hypervisor grants it, allowing the CPU to hold a rock-solid 2688 MHz.

FIXED VS. TUNABLE CONSTRAINTS

In systems engineering, every bottleneck falls into one of three buckets: fixed, tunable, or unknown. The goal of benchmarking isn’t just to measure speed; it’s to move bottlenecks out of the “unknown” bucket.

The Scheduler (Tunable): The OS moving your thread to an E-core is a tunable constraint. You can write code to fight it (thread pinning/affinity masks).

The Power Cap (Fixed): The 1700 MHz throttle on native Windows is a fixed constraint. Short of diving into the BIOS or hacking the registry to override the laptop’s thermal limits, your application code cannot spin the silicon faster than the OS allows.
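A fixed clock has one useful property: cycle counts convert directly into time. The sketch below works through the two clock speeds quoted above. It assumes throughput scales linearly with clock frequency, which is a simplification (memory-bound code will not scale this way), and the function names are hypothetical.

```rust
/// Throughput gain in percent if speed scaled linearly with clock frequency.
/// ASSUMPTION: linear scaling; real workloads may gain less.
fn clock_gain_percent(fast_mhz: f64, slow_mhz: f64) -> f64 {
    100.0 * (fast_mhz / slow_mhz - 1.0)
}

/// Converts a cycle count to nanoseconds at a fixed clock.
/// 1 MHz is one cycle per microsecond, so ns = cycles * 1000 / MHz.
fn cycles_to_ns(cycles: u64, clock_mhz: f64) -> f64 {
    cycles as f64 * 1_000.0 / clock_mhz
}

/// Number of clock cycles that elapse during a stall of `stall_us` microseconds.
fn stall_cycles(stall_us: f64, clock_mhz: f64) -> f64 {
    stall_us * clock_mhz
}
```

With these numbers, `clock_gain_percent(2688.0, 1700.0)` comes out at roughly 58%, which sits inside the 35% to 92% gap measured between WSL2 and native Windows. That is consistent with the clock cap explaining much of the gap, though it does not prove it. For budgeting, a hypothetical 10,000-cycle critical path takes about 5,882 ns at 1700 MHz versus about 3,720 ns at 2688 MHz.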

In a production trading environment, you care deeply about both, but for different reasons.

You can engineer a system around a consistently slow clock speed. If you know the machine runs at 1700 MHz, you calculate your latency budgets accordingly. But you cannot engineer around a system that randomly pauses for 99µs to context-switch your critical path onto an efficiency core. You pay for tail latency, and unmanaged hybrid scheduling creates a tail you cannot predict.

THE HONEST SYSTEM

If I had just looked at the averages, I would have concluded that “WSL is faster than Windows” and moved on. But by digging into the variance, the real lesson emerged: Thread pinning didn’t make Windows faster. It made...
