Hot path optimization. When float division beats integer division

Optimization catalog. When float division beats integer division

A counterintuitive hot-path optimization: Swapping IDIVQ for DIVSD to divide integers faster

June 08, 2026

I did quite a lot of hot path optimizations while working on my last project and some of them were genuinely interesting ones. So why don't I write about some of them. I'm starting the series with an interesting case where speedup was achieved by unintuitive replacement of int operation with float operation.

Avoiding IDIVQ with float division

Let's take a look at this optimization (ignore the 3x claim, I was wrong in that doc comment):

diff --git a/huffman.go b/huffman.go index 33f8263..180dffc 100644 --- a/huffman.go +++ b/huffman.go @@ -378,7 +378,11 @@ func optimizeHuffmanCountsForRLE(counts []uint32, goodForRLEBuf *[]bool) { if i != length { sum += int(counts[i]) if stride >= 4 { - limit = (256*sum + stride/2) / stride + // float64 division is ~3x faster than IDIVQ for variable + // divisors, and exact for our input range (numerator + // denominator + // representable in float64's 52-bit mantissa). + limit = int(float64(256*sum+stride/2) / float64(stride)) if stride == 4 { limit += 120

I'm replacing integer division (256*sum + stride/2) / stride with float division and converting the result back to int: int(float64(256*sum+stride/2) / float64(stride)). But aren't the integer operations inherently faster than the float operations? Well yes, but actually no. Not in this case.

Before we do anything let's run benchmarks to confirm that the optimization is real (benchmark code):

goos: linux goarch: amd64 pkg: idivq cpu: 12th Gen Intel(R) Core(TM) i5-12500 │ idivq │ float │ │ sec/op │ sec/op vs base │ DivInline 3.358n ± 0% 2.414n ± 0% -28.10% (p=0.002 n=6)

Ok it's real, 28% win on the microbenchmark. Let's also run the benchmark on AMD CPU:

goos: linux goarch: amd64 pkg: idivq cpu: AMD Ryzen 5 7535HS with Radeon Graphics │ idivq │ float │ │ sec/op │ sec/op vs base │ DivInline 2.178n ± 0% 1.959n ± 0% -10.08% (p=0.002 n=6)

~10% win.

So why is it faster? Let's compare the assembly:

Integer (IDIVQ) Float (DIVSD)

Compute 256*sum + stride/2

SHLQ $8, AX MOVQ BX, CX SHRQ $63, BX LEAQ (BX)(CX*1), DX SARQ $1, DX ADDQ DX, AX

Compute 256*sum + stride/2

SHLQ $8, AX MOVQ BX, CX SHRQ $63, BX LEAQ (CX)(BX*1), DX SARQ $1, DX ADDQ AX, DX

TESTQ CX, CX JEQ panicdivide CMPQ CX, $-1 JNE do_div NEGQ AX XORL DX, DX JMP ret

The division by zero check is missing in the float version. As well as handling of overflow special case stride == -1.

Uses very expensive IDIVQ instruction to do the actual division

CQO IDIVQ CX

Uses relatively cheap DIVSD instruction but adds a bunch of code to do intfloat conversions

XORPS X0, X0 CVTSQ2SD DX, X0 XORPS X1, X1 CVTSQ2SD CX, X1 DIVSD X1, X0 CVTTSD2SQ X0, AX

The key point is that IDIVQ is replaced with DIVSD. According to uops.info the IDIVQ gives latency of 14-18 and throughput of 10 cycles (see link). And DIVSD has latency 13 and throughput of 4 cycles (link). So I thought that the number I should look for is throughput.

Why throughput and not latency? Because in this microbenchmark the divisions are independent. One result doesn't depend on the previous one. So the divider unit stays "saturated" as it receives the work at the fastest possible rate. And the rate is bottlenecked by the throughput and not latency.

So the throughput was supposed to drop from 10.0 to 4.0 and that's why I wrote 3x in the doc comment above (should have written 2.5x to be more precise as 10/4=2.5). So why then didn't I get the speedup of 2.5x but rather got only ~1.4x (the 28% Intel win)?

To answer this question I ran perf stat and collected the arith.divider_active metric. The arith.divider_active counts cycles during which the cpu divide unit was busy.

idivqfloat arith.divider_active / op10.00 cyc6.35 cyc total cycles / op10.03 cyc7.29 cyc

The perf says that 10 cycles is spent on IDIVQ exactly as uops.info has predicted (using throughput rather than latency was a right choice after all). While we are here note that total cycles in this case is only 0.03 cycles extra which means that all that machinery compiler added to check the divide-by-zero and overflow is negligible (that was another question I wanted to ask - if the speedup is actually caused by that machinery elimination and perf showed it was not the case. The speedup is truly because of using more performant DIVSD instruction). In the float (DIVSD) case the cpu spends 6.35 cycles on division, not 4.0 cycles as I was expecting. Also it spends ~1 cycle more on some extra work making it 7.29 cycles per op in total. Is uops.info wrong then?

I think it's not. I'm fairly confident that DIVSD timing is operand-dependent

the 6.35 vs 4 gap is itself a sign of it, since uops.info measures throughput with one fixed operand set. I don't have a decisive explanation for the exact rule,...

Hot path optimization. When float division beats integer division

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy