Hot path optimization. When float division beats integer division

birdculture1 pts0 comments

Optimization catalog. When float division beats integer division

Optimization catalog. When float division beats integer division

A counterintuitive hot-path optimization: Swapping IDIVQ for DIVSD to divide integers faster

June 08, 2026

I did quite a lot of hot path optimizations while working on my last project and<br>some of them were genuinely interesting ones. So why don't I write about some of<br>them. I'm starting the series with an interesting case where speedup was<br>achieved by unintuitive replacement of int operation with float operation.

Avoiding IDIVQ with float division

Let's take a look at this<br>optimization<br>(ignore the 3x claim, I was wrong in that doc comment):

diff --git a/huffman.go b/huffman.go<br>index 33f8263..180dffc 100644<br>--- a/huffman.go<br>+++ b/huffman.go<br>@@ -378,7 +378,11 @@ func optimizeHuffmanCountsForRLE(counts []uint32, goodForRLEBuf *[]bool) {<br>if i != length {<br>sum += int(counts[i])<br>if stride >= 4 {<br>- limit = (256*sum + stride/2) / stride<br>+ // float64 division is ~3x faster than IDIVQ for variable<br>+ // divisors, and exact for our input range (numerator + // denominator + // representable in float64's 52-bit mantissa).<br>+ limit = int(float64(256*sum+stride/2) / float64(stride))<br>if stride == 4 {<br>limit += 120

I'm replacing integer division (256*sum + stride/2) / stride with float<br>division and converting the result back to int: int(float64(256*sum+stride/2) / float64(stride)). But aren't the integer operations inherently faster than the<br>float operations? Well yes, but actually no. Not in this case.

Before we do anything let's run benchmarks to confirm that the optimization is<br>real (benchmark<br>code):

goos: linux<br>goarch: amd64<br>pkg: idivq<br>cpu: 12th Gen Intel(R) Core(TM) i5-12500<br>│ idivq │ float │<br>│ sec/op │ sec/op vs base │<br>DivInline 3.358n ± 0% 2.414n ± 0% -28.10% (p=0.002 n=6)

Ok it's real, 28% win on the microbenchmark. Let's also run the benchmark on AMD<br>CPU:

goos: linux<br>goarch: amd64<br>pkg: idivq<br>cpu: AMD Ryzen 5 7535HS with Radeon Graphics<br>│ idivq │ float │<br>│ sec/op │ sec/op vs base │<br>DivInline 2.178n ± 0% 1.959n ± 0% -10.08% (p=0.002 n=6)

~10% win.

So why is it faster? Let's compare the assembly:

Integer (IDIVQ) Float (DIVSD)

Compute 256*sum + stride/2

SHLQ $8, AX<br>MOVQ BX, CX<br>SHRQ $63, BX<br>LEAQ (BX)(CX*1), DX<br>SARQ $1, DX<br>ADDQ DX, AX

Compute 256*sum + stride/2

SHLQ $8, AX<br>MOVQ BX, CX<br>SHRQ $63, BX<br>LEAQ (CX)(BX*1), DX<br>SARQ $1, DX<br>ADDQ AX, DX

TESTQ CX, CX<br>JEQ panicdivide<br>CMPQ CX, $-1<br>JNE do_div<br>NEGQ AX<br>XORL DX, DX<br>JMP ret

The division by zero check is missing in the float version. As well as handling of overflow special<br>case stride == -1.

Uses very expensive IDIVQ instruction to do the actual division

CQO<br>IDIVQ CX

Uses relatively cheap DIVSD instruction but adds a bunch of code to<br>do intfloat conversions

XORPS X0, X0<br>CVTSQ2SD DX, X0<br>XORPS X1, X1<br>CVTSQ2SD CX, X1<br>DIVSD X1, X0<br>CVTTSD2SQ X0, AX

The key point is that IDIVQ is replaced with DIVSD. According to uops.info<br>the IDIVQ gives latency of 14-18 and throughput of 10 cycles (see<br>link). And DIVSD has<br>latency 13 and throughput of 4 cycles<br>(link). So I thought<br>that the number I should look for is throughput.

Why throughput and not latency? Because in this microbenchmark the divisions are<br>independent. One result doesn't depend on the previous one. So the divider unit<br>stays "saturated" as it receives the work at the fastest possible rate. And the<br>rate is bottlenecked by the throughput and not latency.

So the throughput was supposed to drop from 10.0 to 4.0 and that's why I wrote<br>3x in the doc comment above (should have written 2.5x to be more precise as<br>10/4=2.5). So why then didn't I get the speedup of 2.5x but rather got only<br>~1.4x (the 28% Intel win)?

To answer this question I ran perf stat and collected the<br>arith.divider_active metric. The arith.divider_active counts cycles during<br>which the cpu divide unit was busy.

idivqfloat<br>arith.divider_active / op10.00 cyc6.35 cyc<br>total cycles / op10.03 cyc7.29 cyc

The perf says that 10 cycles is spent on IDIVQ exactly as uops.info has<br>predicted (using throughput rather than latency was a right choice after all).<br>While we are here note that total cycles in this case is only 0.03 cycles extra<br>which means that all that machinery compiler added to check the divide-by-zero<br>and overflow is negligible (that was another question I wanted to ask - if the<br>speedup is actually caused by that machinery elimination and perf showed it was<br>not the case. The speedup is truly because of using more performant DIVSD<br>instruction). In the float (DIVSD) case the cpu spends 6.35 cycles on division,<br>not 4.0 cycles as I was expecting. Also it spends ~1 cycle more on some extra<br>work making it 7.29 cycles per op in total. Is uops.info wrong then?

I think it's not. I'm fairly confident that DIVSD timing is operand-dependent

the 6.35 vs 4 gap is itself a sign of it, since uops.info measures<br>throughput with one fixed operand set. I don't have a decisive explanation for<br>the exact rule,...

division idivq stride float cycles divsd

Related Articles