Tuning LLVM's SLP Vectorizer Cost Model

May 24, 2026· Kavin Gnanapandithan

Similar to my last post, this write-up covers how I fixed a performance regression in LLVM by analyzing a benchmark on a RISC-V target.

TLDR A recent LLVM patch introduced ordered vector reductions to replace a chain of scalar fadds, but it triggered a performance regression on a benchmark by failing to account for the cost of building the initial vector on every iteration. This in turn caused unprofitable code to be deemed “profitable.” PR, Issue
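As a rough illustration of that cost argument, here is a toy sketch in Python. It assumes the usual shape of a vectorizer profitability check: estimate the cost of the vector code, subtract the cost of the scalar code it replaces, and vectorize only if the difference clears a threshold. The function name `slp_decision` and every cost number below are made up for illustration; they are not LLVM's API or its real RISC-V costs.

```python
def slp_decision(scalar_cost, reduction_cost, build_vector_cost,
                 count_build_vector, threshold=0):
    """Toy profitability check (hypothetical, not LLVM's implementation).

    Returns (delta, profitable), where delta = vector cost - scalar cost.
    A negative delta means the vector code is estimated to be cheaper.
    """
    vector_cost = reduction_cost
    if count_build_vector:
        # Cost of gathering the scalars into a vector register first.
        vector_cost += build_vector_cost
    delta = vector_cost - scalar_cost
    return delta, delta < -threshold


# Made-up costs: four scalar fadds, one ordered reduction,
# and a build-vector step that touches each of the four scalars.
SCALAR_CHAIN = 4
ORDERED_REDUCTION = 3
BUILD_VECTOR = 4

# Ignoring the build-vector step makes the reduction look like a win.
buggy = slp_decision(SCALAR_CHAIN, ORDERED_REDUCTION, BUILD_VECTOR,
                     count_build_vector=False)

# Counting it flips the decision.
fixed = slp_decision(SCALAR_CHAIN, ORDERED_REDUCTION, BUILD_VECTOR,
                     count_build_vector=True)

print("build vector ignored:", buggy)
print("build vector counted:", fixed)
```

With these invented numbers, leaving out the build-vector term reports a saving, while including it reports a loss. That is the shape of the bug described above, not a reproduction of the actual cost values.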

The Regression

Looking at Igalia’s LNT instance for the BPI-F3, I noticed this particular benchmark with a delta of 89%. Specifically, there was a ~26% increase in issued instructions and a ~48% increase in cycles.

I have attached two more pictures right below. The first shows the assembly of a basic block from the older build, and the second shows the corresponding assembly from the newer build.

Info

Bn here stands for billions of cycles. This basic block takes roughly twice as many cycles to execute in the newer build.

We can see that the newer build of LLVM performs a sequence of fsd (floating-point store double) instructions. It’s storing the floating-point values from those registers onto the stack, one after another, starting at the address s1 + 0x80.

From a preceding basic block that I have not included here, I know that register a5 holds s1 + 0x80, set by this instruction:

addi a5, s1, 0x80

The vector load instruction vle64.v then loads the values from memory at the address held in a5 (s1 + 0x80) into the vector register v16.

v16 = M[a5]

Finally, it executes the vfredosum.vs instruction (ordered floating-point sum reduction), which performs the following for a vector length of VL.

\[ vd[0] = \left( \dots \left( \left( vs1[0] + vs2[0] \right) + vs2[1] \right) + \dots + vs2[VL-1] \right) \]

The new codegen is basically trying to replace the ordered fadd instructions in the first basic block with this vector sum reduction instruction. This diagram should illustrate it better, showing what was happening previously versus what happens now. From the images above, it can be observed that the new code is significantly more expensive in terms of cycles.

graph LR
  %% Left-to-Right top level allows parallel tracks to scale height independently
  V_IN["Original Source Data<br/>(Residing in Scalar Registers)"]

  %% Link to Scalar Track
  V_IN -->|Old Execution| S1

  %% Link to Vector Track
  V_IN -->|New Execution| FSD1

  %% Left Side Track: Original Scalar Chain
  subgraph ScalarChain ["Original Intent: Ordered fadd Chain"]
    direction TB
    S1["fadd (START_VAL + SCALAR_VAL_0)"] --> S2["fadd (Result + SCALAR_VAL_1)"]
    S2 --> S3["fadd (Result + SCALAR_VAL_2)"]
    S3 --> S4["fadd (Result + SCALAR_VAL_3)<br/>FINAL_SCALAR_SUM"]
  end

  %% Right Side Track Part 1: Memory Gather (Now cleanly scales to its actual contents)
  subgraph Gather ["Memory Gather (Stack Spilling Penalty)"]
    direction TB
    FSD1["Store SCALAR_VAL_0 to s1 + 0x80"]
    FSD2["Store SCALAR_VAL_1 to s1 + 0x88"]
    FSD3["Store SCALAR_VAL_2 to s1 + 0x90"]
    FSD4["Store SCALAR_VAL_3 to s1 + 0x98"]
  end

  subgraph VectorOps ["Vector Load & Reduction"]
    direction TB
    VLE["Vector Load<br/>(Loads from memory into vector register v16)"] --> VFRED["Vector Ordered Sum Reduction<br/>(vfredosum)"]
  end

  %% Connect the two vector halves sequentially
  FSD4 --> VLE

  %% --- STYLING BLOCK ---
  style ScalarChain fill:none,stroke:#888888,stroke-dasharray: 5 5

  %% Gather Penalty (Soft Red Glow)
  style Gather fill:#b71c1c18,stroke:#ef4444,stroke-width:2px,rx:8,ry:8

  %% Vector Operations (Soft Blue Glow)
  style VectorOps fill:#0d47a118,stroke:#3b82f6,stroke-width:2px,rx:8,ry:8
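To make the two tracks concrete, here is a minimal Python model of both paths. It is only a sketch: the function names, the dictionary standing in for the stack, and the input values are all assumptions made for illustration. The point is that an ordered reduction adds the elements in exactly the same order as the scalar fadd chain, so both paths produce the same floating-point result. The new path just pays for the stores and the vector load first.

```python
def scalar_fadd_chain(start, values):
    """Old path: one fadd per value, strictly in order."""
    acc = start
    for v in values:
        acc = acc + v
    return acc


def vfredosum(vs1_0, vs2):
    """Model of an ordered sum reduction:
    (((vs1[0] + vs2[0]) + vs2[1]) + ...) + vs2[VL-1]."""
    acc = vs1_0
    for element in vs2:
        acc = acc + element
    return acc


def store_load_reduce(start, scalars, base=0x80):
    """New path: store each scalar to memory, load them back as a vector,
    then run the ordered reduction. The dict stands in for the stack."""
    memory = {}
    for i, value in enumerate(scalars):
        memory[base + 8 * i] = value  # one fsd per scalar, 8 bytes apart
    v16 = [memory[base + 8 * i] for i in range(len(scalars))]  # vle64.v
    return vfredosum(start, v16)


# Illustrative values chosen so that the order of additions matters.
values = [1e16, 1.0, -1e16, 1.0]

old_result = scalar_fadd_chain(0.0, values)
new_result = store_load_reduce(0.0, values)

# A pairwise (unordered) sum of the same values, for contrast.
tree_result = (values[0] + values[1]) + (values[2] + values[3])

print("scalar chain:      ", old_result)
print("store/load/reduce: ", new_result)
print("pairwise tree sum: ", tree_result)
```

The pairwise sum gives a different answer from the other two. Floating-point addition is not associative, so a reduction that stands in for the scalar chain has to be the ordered kind unless reassociation is explicitly allowed.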

Info

If you visit the link yourself, you may notice another basic block further down that also shows a significant increase in cycles compared to its older counterpart. I chose not to include it because the two blocks are identical, so fixing one fixes the other.

The Where?

To narrow down where these new fsd and vfredosum.vs instructions are introduced, I ran the command below to emit the LLVM IR. The output of this command gives us the intermediate representation produced by the middle-end. If we can observe IR that would result in those instructions, we can rule out the backend.

$lbd/bin/clang -O3 \
  --target=riscv64-unknown-linux-gnu \
  -march=rva22u64_v \
  --gcc-toolchain=/usr \
  --sysroot=/usr/riscv64-linux-gnu \
  -I. \
  -DFP_ABSTOLERANCE=1e-5...
