Performance: Unroll popcount reductions and add a NextClear fast path

tosh1 pts0 comments

Performance: unroll popcount reductions and add a NextClear fast path by lemire · Pull Request #219 · bits-and-blooms/bitset · GitHub

//voltron/pull_requests_fragments/pull_request_layout" data-turbo-transient="true" />

Skip to content

Search or jump to...

Search code, repositories, users, issues, pull requests...

-->

Search

Clear

Search syntax tips

Provide feedback

--><br>We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel

Submit feedback

Saved searches

Use saved searches to filter your results more quickly

-->

Name

Query

To see all available qualifiers, see our documentation.

Cancel

Create saved search

Sign in

//voltron/pull_requests_fragments/pull_request_layout;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up

Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

{{ message }}

bits-and-blooms

bitset

Public

Uh oh!

There was an error while loading. Please reload this page.

Notifications<br>You must be signed in to change notification settings

Fork<br>188

Star<br>1.5k

Merged<br>Performance: unroll popcount reductions and add a NextClear fast path#219lemire merged 1 commit intomasterbits-and-blooms/bitset:masterfrom minor_optimizationsbits-and-blooms/bitset:minor_optimizationsCopy head branch name to clipboard

Conversation

Copy link

Copy Markdown

Member

lemire

commented

Jun 6, 2026

This PR speeds up three hot paths in the bitset package.

Benchmark<br>Before<br>After<br>Change

LemireCount (popcount of ~1.5M words)<br>820.1 µs<br>486.8 µs<br>−40.6%

DifferenceCardinality<br>928.2 ns<br>515.9 ns<br>−44.4%

NextClear<br>828.5 ns<br>498.7 ns<br>−39.8%

popcntSlice and popcntMaskSlice both summed one popcount per loop iteration

into a single accumulator:

for _, x := range s {<br>cnt += uint64(bits.OnesCount64(x))

bits.OnesCount64 lowers to the hardware POPCNT instruction. On modern x86,

POPCNT has a throughput of one per cycle but a latency of ~3 cycles. With a

single accumulator, each cnt += ... must wait for the previous addition to

retire, so the loop is latency-bound: it runs at roughly one popcount every

3 cycles instead of one per cycle.

The fix is the classic technique of unrolling by four with four independent

accumulators, which breaks the dependency chain and lets the CPU keep multiple

POPCNT units in flight. A scalar tail loop handles the remainder when the

length is not a multiple of four:

var c0, c1, c2, c3 uint64<br>n := len(s)<br>i := 0<br>for ; i n-4; i += 4 {<br>c0 += uint64(bits.OnesCount64(s[i]))<br>c1 += uint64(bits.OnesCount64(s[i+1]))<br>c2 += uint64(bits.OnesCount64(s[i+2]))<br>c3 += uint64(bits.OnesCount64(s[i+3]))<br>cnt = c0 + c1 + c2 + c3<br>for ; i n; i++ {<br>cnt += uint64(bits.OnesCount64(s[i]))

The same shape is applied to popcntMaskSlice (which counts s[i] &^ m[i]),

preserving the existing bounds-check-elimination hint.

-->

Sorry, something went wrong.

Uh oh!

There was an error while loading. Please reload this page.

-->

All reactions

minor optimizations (unrolling +)

bfd1801

Hide details<br>View details

lemire

merged commit 14f71e4<br>into

master

Jun 6, 2026

58 checks passed

Uh oh!

There was an error while loading. Please reload this page.

-->

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.<br>Learn more about bidirectional Unicode characters

Show hidden characters

Sign up for free<br>to join this conversation on GitHub .<br>Already have an account?<br>Sign in to comment

-->

Reviewers

No reviews

-->

Assignees

No one assigned

Labels

None yet

-->

Projects

None yet

-->

Milestone

No milestone

-->

Development

Successfully merging this pull request may close these issues.

Uh oh!

There was an error while loading. Please reload this page.

1 participant

Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

You can’t perform that action at this...

applied bits reload while uint64 onescount64

Related Articles