Performance: unroll popcount reductions and add a NextClear fast path by lemire · Pull Request #219 · bits-and-blooms/bitset · GitHub
//voltron/pull_requests_fragments/pull_request_layout" data-turbo-transient="true" />
Skip to content
Search or jump to...
Search code, repositories, users, issues, pull requests...
-->
Search
Clear
Search syntax tips
Provide feedback
--><br>We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Cancel
Submit feedback
Saved searches
Use saved searches to filter your results more quickly
-->
Name
Query
To see all available qualifiers, see our documentation.
Cancel
Create saved search
Sign in
//voltron/pull_requests_fragments/pull_request_layout;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up
Appearance settings
Resetting focus
You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.
Dismiss alert
{{ message }}
bits-and-blooms
bitset
Public
Uh oh!
There was an error while loading. Please reload this page.
Notifications<br>You must be signed in to change notification settings
Fork<br>188
Star<br>1.5k
Merged<br>Performance: unroll popcount reductions and add a NextClear fast path#219lemire merged 1 commit intomasterbits-and-blooms/bitset:masterfrom minor_optimizationsbits-and-blooms/bitset:minor_optimizationsCopy head branch name to clipboard
Conversation
Copy link
Copy Markdown
Member
lemire
commented
Jun 6, 2026
This PR speeds up three hot paths in the bitset package.
Benchmark<br>Before<br>After<br>Change
LemireCount (popcount of ~1.5M words)<br>820.1 µs<br>486.8 µs<br>−40.6%
DifferenceCardinality<br>928.2 ns<br>515.9 ns<br>−44.4%
NextClear<br>828.5 ns<br>498.7 ns<br>−39.8%
popcntSlice and popcntMaskSlice both summed one popcount per loop iteration
into a single accumulator:
for _, x := range s {<br>cnt += uint64(bits.OnesCount64(x))
bits.OnesCount64 lowers to the hardware POPCNT instruction. On modern x86,
POPCNT has a throughput of one per cycle but a latency of ~3 cycles. With a
single accumulator, each cnt += ... must wait for the previous addition to
retire, so the loop is latency-bound: it runs at roughly one popcount every
3 cycles instead of one per cycle.
The fix is the classic technique of unrolling by four with four independent
accumulators, which breaks the dependency chain and lets the CPU keep multiple
POPCNT units in flight. A scalar tail loop handles the remainder when the
length is not a multiple of four:
var c0, c1, c2, c3 uint64<br>n := len(s)<br>i := 0<br>for ; i n-4; i += 4 {<br>c0 += uint64(bits.OnesCount64(s[i]))<br>c1 += uint64(bits.OnesCount64(s[i+1]))<br>c2 += uint64(bits.OnesCount64(s[i+2]))<br>c3 += uint64(bits.OnesCount64(s[i+3]))<br>cnt = c0 + c1 + c2 + c3<br>for ; i n; i++ {<br>cnt += uint64(bits.OnesCount64(s[i]))
The same shape is applied to popcntMaskSlice (which counts s[i] &^ m[i]),
preserving the existing bounds-check-elimination hint.
-->
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
-->
All reactions
minor optimizations (unrolling +)
bfd1801
Hide details<br>View details
lemire
merged commit 14f71e4<br>into
master
Jun 6, 2026
58 checks passed
Uh oh!
There was an error while loading. Please reload this page.
-->
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.<br>Learn more about bidirectional Unicode characters
Show hidden characters
Sign up for free<br>to join this conversation on GitHub .<br>Already have an account?<br>Sign in to comment
-->
Reviewers
No reviews
-->
Assignees
No one assigned
Labels
None yet
-->
Projects
None yet
-->
Milestone
No milestone
-->
Development
Successfully merging this pull request may close these issues.
Uh oh!
There was an error while loading. Please reload this page.
1 participant
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
You can’t perform that action at this...