Performance in BQN versus C

tosh1 pts0 comments

Performance in BQN versus C

(github) / BQN / implementation

Performance in BQN versus C

The title alone! You know BQN has some fast performance to wave around!

If you think it's paradoxical that programs written for CBQN, a C-based interpreter, might outperform programs written directly in C, it's because you think performance comes from the language implementation. Not so: performance comes from the programmer, taking advantage of the features offered by their language implementation. The difficulty of doing so has two consequences:

BQN programs can and do outperform equivalent C programs.

Your BQN programs are unlikely to outperform equivalent C programs.

Because once you have a working program, you are probably going to come to the realization that you'd rather do something else with your time than study and optimize until it can beat the C program that you never even wrote in the first place. You wouldn't have made the C program as fast as possible either: that only makes sense for programs that are used over and over in a performance-sensitive context and simple enough to optimize well. Like… the CBQN interpreter.

Here's a highly authoritative graph on how the effort to speed tradeoff could work, in one case, maybe.

Non-portable

BQN

Effort<br>Speed

I'm considering something basically straightforward (no big algorithmic improvements) and that can be implemented with whole arrays, but not trivially. BQN lets you quickly write a working program, but the easy way usually involves lots of small arrays and slow iteration. C on the other hand is notorious for being hard to write, but once you do it you'll get good performance if you didn't cut corners. Going beyond to get the most out of a CPU takes a lot more work, and often manual use of SIMD or other intrinsics, which isn't portable. CBQN gives you access to some of those methods without as much effort, which is how it can end up beating a basic C program. But evaluating one primitive at a time has overhead, so it's never as good as specialized C.

A major factor that separates low- and high-performance programs in both C and BQN is not actually time spent but experience with the required programming techniques. In portable C this means branchless tricks, bit manipulation, and writing things an auto-vectorizer can handle. In BQN it's a totally different set of skills needed to write in an array-oriented style. Surprisingly, non-portable C often benefits from those array-oriented skills as well: I often find better ways to implement BQN primitives by thinking in terms of other primitives!

Case studies

All right, how can I so confidently claim that CBQN can—sometimes—be faster than C in practice? First off, there's a certain class of problems where it's routine: BQN primitives. Every one that does anything, really. Scan, Transpose, Indices, Sort, Modulus, Reshape. And so on. We write the ordinary C implementation, and it's just not good enough, so we have to use lookup tables, SIMD, blocked multi-pass techniques, and more. A 10x improvement over ordinary C code is completely normal. But this is the best possible case for BQN; combinations of primitives never do so well.

Small real-world problems can still show a major difference. In my first talk at Dyalog (video, zipped slides), as well as a follow-up next year (video, zipped slides), I considered the problem of replacing every CRLF line ending in a file with just the second character LF. BQN nails this one, breaking even with C at a little under 200 bytes and hitting 5x C's speed on inputs of a few thousand bytes or more in my testing. That's with AVX-512 disabled; with it BQN is over 10x faster until cache becomes a bottleneck.

(Benchmark details)

The benchmark is run with a CRLF every 100 characters on average, placed with a simple LCG for reproducibility. This is just the number I picked in the Dyalog presentation and isn't particularly favorable to BQN, as it only gets an advantage from sparse Replicate at lower densities like 1/1000 and branching C code would be penalized at higher ones like 1/4. C code is also taken from the Dyalog talk. gcc 12.2.1 generates short branching code with the if-based function, while clang 15.0.7 converts both functions to branchless and unrolls by a factor of 4. The two are very close in speed with density 1/100, and gcc is slightly faster at lower densities and much slower at high ones.

lineending.bqn:

cr‿lf ← @+13‿10<br>CRLF_to_LF ← (cr⊸≠ ∨ ·«lf⊸≠)⊸/

n ← 1e6 ⋄ m ← n÷100<br>r ← 29 ⋄ LCG ← {(1-˜2⋆31)|16807×𝕩}<br>str ← {i←(n-1)|r ⋄ r LCG↩ ⋄ cr⌾(i⊸⊑)lf⌾((i+1)⊸⊑)𝕩}⍟m @+100|↕n

ls ← ≤⟜n⊸/ ⥊(10⋆2+↕5)×⌜1‿3<br>Disp ← { •Out "ns/elt" ∾˜ ∾ ∾⟜" "¨ ¯7‿8 ↑⟜•Repr¨ 𝕩 }<br>{𝕊l: Disp l ⋈ 1e9×l÷˜ (⌊5e8÷l) CRLF_to_LF•_timed l↑str}¨ ls

lineending.c:

#include "stdlib.h"<br>#include "stdio.h"<br>#include "time.h"

#if 0<br>// Branchless<br>__attribute((noinline)) void crlf_to_lf(char* dst, char* src, size_t n) {<br>int was_cr = 0;<br>for (size_t i=0; in; i++) {<br>char c = src[i];<br>dst -= (was_cr &&...

performance programs from program implementation cbqn

Related Articles