Loop Fusion in Array Languages

tosh1 pts0 comments

BQN: Loop fusion in array languages

(github) / BQN / implementation / compile

Loop fusion in array languages

Interpreted array languages have a major problem. Let's say you evaluate some arithmetic on a few arrays. Perhaps the first operation adds two arrays. It will loop over them, ideally adding numbers a vector register at a time, and write the results to an array. Maybe next it will check if the result is more than 10. So it'll read vectors from the result, compare to 10, pack to bit booleans, and write to another array. Each primitive has been implemented well but the combination is already far from optimal! The first result array isn't needed: it would be much better to compare each added vector to 10 right when it's produced. The extra store and load (and index arithmetic) are instructions that we don't need, but by using extra memory we can also create cache pressure that slows down the program even more.

Scalar languages don't have this problem. The programmer just writes the addition and comparison in a loop, the compiler compiles it, and every comparison naturally follows the corresponding addition. More modern languages might prefer approaches like iterators that abstract away the looping but still have the semantics of a fused loop. But an iterator call, let's say zipwith(+, a.iter(), b.iter()).map(>10) to make up some syntax, has a pretty obvious array equivalent, and if the functions are pure the different semantics don't matter! This has led to several compiled array languages like APEX that work on the principle of re-interpreting the scalar parts of array operations in a way that fuses naturally.

Scalar compilation gives up many advantages inherent to array programming, a topic I discussed more broadly here. The obvious complaint is that you lose the vector instructions, but that's easy enough to dismiss. Any decent C compiler can auto-vectorize a loop, and so could an array compiler. But arithmetic is rarely the bottleneck, so let's say that the comparison's result will be used to filter a third array, that is, the expression is now (10a+b)/c. Filtering doesn't auto-vectorize! At least the C compilers I've dealt with will fall back to producing completely scalar code. Depending on type, this can actually be slower than CBQN's un-fused, but SIMD, primitives.

In this case the problem is more that compilers don't know how to vectorize filtering, since in AVX-512 at least there's an instruction to do it and write a partial result. Fusing the result into more arithmetic later would be a more fundamental difficulty, because one round of Replicate produces an unknown number of values so it can't be directly paired with an input vector. What we need is a way to cut the fusion at this point, writing to memory as an escape. I believe array languages are best served by two levels of fusion, a looser level that ensures this memory use isn't excessive, and tighter fusion at the level of registers.

Blocking and cache levels

The loosest form of loop fusion goes by various names such as blocking, chunking, or tiling. Instead of running each primitive on the entire array, we run it on a block of a few kilobytes. Looping within a block stays separate, but the outer layer of looping over blocks can be fused. So in (10a+b)/c we'd add blocks from a and b, compare each one to 10, and use the result to filter a block of c, before moving on to the next set of blocks. This has the advantage that it doesn't actually require compilation, as blocks can still be processed with pre-compiled functions. It has the disadvantage that each block operation still reads and writes to memory—hang on, what problem are we actually trying to solve here?

For basic arithmetic, working from memory is a big relative cost, because even at the fastest cache level a load or store costs about as much the arithmetic itself. Heavier primitives like scans, filtering, transpose, or searching in a short list, do a lot more work in between, so if load and store only cost about as much as arithmetic that's actually pretty good. But large arrays don't fit in the fastest cache level. If a primitive writes a large result, then by the time it's done only a little piece at the end is still saved in L1. The next primitive will start at the beginning and miss L1 entirely! If the interpreter instead works in blocks that are significantly smaller than the L1 cache, accesses between primitives should stay within L1 and only the boundary of fusion, meaning the initial reads and final write, can be slow.

Blocking is still compatible with finer-grained fusion. Primitives should simply be fused as block operations and not whole-array operations.

Sounds simple, but blocking has its difficulties even once the primitives to be blocked have been identified. As is becoming a pattern, one-to-one arithmetic suffers from none of these and is trivial.

Do we want to block at all? For very small arrays, setting up the block system may be expensive....

array fusion loop languages arithmetic result

Related Articles