

Mizuki's Blog


Low-level Haskell: The cursed way to emulate inline assembly in Haskell/GHC, or how to return multiple values from a foreign function

Posted on June 30, 2026

This article is an English version of my earlier post “【低レベルHaskell】Haskell (GHC) でもインラインアセンブリに肉薄したい!” (in Japanese). The translation was assisted by AI (if you don’t like reading AI-generated content, please read the Japanese version!).

Modern CPUs have many instructions specialized for particular purposes. Examples include SIMD, instructions useful for hashing and cryptography, and a variety of others. C and C++ have inline assembly and intrinsics, which let you write code that takes advantage of such instructions.

Haskell (GHC), on the other hand, has no such mechanism. But that’s no reason to give up just yet. Let’s find a way to invoke obscure CPU instructions from Haskell, and as efficiently as possible.

First, let me list a few CPU instructions that would be nice to use from Haskell.

The subject: the high and low halves of a product of 64-bit integers

Consider computing the product of two 64-bit integers, obtaining both the high 64 bits and the low 64 bits (128 bits in total).

The ordinary multiplication found in C and Haskell, (*) :: Word64 -> Word64 -> Word64, can only compute the low 64 bits. At the machine-code / assembly level on x86, however, the one-operand mul instruction computes the high 64 bits together with the low 64 bits.

For this kind of processing — “easy at the machine-code level, but non-trivial at the C or Haskell level” — we’d like to use inline assembly or intrinsics.
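To make "non-trivial at the C level" concrete, here is a minimal sketch of the full 64×64 → 128-bit product written with nothing but 64-bit arithmetic. The name mul64x64_portable is hypothetical and is not part of the article's repository; the function only illustrates how much work replaces a single mul instruction.

```c
#include <stdint.h>

/* Portable 64x64 -> 128-bit multiply using only 64-bit arithmetic.
   Each operand is split into 32-bit halves (schoolbook multiplication).
   mul64x64_portable is a hypothetical name used for illustration. */
uint64_t mul64x64_portable(uint64_t a, uint64_t b, uint64_t *outHigh)
{
    uint64_t aLo = a & 0xFFFFFFFFu, aHi = a >> 32;
    uint64_t bLo = b & 0xFFFFFFFFu, bHi = b >> 32;

    uint64_t ll = aLo * bLo;   /* contributes to bits 0..63   */
    uint64_t lh = aLo * bHi;   /* contributes to bits 32..95  */
    uint64_t hl = aHi * bLo;   /* contributes to bits 32..95  */
    uint64_t hh = aHi * bHi;   /* contributes to bits 64..127 */

    /* Sum of the three terms that land in bits 32..63.
       Each term is below 2^32, so the sum cannot overflow 64 bits. */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

    /* The carry out of bits 32..63 is (mid >> 32). */
    *outHigh = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return (mid << 32) | (ll & 0xFFFFFFFFu);
}
```

Four multiplications plus a handful of shifts and additions stand in for what the hardware does in one instruction, which is why access to the instruction itself is attractive.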

(Actually, GHC has an intrinsic timesWord2# :: Word# -> Word# -> (# Word#, Word# #), so you can do this in one shot using it. I chose this subject anyway so that we can measure how much slower the alternatives get compared to a GHC intrinsic.)

As another subject, carry-less multiplication (multiplication of polynomials over the finite field GF(2)) would also be useful for certain purposes. I won’t go into detail in this article, but I’ve placed the test results in the repository.
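For readers unfamiliar with the operation, the following is a rough software sketch of what a 64×64 → 128-bit carry-less multiply computes. It is not the code from the repository, and clmul64_soft is a hypothetical name. On x86 the same result is available from a dedicated instruction (PCLMULQDQ), which is what makes it a candidate for this article's techniques.

```c
#include <stdint.h>

/* Bit-by-bit carry-less multiply: like schoolbook multiplication,
   but partial products are combined with XOR instead of addition.
   clmul64_soft is a hypothetical name used only for illustration. */
uint64_t clmul64_soft(uint64_t a, uint64_t b, uint64_t *outHigh)
{
    uint64_t lo = 0, hi = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            lo ^= a << i;
            /* Bits shifted out of the low word go to the high word.
               Skip i == 0: shifting a 64-bit value by 64 is undefined in C. */
            if (i != 0)
                hi ^= a >> (64 - i);
        }
    }
    *outHigh = hi;
    return lo;
}
```

As a sanity check, multiplying 3 by 3 (the polynomial x + 1 squared) gives 5 (x² + 1) rather than 9, because the middle terms cancel under XOR.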

In C

GCC/Clang have the __int128 type, so you can compute this in one shot using it. No inline assembly or intrinsics required.

unsigned __int128 wideningMul(uint64_t a, uint64_t b)
{
    return (unsigned __int128)a * (unsigned __int128)b;
}

If we deliberately wrote it with inline assembly, it might look like this:

uint64_t wideningMul_inlasm(uint64_t a, uint64_t b, uint64_t *outHigh)
{
    uint64_t lo, hi;
    asm("movq %2, %%rax;"
        // mulq computes the product of %rax and the operand (here %3),
        // placing the high 64 bits in %rdx and the low 64 bits in %rax
        "mulq %3;"
        "movq %%rax, %0;"
        "movq %%rdx, %1;"
        : "=r"(lo), "=r"(hi)
        : "r"(a), "r"(b)
        : "%rax", "%rdx");
    *outHigh = hi;
    return lo;
}

How to return multiple values

Now, this operation takes two uint64_ts and returns a 128-bit value — that is, two uint64_ts. Since C’s syntax has no multiple-value return, you have to choose one of the following ways to return the values:

Return a struct by value: define a struct like struct uint128 { uint64_t lo, hi; } and return it by value.

Returning unsigned __int128 by value corresponds to this internally. See the x86_64 ABI for details.

Pass a pointer: the function takes, as pointer arguments, the locations where the second and later return values should be stored.

Example: the wideningMul_inlasm function I wrote earlier.

As an example of the former, the C standard div, ldiv, and lldiv functions return a {,l,ll}div_t struct by value.

The advantage of returning a struct by value is that, depending on the ABI, if the struct is small the values can be returned while kept in registers.
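As an illustration, the struct-returning variant might look like the sketch below. It reuses the struct uint128 layout mentioned above; the function name wideningMul_struct is hypothetical, and the code assumes GCC or Clang on a 64-bit target (for unsigned __int128).

```c
#include <stdint.h>

/* Two 64-bit halves of a 128-bit value, as described in the text. */
struct uint128 { uint64_t lo, hi; };

/* Hypothetical struct-returning variant of the widening multiply.
   Assumes a compiler that provides unsigned __int128 (GCC/Clang). */
struct uint128 wideningMul_struct(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * (unsigned __int128)b;
    struct uint128 r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}
```

Under the System V x86-64 ABI, a 16-byte struct made of two integers like this one is returned in the rax and rdx registers, so no memory round trip is needed. Other ABIs may return the same struct through hidden memory instead.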

The disadvantage, on the other hand, is that other languages’ C FFI may not support it. In fact, GHC’s current C FFI does not support passing structs by value.
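One common way around this limitation is a small C shim that converts a struct return into a pointer-based one. The sketch below wraps the standard lldiv function; lldiv_with_ptr is a hypothetical name, and the shim is only an illustration of the pattern, not something taken from the article.

```c
#include <stdlib.h>

/* Hypothetical shim: wraps the struct-returning lldiv so that the
   quotient comes back as the return value and the remainder through
   a pointer, a shape that a pointer-based FFI binding can consume. */
long long lldiv_with_ptr(long long numer, long long denom, long long *outRem)
{
    lldiv_t r = lldiv(numer, denom);
    *outRem = r.rem;
    return r.quot;
}
```

The cost is an extra store and load through memory for the second result, which is exactly the overhead that returning the struct in registers would avoid.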

There is a proposal to make C structs passable by value through the FFI, but it has seen no movement:

c structures · Wiki · Glasgow Haskell Compiler / GHC · GitLab

Support C structures in Haskell FFI (#9700) · Issue · ghc/ghc

Using the C FFI (with a pointer)

Safe FFI

When there’s something Haskell can’t do, let’s borrow the power of another language! To that end, Haskell has an FFI. With it, you can call functions written in C.

Let’s give it a try right away:

#include <stdint.h>

extern uint64_t wideningMul_with_ptr(uint64_t a, uint64_t b, uint64_t *outHigh)
{
    unsigned __int128 result = (unsigned __int128)a * (unsigned __int128)b;
    *outHigh = (uint64_t)(result >> 64);
    return (uint64_t)result;
}

As I wrote earlier, the current GHC FFI can’t pass structs by value, so we’ll pass one of the return values via a pointer.

The Haskell side looks like this:

foreign import ccall "wideningMul_with_ptr"
  c_wideningMul_with_ptr :: Word64 -> Word64 -> Ptr Word64 -> IO Word64

wideningMulWithPtr :: Word64 -> Word64 ->...
