Training an LLM in Swift, Part 2: macOS built-in frameworks

Cocoa with Love

June 8, 2026 by Matt Gallagher

Covering the Accelerate, BNNS, CoreML and MPS implementations in macOS.

Tags: swift, llm, machine-learning, performance, metal

In this article, I’m going to look at some of the frameworks that are built into macOS for numerical algorithms. Fitting with the theme of this series, I’ll mostly be looking at frameworks that can also train an ML model. But there are a lot of different approaches – Accelerate (BLAS), BNNS, CoreML, MPSGraph – and the real challenge is knowing which one to use, or whether they are even usable for training.

What on earth is “fchGelu”?

As with last time, I’ll be talking about code in examples but I won’t really be explaining the mechanics of LLMs or the terminology. I’m really here to talk about the macOS frameworks, not the models. I know, it’s pretty cryptic but this isn’t a beginner’s course. If you want to learn more about the terminology, try watching Andrej Karpathy’s Let’s reproduce GPT-2 (124M) where he goes through everything.

Accelerate (BLAS)

The first macOS library for machine learning that I want to talk about is the Accelerate framework. Accelerate is really an umbrella framework for a handful of smaller libraries. It’s a critical library in macOS but one that you can spend a whole career without directly using. It’s been around since Mac OS X 10.2 Jaguar as separate vecLib, BLAS and LAPACK libraries, which were then unified into Accelerate in Mac OS X 10.3 Panther. Remember the big cat names? Fun times.

In a general sense, Accelerate contains reusable algorithms optimized for SIMD vector instructions. In Swift, we don’t strictly need the Accelerate framework for SIMD vectorization (in the previous article, I used Relaxed.multiplyAdd and Swift’s autovectorization to get excellent SIMD vectorization) but it’s still really helpful to use Accelerate when you don’t want to stare at your own assembly.
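To make "reusable algorithms" concrete, here is a minimal sketch (not code from this series) of the dot product that sits in the inner loop of a matrix multiplication, written once as a plain loop and once with Accelerate's vDSP_dotpr. The array names and values are made up for illustration.

```swift
import Accelerate

// Hypothetical sample data, chosen only for illustration.
let a: [Float] = [1, 2, 3, 4]
let b: [Float] = [5, 6, 7, 8]

// Hand-written loop: how fast this runs depends on what the compiler
// manages to vectorize.
var loopResult: Float = 0
for i in 0..<a.count {
    loopResult += a[i] * b[i]
}

// Accelerate: vDSP_dotpr computes the same sum of products.
// The two `1` arguments are the strides through `a` and `b`.
var dotResult: Float = 0
vDSP_dotpr(a, 1, b, 1, &dotResult, vDSP_Length(a.count))

print(loopResult, dotResult)
```

For four elements there is nothing to gain, of course. The point is only that the call site stays this short regardless of how the work is vectorized underneath.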

As an example, I recently added rendering to a simple PDF parser library and used the following code based on Accelerate’s vImage to apply an image mask:

    if let matte {
        guard var matteBuffer = try? vImage_Buffer(width: width, height: height, bitsPerPixel: 8) else { return nil }
        defer { matteBuffer.free() }
        vImageBufferFill_ARGB8888(&matteBuffer, [1, matte.r, matte.g, matte.b], vImage_Flags(kvImageNoFlags))
        vImageAlphaBlend_ARGB8888(&baseBuffer, &matteBuffer, &baseBuffer, vImage_Flags(kvImageNoFlags))
    }

which ended up being about 5 times faster than the raw pixel iteration that I was using before (and about 20 times faster in Debug builds).

You might think you could do this by drawing into a CGContext, and you’d be correct but guess what that uses internally? Same functions. All I’m doing here is cutting out the middleman and giving myself a little more direct control.

Getting back to the matrix multiplication topic from the previous article, Accelerate offers us its BLAS sgemm implementation. BLAS stands for “Basic Linear Algebra Subprograms” and sgemm stands for “Single precision GEneral Matrix Multiplication”. Having someone else optimize matrix multiplication is good but Accelerate BLAS offers another key advantage: it lets us access the Apple Silicon AMX unit without the ugly hacks that I needed in the previous article.
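If you haven't called sgemm before, the argument list is the hard part. It computes C = alpha * op(A) * op(B) + beta * C, where op optionally transposes its argument. The following is a minimal sketch with made-up 2×3 and 3×2 row-major matrices (not code from this series) that labels each argument. Depending on your SDK, you may see a deprecation warning pointing at a newer CBLAS interface; I'm assuming the classic Int32-based interface here.

```swift
import Accelerate

// Hypothetical sizes and values, for illustration only.
let M = 2, N = 2, K = 3
let a: [Float] = [1, 2, 3,
                  4, 5, 6]        // M x K, row-major
let b: [Float] = [7, 8,
                  9, 10,
                  11, 12]         // K x N, row-major
var c = [Float](repeating: 0, count: M * N)   // M x N, row-major

cblas_sgemm(
    CblasRowMajor,                // storage order of all three matrices
    CblasNoTrans, CblasNoTrans,   // op(A) = A, op(B) = B
    Int32(M), Int32(N), Int32(K), // result is M x N, shared dimension is K
    1,                            // alpha
    a, Int32(K),                  // A and its leading dimension (row length)
    b, Int32(N),                  // B and its leading dimension
    0,                            // beta: 0 means the old contents of C are ignored
    &c, Int32(N))                 // C and its leading dimension

print(c)
```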

To see how it works, let’s consider the basic (naïve) matrix multiplication kernel in Swift from last time:

    static func matmul_forward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?, B: Int, T: Int, C: Int, OC: Int) {
        for b in 0..<B {
            for t in 0..<T {
                let bt = b * T + t
                for o in 0..<OC {
                    // Start from the bias (if any), then accumulate the dot product
                    // of input row `bt` and weight row `o`
                    var val = bias?[o] ?? 0
                    for i in 0..<C {
                        val += inp[bt * C + i] * weight[o * C + i]
                    }
                    out[bt * OC + o] = val
                }
            }
        }
    }

Doing the same thing with BLAS looks like this:

    static func matmul_forward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?, B: Int, T: Int, C: Int, OC: Int) {
        // out = weight * inp (transposes and layout handled by the sgemm arguments)
        cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans, Int32(OC), Int32(B * T), Int32(C), 1, weight, Int32(C), inp, Int32(C), 0, &out, Int32(OC))

        guard var bias else { return }

        // Add the bias vector to each of the B * T output rows
        out.withUnsafeMutableBufferPointer { outBuffer in
            guard let outBase = outBuffer.baseAddress else { return }
            for bt in 0..<(B * T) {
                cblas_saxpy(Int32(OC), 1, &bias, 1, outBase.advanced(by: bt * OC), 1)
            }
        }
    }

The matmul_forward function is almost exactly the same as a typical sgemm function, just fused with an additional bias step.
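The CblasColMajor and CblasTrans arguments deserve a second look, since all the arrays here are row-major. My reading is this: a row-major (B * T) × C array, reinterpreted as column-major, is the C × (B * T) transpose of the input, and the row-major OC × C weights become their C × OC transpose. Transposing the weights back and multiplying gives an OC × (B * T) column-major result, which occupies memory exactly like the row-major (B * T) × OC output. Here is a minimal sketch that checks this against the plain loops, using tiny made-up sizes and omitting the bias step:

```swift
import Accelerate

// Hypothetical tiny sizes and values, for illustration only.
let BT = 2, C = 3, OC = 2
let inp: [Float] = [1, 2, 3,
                    4, 5, 6]      // BT x C, row-major
let weight: [Float] = [1, 0, 1,
                       0, 1, 0]   // OC x C, row-major

// Reference result from the plain loops (no bias).
var expected = [Float](repeating: 0, count: BT * OC)
for bt in 0..<BT {
    for o in 0..<OC {
        var val: Float = 0
        for i in 0..<C {
            val += inp[bt * C + i] * weight[o * C + i]
        }
        expected[bt * OC + o] = val
    }
}

// Same computation with the argument pattern used in matmul_forward.
var out = [Float](repeating: 0, count: BT * OC)
cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans,
            Int32(OC), Int32(BT), Int32(C),
            1, weight, Int32(C),
            inp, Int32(C),
            0, &out, Int32(OC))

print(out == expected)
```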

Ignoring all the other optimizations we needed in the previous article, just using cblas_sgemm in the 9 places in the “Basic Swift” implementation where it applies, gives us:

Model             Tokens/s   Training iterations/s
Basic Swift       0.054      0.014
AMX               5.884      1.678
Accelerate BLAS   8.086      2.015

This one change (in 9 places) made the “Basic Swift”...
