Training an LLM in Swift, Part 2: macOS built-in frameworks


Training an LLM in Swift, Part 2: macOS built-in frameworks | Cocoa with Love

June 8, 2026
by Matt Gallagher

Covering the Accelerate, BNNS, CoreML and MPS implementations in macOS.

Tags: swift, llm, machine-learning, performance, metal

In this article, I'm going to look at some of the frameworks that are built into macOS for numerical algorithms. Fitting with the theme of this series, I'll mostly be looking at frameworks that can also train an ML model. But there are a lot of different approaches (Accelerate BLAS, BNNS, CoreML, MPSGraph) and the real challenge is knowing which one to use, if they are even usable for training.

What on earth is "fchGelu"?

As with last time, I'll be talking about code in examples but I won't really be explaining the mechanics of LLMs or the terminology. I'm here to talk about the macOS frameworks, not the models. I know, it's pretty cryptic but this isn't a beginner's course. If you want to learn more about the terminology, try watching Andrej Karpathy's "Let's reproduce GPT-2 (124M)", where he goes through everything.

Accelerate (BLAS)

The first macOS library for machine learning that I want to talk about is the Accelerate framework. Accelerate is really an umbrella framework for a handful of smaller libraries. It's a critical library in macOS but one that you can spend a whole career without using directly. It's been around since Mac OS X 10.2 Jaguar as separate vecLib, BLAS and LAPACK libraries, which were then unified into Accelerate in Mac OS X 10.3 Panther. Remember the big cat names? Fun times.

In a general sense, Accelerate contains reusable algorithms optimized for SIMD vector instructions. In Swift, we don't strictly need the Accelerate framework for SIMD vectorization (in the previous article, I used Relaxed.multiplyAdd and Swift's autovectorization to get excellent SIMD vectorization) but it's still really helpful to use Accelerate when you don't want to stare at your own assembly.
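To make that concrete, here is a minimal sketch of handing a dot product (the inner loop of a matrix multiply) to Accelerate's vDSP instead of writing the loop by hand. It is not code from this series: dotProduct is a hypothetical helper name, and the plain-Swift fallback is an assumption added only so the sketch still compiles on platforms without Accelerate.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

// Hypothetical helper: dot product of two equal-length Float arrays.
func dotProduct(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    #if canImport(Accelerate)
    // vDSP picks the vectorized implementation for us.
    return vDSP.dot(a, b)
    #else
    // Fallback for platforms without Accelerate (not SIMD-tuned).
    return zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    #endif
}

let a: [Float] = [1, 2, 3, 4]
let b: [Float] = [0.5, 0.5, 0.5, 0.5]
print(dotProduct(a, b)) // 5.0
```

The call site stays a one-liner, and the choice of vector instructions is left to the framework rather than to the compiler's autovectorizer.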

As a real-world example, I recently added rendering to a simple PDF parser library and used the following code based on Accelerate's vImage to apply an image mask:

    if let matte {
        guard var matteBuffer = try? vImage_Buffer(width: width, height: height, bitsPerPixel: 8) else { return nil }
        defer { matteBuffer.free() }
        vImageBufferFill_ARGB8888(&matteBuffer, [1, matte.r, matte.g, matte.b], vImage_Flags(kvImageNoFlags))
        vImageAlphaBlend_ARGB8888(&baseBuffer, &matteBuffer, &baseBuffer, vImage_Flags(kvImageNoFlags))
    }

which ended up being about 5 times faster than the raw pixel iteration that I was using before (and about 20 times faster in Debug builds).

You might think you could do this by drawing into a CGContext, and you'd be correct, but guess what that uses internally? Same functions. All I'm doing here is cutting out the middleman and giving myself a little more direct control.

Getting back to the matrix multiplication topic from the previous article, Accelerate offers us its BLAS sgemm implementation. BLAS stands for "Basic Linear Algebra Subprograms" and sgemm stands for "Single precision GEneral Matrix Multiplication". Having someone else optimize matrix multiplication is good but Accelerate BLAS offers another key advantage: it lets us access the Apple Silicon AMX unit without the ugly hacks that I needed in the previous article.
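For reference, the operation behind any sgemm call can be written as follows. This is the standard BLAS definition rather than anything specific to Accelerate, and the symbols M, N, K, α and β are the conventional BLAS parameter names, not names used in this series:

```latex
C \leftarrow \alpha \,\mathrm{op}(A)\,\mathrm{op}(B) + \beta\, C,
\qquad \mathrm{op}(X) \in \{X,\; X^{\mathsf{T}}\},
\qquad \mathrm{op}(A) \in \mathbb{R}^{M \times K},\;
       \mathrm{op}(B) \in \mathbb{R}^{K \times N},\;
       C \in \mathbb{R}^{M \times N}
```

With α = 1 and β = 0 this reduces to a plain matrix product that overwrites C.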

To see how it works, let's consider the basic (naïve) matrix multiplication kernel in Swift from last time:

    static func matmul_forward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?, B: Int, T: Int, C: Int, OC: Int) {
        for b in 0..<B {
            for t in 0..<T {
                let bt = b * T + t
                for o in 0..<OC {
                    var val = bias?[o] ?? 0
                    for i in 0..<C {
                        val += inp[bt * C + i] * weight[o * C + i]
                    }
                    out[bt * OC + o] = val
                }
            }
        }
    }

Doing the same thing with BLAS looks like this:

    static func matmul_forward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?, B: Int, T: Int, C: Int, OC: Int) {
        // out (B*T x OC, row-major) = inp (B*T x C) * transposed weight (C x OC)
        cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans, Int32(OC), Int32(B * T), Int32(C), 1, weight, Int32(C), inp, Int32(C), 0, &out, Int32(OC))

        guard var bias else { return }

        // Add the bias vector to each row of the output
        out.withUnsafeMutableBufferPointer { outBuffer in
            guard let outBase = outBuffer.baseAddress else { return }
            for bt in 0..<(B * T) {
                cblas_saxpy(Int32(OC), 1, &bias, 1, outBase.advanced(by: bt * OC), 1)
            }
        }
    }

The matmul_forward function is almost exactly the same as a typical sgemm function, just fused with an additional bias step.
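One way to sanity-check the argument order in the cblas_sgemm call is to write out the index arithmetic it implies and compare it with the naïve loops. The following is a minimal sketch in plain Swift (no Accelerate required), assuming the standard column-major BLAS conventions. referenceSgemmTN and naiveMatmul are hypothetical helpers written for this illustration, the reference covers only the CblasColMajor, CblasTrans, CblasNoTrans case, and the bias step is left out:

```swift
// Hypothetical reference for sgemm with (ColMajor, Trans, NoTrans):
// C (m x n, column-major) = alpha * transpose(A) * B + beta * C
// A is stored as k x m column-major (lda), B as k x n column-major (ldb).
func referenceSgemmTN(m: Int, n: Int, k: Int, alpha: Float,
                      a: [Float], lda: Int, b: [Float], ldb: Int,
                      beta: Float, c: inout [Float], ldc: Int) {
    for col in 0..<n {
        for row in 0..<m {
            var sum: Float = 0
            for p in 0..<k {
                // transpose(A)[row, p] == A[p, row] == a[p + row * lda]
                sum += a[p + row * lda] * b[p + col * ldb]
            }
            c[row + col * ldc] = alpha * sum + beta * c[row + col * ldc]
        }
    }
}

// The naive row-major kernel from above, with B * T flattened into BT.
func naiveMatmul(inp: [Float], weight: [Float], BT: Int, C: Int, OC: Int) -> [Float] {
    var out = [Float](repeating: 0, count: BT * OC)
    for bt in 0..<BT {
        for o in 0..<OC {
            var val: Float = 0
            for i in 0..<C { val += inp[bt * C + i] * weight[o * C + i] }
            out[bt * OC + o] = val
        }
    }
    return out
}

let BT = 2, C = 3, OC = 2
let inp: [Float] = [1, 2, 3, 4, 5, 6]      // BT x C, row-major
let weight: [Float] = [1, 0, 1, 0, 1, 0]   // OC x C, row-major

// Same argument order as the cblas_sgemm call: m = OC, n = BT, k = C.
var viaSgemm = [Float](repeating: 0, count: BT * OC)
referenceSgemmTN(m: OC, n: BT, k: C, alpha: 1, a: weight, lda: C,
                 b: inp, ldb: C, beta: 0, c: &viaSgemm, ldc: OC)
let viaLoops = naiveMatmul(inp: inp, weight: weight, BT: BT, C: C, OC: OC)
print(viaSgemm == viaLoops) // true
```

The trick is that a row-major B*T × OC output occupies exactly the same memory as a column-major OC × B*T matrix. Asking for weight times transposed inp in column-major order therefore writes inp times transposed weight in row-major order, with no copies or explicit transposes.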

Ignoring all the other optimizations we needed in the previous article, just using cblas_sgemm in the 9 places in the "Basic Swift" implementation where it applies gives us:

| Model | Tokens/s | Training iterations/s |
| --- | --- | --- |
| Basic Swift | 0.054 | 0.014 |
| AMX | 5.884 | 1.678 |
| Accelerate BLAS | 8.086 | 2.015 |

This one change (in 9 places) made the "Basic Swift"...
