Running GLM-5.2 on a 64GB Mac, barely


I tried to run GLM-5.2 on a 64GB Mac

Field notes from an experimental ds4 fork, a 244GB GGUF, and the small horror of sparse models that are sparse in compute but not very friendly to filesystems.

Andrea Borio Jun 23, 2026

I have a weakness for local LLM experiments that sound slightly unreasonable when said out loud. This one was: can GLM-5.2, a very large sparse MoE model, be made to run on a 64GB Apple Silicon Mac without turning the machine into a swap-powered space heater?

Not run well. Not run like a proper inference server. Just run in a way that is measurable, repeatable, and not based on pretending that a 244GB quantized model somehow fits into 64GB of unified memory.

The current answer is:

It runs. Barely. Around 2 tokens/s in my warm-cache smokes, with no process swap.

That is not a product experience. It is also not nothing. The more interesting result is not the speed. The interesting result is where the bottleneck moved once the model stopped immediately falling over.

The setup, in human terms

The runtime is based on antirez/ds4, but this GLM-5.2 work is in my experimental fork and branch: andreaborio/ds4, wip/glm52-metal64-strict-probe. The model artifact is here: andreaborio/glm52-ds4-native-64g-q2k-experimental. The base model is zai-org/GLM-5.2.

The GGUF was produced by our glm-dsa quantization path on AWS from the public GLM-5.2 safetensors. It is ds4-native. I do not expect this file to be a generic GGUF that llama.cpp or other runtimes can just load.

The machine was a 64GB Apple Silicon MacBook Pro. The model file is about 244 GiB. The routed MoE part alone is about 224 GiB.

So the whole experiment is basically: keep the parts that must be resident in memory, stream the routed experts carefully, and try not to make macOS angry.

Two caveats before anyone yells at me
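As a quick aside on the sizes quoted in the setup, the arithmetic can be written down in a few lines. This is only a back-of-the-envelope sketch: it assumes the non-routed remainder of the file is roughly what has to stay resident, which the setup implies but does not state as a number, and it treats the 64GB of unified memory as 64 GiB.

```python
# Approximate sizes quoted in the setup, in GiB.
model_file_gib = 244.0   # whole ds4-native GGUF
routed_moe_gib = 224.0   # routed MoE experts only
unified_mem_gib = 64.0   # assumption: "64GB" of RAM treated as 64 GiB

# Whatever is not routed experts (attention, shared layers, embeddings, ...).
non_routed_gib = model_file_gib - routed_moe_gib

# Share of the file that is a candidate for streaming rather than residency.
routed_share = routed_moe_gib / model_file_gib

# How many times larger the file is than the machine's memory.
oversubscription = model_file_gib / unified_mem_gib

print(f"non-routed part:  {non_routed_gib:.0f} GiB")
print(f"routed share:     {routed_share:.1%}")
print(f"file / memory:    {oversubscription:.2f}x")
```

Whatever memory is left after the non-routed part also has to cover the OS, the runtime's own buffers, and any resident expert cache, so the real headroom is smaller than the subtraction suggests.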

First: when I say “Metal-required mixed backend path”, I do not mean “every operation is a pure GPU kernel”. The measured command uses Metal and an internal strict guard that refuses the run if the required Metal-backed path is not active. But the branch still has host-side scheduling, cache management, and file reads. This is not an upstream-ready “all GLM ops are GPU-backed” implementation.

Second: when I report “0 block I/O”, that is the time(1) process counter on these warm runs. It is not a cold-cache, device-level SSD traffic measurement. The runtime logs also report miss_pread, which is logical data requested through the expert-loader pread path. Logical pread bytes are not the same thing as “the SSD physically read exactly this many bytes from NAND”.

That distinction matters. This experiment is about file layout, cache policy, and runtime scheduling. It is not yet a storage benchmark.

The weird part
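To make the second caveat concrete, here is a minimal Python sketch of the two kinds of counters side by side. It is not the ds4 loader: `read_slices` and `demo` are hypothetical helpers, and the process counters come from `getrusage`, which is the kind of per-process accounting that time(1) typically reports. On a warm page cache the block-input delta can be zero even though the logical pread byte count is not.

```python
import os
import resource
import tempfile

def process_counters():
    """Per-process counters similar to what time(1) prints."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "block_in": ru.ru_inblock,  # block input operations charged to the process
        "swaps": ru.ru_nswap,       # process swaps
        "max_rss": ru.ru_maxrss,    # bytes on macOS, kilobytes on Linux
    }

def read_slices(path, slices):
    """pread each (offset, length) slice; return the logical bytes read."""
    logical = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for offset, length in slices:
            logical += len(os.pread(fd, length, offset))
    finally:
        os.close(fd)
    return logical

def demo():
    # Hypothetical 4 MiB stand-in for a model file; just written, so likely cached.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * (4 << 20))
        path = f.name
    try:
        before = process_counters()
        logical = read_slices(path, [(0, 1 << 20), (2 << 20, 1 << 20)])
        after = process_counters()
    finally:
        os.unlink(path)
    return logical, after["block_in"] - before["block_in"]

logical_bytes, block_in_delta = demo()
print(f"logical pread bytes: {logical_bytes}, block input ops: {block_in_delta}")
```

Neither number says what the SSD physically did. A device-level measurement would need a cold cache and storage-side tooling, which is outside this sketch.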

Sparse MoE models are easy to describe in the happy version. For each token, you do not use all the experts. You route to a subset. Less compute. Great.

But a file-backed runtime has a different problem: are the things you need near each other on disk, or are you constantly jumping around? For this GLM-5.2 GGUF, the answer is: the useful expert data is scattered in a pretty annoying way.

The selected hot expert set I tested contains about 30 GiB of useful expert slices. In the source GGUF, those slices span about 240 GiB of file offsets. The cartoon version is even simpler:

An expert triplet is about 11.8 MiB of useful data, but its gate/up/down tensors can be spread across about 2 GiB of file span.

That is the sentence that made the experiment click for me. GLM-5.2 is computationally sparse, but from the point of view of this streaming path it is also I/O-diffuse. You save arithmetic, then pay in locality.

The current recipe
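The "useful bytes versus file span" idea behind the I/O-diffuse observation can be sketched in Python. The offsets below are invented for illustration and are not taken from the actual GGUF: three equal slices totalling about 11.8 MiB, placed so that they cover a 2 GiB span, mimicking the cartoon version of one expert triplet.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def locality(slices):
    """slices: list of (file_offset, length) in bytes.

    Returns (useful_bytes, span_bytes, span_per_useful_byte).
    """
    useful = sum(length for _, length in slices)
    start = min(offset for offset, _ in slices)
    end = max(offset + length for offset, length in slices)
    span = end - start
    return useful, span, span / useful

# Hypothetical layout: gate, up, and down tensors of one expert.
slice_len = int(11.8 * MIB / 3)
triplet = [
    (0, slice_len),                    # gate
    (1 * GIB, slice_len),              # up
    (2 * GIB - slice_len, slice_len),  # down, ending exactly at the 2 GiB mark
]

useful, span, ratio = locality(triplet)
print(f"useful: {useful / MIB:.1f} MiB, span: {span / GIB:.1f} GiB, ratio: {ratio:.0f}x")
```

With these made-up offsets the span is well over a hundred times the useful data. For the whole hot set the quoted figures give a gentler ratio, about 240 GiB of span for 30 GiB of useful slices, or roughly 8x.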

The best current recipe is boring in a good way: a flat selected-id hotlist. Record the experts the model actually selects. Keep a bounded hot set resident. Stream misses narrowly. Do not get clever too early.

The best numbers below also use an optional sidecar pack. This sidecar is not a new model and not a reordered GGUF. It is a compact copy of the hottest expert slices for this experimental ds4 path. Think of it as a locality probe: “what happens if the hot expert slices are packed in a friendlier shape?” The answer is: it helps, but not dramatically.

On my machine, with the sidecar:

n=8: 2.52 tokens/s

n=32: 2.28 tokens/s

n=64: 2.01 tokens/s

All three warm runs stayed around 32.25GB max RSS and reported 0 process swaps. Again, this is not “fast”. This is “alive”.

The fair comparison
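For readers who want the flat selected-id hotlist recipe in code form, here is a minimal Python sketch. It is not the ds4 implementation. It assumes the hot set is simply the most frequently selected expert ids from a recorded trace, that the set is static with no eviction, and that every miss streams one fixed-size slice. The function names, the `(layer, expert)` id shape, and the toy trace are all hypothetical.

```python
from collections import Counter

MIB = 1024 ** 2

def build_hotlist(selected_ids, capacity):
    """Record step: keep the `capacity` most frequently selected expert ids."""
    counts = Counter(selected_ids)
    return {expert_id for expert_id, _ in counts.most_common(capacity)}

def replay(selected_ids, hotlist, slice_bytes):
    """Replay a trace against a resident hot set and count hits and misses."""
    hits = sum(1 for expert_id in selected_ids if expert_id in hotlist)
    misses = len(selected_ids) - hits
    total = hits + misses
    return {
        "hits": hits,
        "misses": misses,
        "hit_rate": hits / total if total else 0.0,
        # Logical bytes a narrow miss path would request, not device traffic.
        "miss_bytes": misses * slice_bytes,
    }

# Toy trace of (layer, expert) selections; real traces are far longer.
trace = [(0, 1), (0, 1), (0, 2), (1, 5), (0, 1), (1, 5), (2, 7)]
hotlist = build_hotlist(trace, capacity=2)
stats = replay(trace, hotlist, slice_bytes=int(11.8 * MIB))
print(sorted(hotlist), f"hit rate {stats['hit_rate']:.2f}", stats["miss_bytes"], "miss bytes")
```

A sidecar pack in this picture would not change the hit rate or the miss bytes. It would only change where the hot slices live on disk, which is why those two numbers are the ones to hold constant when comparing runs with and without it.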

The cleanest paired comparison is at n=32, using the same binary and the same flat selected-id hotlist. Without the sidecar, generation was 2.18 tokens/s. With the sidecar, generation was 2.28 tokens/s.

The cache hit rate stayed the same. The logical pread bytes stayed the same. The gain came from reducing some loader locality overhead, especially begin_load_total.

So I would not headline this as a heroic 1.98 ->...
