Running GLM-5.2 on a 64GB Mac, barely


I tried to run GLM-5.2 on a 64GB Mac - Andrea Borio

Andrea Borio


I tried to run GLM-5.2 on a 64GB Mac<br>Field notes from an experimental ds4 fork, a 244GB GGUF, and the small horror of sparse models that are sparse in compute but not very friendly to filesystems.

Andrea Borio<br>Jun 23, 2026


I have a weakness for local LLM experiments that sound slightly unreasonable when said out loud.<br>This one was: can GLM-5.2, a very large sparse MoE model, be made to run on a 64GB Apple Silicon Mac without turning the machine into a swap-powered space heater?<br>Not run well. Not run like a proper inference server. Just run in a way that is measurable, repeatable, and not based on pretending that a 244GB quantized model somehow fits into 64GB of unified memory.<br>The current answer is:<br>It runs. Barely. Around 2 tokens/s in my warm-cache smoke tests, with no process swapping.

That is not a product experience. It is also not nothing.<br>The more interesting result is not the speed. The interesting result is where the bottleneck moved once the model stopped immediately falling over.

The setup, in human terms

The runtime is based on antirez/ds4, but this GLM-5.2 work is in my experimental fork and branch: andreaborio/ds4, wip/glm52-metal64-strict-probe.<br>The model artifact is here: andreaborio/glm52-ds4-native-64g-q2k-experimental. The base model is zai-org/GLM-5.2.<br>The GGUF was produced by our glm-dsa quantization path on AWS from the public GLM-5.2 safetensors. It is ds4-native. I do not expect this file to be a generic GGUF that llama.cpp or other runtimes can just load.<br>The machine was a 64GB Apple Silicon MacBook Pro. The model file is about 244 GiB. The routed MoE part alone is about 224 GiB.<br>So the whole experiment is basically: keep the parts that must be resident in memory, stream the routed experts carefully, and try not to make macOS angry.

Two caveats before anyone yells at me

First: when I say “Metal-required mixed backend path”, I do not mean “every operation is a pure GPU kernel”.<br>The measured command uses Metal and an internal strict guard that refuses the run if the required Metal-backed path is not active. But the branch still has host-side scheduling, cache management, and file reads. This is not an upstream-ready “all GLM ops are GPU-backed” implementation.<br>Second: when I report “0 block I/O”, that is the time(1) process counter on these warm runs. It is not a cold-cache, device-level SSD traffic measurement.<br>The runtime logs also report miss_pread, which is logical data requested through the expert-loader pread path. Logical pread bytes are not the same thing as “the SSD physically read exactly this many bytes from NAND”.<br>That distinction matters. This experiment is about file layout, cache policy, and runtime scheduling. It is not yet a storage benchmark.

The weird part

Sparse MoE models are easy to describe in the happy version.<br>For each token, you do not use all the experts. You route to a subset. Less compute. Great.<br>But a file-backed runtime has a different problem: are the things you need near each other on disk, or are you constantly jumping around?<br>For this GLM-5.2 GGUF, the answer is: the useful expert data is scattered in a pretty annoying way.<br>The selected hot expert set I tested contains about 30 GiB of useful expert slices. In the source GGUF, those slices span about 240 GiB of file offsets.<br>The cartoon version is even simpler:<br>An expert triplet is about 11.8 MiB of useful data, but its gate/up/down tensors can be spread across about 2 GiB of file span.
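To make the span-versus-payload gap concrete, here is a minimal Python sketch that measures how densely a set of slices fills the file range it touches. The `locality` helper and the toy offsets are hypothetical illustrations, not taken from the ds4 fork or from the real GGUF layout. Only the 11.8 MiB and 2 GiB figures come from the numbers above.

```python
from typing import Iterable, Tuple

GIB = 1024 ** 3
MIB = 1024 ** 2


def locality(slices: Iterable[Tuple[int, int]]) -> Tuple[int, int, float]:
    """Return (useful_bytes, span_bytes, density) for (offset, length) slices.

    The span runs from the lowest start offset to the highest end offset.
    Density is useful / span, so 1.0 means one contiguous read.
    Slices are assumed not to overlap.
    """
    slices = list(slices)
    if not slices:
        raise ValueError("need at least one slice")
    useful = sum(length for _, length in slices)
    start = min(offset for offset, _ in slices)
    end = max(offset + length for offset, length in slices)
    span = end - start
    return useful, span, useful / span


# Hypothetical layout: the gate, up and down tensors of one expert,
# each a third of ~11.8 MiB, spread across a 2 GiB window of the file.
triplet_bytes = int(11.8 * MIB)
part = triplet_bytes // 3
toy_triplet = [(0, part), (1 * GIB, part), (2 * GIB - part, part)]

useful, span, density = locality(toy_triplet)
print(f"useful={useful / MIB:.1f} MiB span={span / GIB:.1f} GiB density={density:.4%}")
```

By the same arithmetic, the figures quoted above work out to a density of about 30/240 = 12.5% for the whole hot set, and well under 1% for a single triplet.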

That triplet figure is what made the experiment click for me.<br>GLM-5.2 is computationally sparse, but from the point of view of this streaming path it is also I/O-diffuse. You save arithmetic, then pay in locality.

The current recipe

The best current recipe is boring in a good way: a flat selected-id hotlist.<br>Record the experts the model actually selects. Keep a bounded hot set resident. Stream misses narrowly. Do not get clever too early.<br>The best numbers below also use an optional sidecar pack. This sidecar is not a new model and not a reordered GGUF. It is a compact copy of the hottest expert slices for this experimental ds4 path.<br>Think of it as a locality probe: “what happens if the hot expert slices are packed in a friendlier shape?”<br>The answer is: it helps, but not dramatically.<br>On my machine, with the sidecar:

n=8: 2.52 tokens/s<br>n=32: 2.28 tokens/s<br>n=64: 2.01 tokens/s
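For readers who prefer the recipe in code form, the flat selected-id hotlist can be sketched in a few lines of Python. This is an illustration under assumptions, not the fork's implementation: ranking by selection frequency, the `(layer, expert)` id shape, and the names `build_hotlist` and `hit_rate` are all hypothetical. The post itself only says that selected experts are recorded and a bounded hot set is kept resident.

```python
from collections import Counter
from typing import Hashable, Iterable, List

GIB = 1024 ** 3
MIB = 1024 ** 2


def build_hotlist(selected: Iterable[Hashable],
                  expert_bytes: int,
                  budget_bytes: int) -> List[Hashable]:
    """Return the most frequently selected expert ids that fit the budget.

    Assumes every expert costs the same number of resident bytes.
    """
    if expert_bytes <= 0:
        raise ValueError("expert_bytes must be positive")
    capacity = budget_bytes // expert_bytes
    counts = Counter(selected)
    return [expert_id for expert_id, _ in counts.most_common(capacity)]


def hit_rate(hotlist: Iterable[Hashable], trace: Iterable[Hashable]) -> float:
    """Fraction of selections in trace that are served from the hot set."""
    hot = set(hotlist)
    trace = list(trace)
    if not trace:
        return 0.0
    return sum(expert_id in hot for expert_id in trace) / len(trace)


# Hypothetical routing trace of (layer, expert) ids recorded during a run.
trace = [(0, 3), (0, 7), (0, 3), (1, 2), (0, 3), (1, 2)]
hot = build_hotlist(trace, expert_bytes=1, budget_bytes=2)
print(hot, hit_rate(hot, trace))

# Back-of-envelope capacity from the figures quoted earlier:
# a ~30 GiB hot set divided by ~11.8 MiB per expert triplet.
triplets_that_fit = (30 * GIB) // int(11.8 * MIB)
print(triplets_that_fit)
```

With the sizes quoted earlier, a 30 GiB budget holds roughly 2,600 triplets. Everything outside that set is a miss that has to be streamed from the file.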

All three warm runs stayed around 32.25GB max RSS and reported 0 process swaps.<br>Again, this is not “fast”. This is “alive”.

The fair comparison

The cleanest paired comparison is at n=32, using the same binary and the same flat selected-id hotlist.<br>Without the sidecar, generation was 2.18 tokens/s.<br>With the sidecar, generation was 2.28 tokens/s.<br>The cache hit rate stayed the same. The logical pread bytes stayed the same. The gain came from reducing some loader locality overhead, especially begin_load_total.<br>So I would not headline this as a heroic 1.98 ->...
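As a quick sanity check on the size of the paired n=32 gain, the two throughput figures can be turned into a relative speedup and a per-token time saving. The `compare` helper below is a hypothetical name used for illustration, and the only inputs are the 2.18 and 2.28 tokens/s figures quoted above.

```python
from typing import Tuple


def compare(base_tps: float, new_tps: float) -> Tuple[float, float]:
    """Return (speedup ratio, milliseconds saved per token)."""
    if base_tps <= 0 or new_tps <= 0:
        raise ValueError("throughput must be positive")
    speedup = new_tps / base_tps
    saved_ms = (1.0 / base_tps - 1.0 / new_tps) * 1000.0
    return speedup, saved_ms


speedup, saved_ms = compare(2.18, 2.28)
print(f"speedup={speedup:.3f}x saved={saved_ms:.1f} ms/token")
```

That is a gain of about 4.6%, or roughly 20 ms per generated token, which fits the description of the sidecar as helping but not dramatically.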

