Squish – The fastest way to run local LLMs on Apple Silicon

jscohn852 pts0 comments

Squish: fast local LLMs on Apple Silicon - Squish

Skip to content

Initializing search

konjoai/squish

Installation

Quickstart

API Reference

Benchmarks

Guides

Architecture

Integrations

Reference

Contributing

Blog

Categories

v9.34.8 · Apple Silicon native

Squish<br>The Local AI<br>Agent Runtime.

Run any AI model, fully local, on Apple Silicon. Squish loads models in under a second—54&times; faster than the standard path—and serves them faster than Ollama. No cloud, no API keys, fully offline.

Install Free

GitHub

Free for personal use

macOS M1–M5

Runs fully offline

squish

The local AI agent runtime

terminal — squish

Install once

brew install konjoai/squish/squish

✓ squish 9.34.8 installed

One command does everything

squish run qwen2.5:7b

↓ Pulling qwen2.5:7b 4.0 GB

✓ Model ready 0.43s

✓ Chat open at http://localhost:11435 

150ms<br>TTFT (10k Context Loop)

100%<br>Perfect JSON (FSM Masking)

73%<br>Smaller Model Disk Size

4x<br>More Context via INT4 Cache

Drop-in<br>OpenAI & Ollama API

Getting started

Up and running in two steps

Install once. Then squish run handles pull, compress, serve, and opens your chat UI automatically.

Install Squish

One Homebrew command. No Docker, no CUDA, no virtual environment setup.

brew install konjoai/squish/squish

&rarr;

Run a Model

Downloads the pre-optimised model if needed, loads in milliseconds, opens your chat UI in the browser.

squish run qwen2.5:7b

squish serve is an alias for squish run — use whichever feels right.

Your data never<br>leaves your Mac

Every inference runs on your hardware, in your memory. No telemetry on conversations, no API quotas, no usage bills. Fast, private AI you own outright.

 Runs 100% locally<br> Works fully offline<br> Zero conversation logging<br> No API keys needed

No network round-trip

Everything runs on-device — no API rate limits, no per-token billing, no data leaving your Mac.

73% smaller on disk

INT4 compression turns a 16 GB BF16 8B model into 4.4 GB. Run two models where you used to fit one.

Accuracy within measurement noise

Calibrated quantisation holds benchmark accuracy to &le;1.5 pp across ARC-Easy, HellaSwag, WinoGrande, and PIQA at the tested sample size.

Gets better every release

Squish ships 100+ composable optimisation modules. Each release improves TTFT and decode throughput, applied automatically.

Features

Built for speed at every layer

From storage format to HTTP serving, every decision is optimised for Apple Silicon unified memory.

Sub-second cold start.

Memory-mapped INT4 tensors load directly into Metal unified memory with zero dtype conversion. A 1.5B model is ready in 0.33–0.53&thinsp;s — versus 28.8&thinsp;s for the standard loader, on 160&thinsp;MB of RAM.

54&times; faster load &middot; M3

Drop-in for OpenAI. Any OpenAI SDK.

Zero code changes. LangChain, LlamaIndex, OpenAI SDK, Cursor, and any tool that speaks /v1/chat/completions works out of the box.

/v1/chat/completions

10x faster on repeat prompts.

Agents resend the same long system prompt every turn. Squish's two-cache architecture reuses the prefill instead of re-running it—so a repeated prompt skips straight to decode.

4–11 ms TTFT on a cache hit

Zero broken JSON.

Small models hallucinate syntax. Squish uses engine-level Finite State Machine (FSM) masking to constrain every token to valid JSON matching your schema. Agents never crash a parser again.

Zero JSONDecodeErrors

4x more context. Same RAM.

A 32k context window normally pushes a 16&thinsp;GB Mac into swap. Squish's Asymmetric INT4 KV Cache shrinks the KV footprint by 75%, keeping all context hot in unified memory.

4x Context Capacity

Batch inference. One request.

Process multiple prompts in a single request. Essential for evals, data pipelines, and bulk generation—a capability Ollama and LM Studio don't offer.

"batch": [req1, req2, …]

Comparison

Why Squish beats the rest

Real measurements, same hardware. Apple M3 MacBook Pro, 16 GB — thermally controlled.

Metric<br>Ollama<br>LM Studio<br>Squish ✶

Cold start — load + first token<br>20–30 s<br>~18–28 s<br>0.5 s ✶

Decode throughput — 7B<br>20.3 tok/s<br>24.0 tok/s ✶

Inter-token tail latency (p95)<br>52 ms<br>43 ms ✶

Full response — 4000-token prompt<br>37.5 s<br>3.8 s 9.8&times; ✶

Peak RAM — serving<br>5.1 GB<br>3.5 GB ✶

Disk size — 7B INT4<br>4.4 GB (GGUF Q4)<br>4.7 GB (GGUF Q4)<br>4.0 GB INT4 ✶

OpenAI API

Batch requests

Pre-optimised weights (HuggingFace)<br>✓ 9 prebuilt

Auto-open chat UI

Zero-copy mmap Metal load

Repeat-prompt TTFT (KV cache hit)<br>~160 ms<br>4–11 ms ✶

Guaranteed JSON Syntax (FSM)<br>✓ 100% Reliable

Context Window Compression<br>FP16 Only (High VRAM)<br>FP16 Only<br>INT4 (75% Less VRAM)

✶ M3 16 GB, thermally controlled. Cold start: Qwen2.5-1.5B. Serving (decode, tail,<br>E2E, RAM): Qwen2.5-7B INT3 vs Ollama 0.30.7. Squish v9.34.8. On a loaded model, single-token<br>TTFT is comparable (Ollama 167 ms / Squish 192 ms) — Squish’s edge is everywhere else.

Quick Start

Everything you need, right here

 Install

 Run a model

 Chat UI

...

squish model install chat context int4

Related Articles