Squish: fast local LLMs on Apple Silicon - Squish
Skip to content
Initializing search
konjoai/squish
Installation
Quickstart
API Reference
Benchmarks
Guides
Architecture
Integrations
Reference
Contributing
Blog
Categories
v9.34.8 · Apple Silicon native
Squish<br>The Local AI<br>Agent Runtime.
Run any AI model, fully local, on Apple Silicon. Squish loads models in under a second—54× faster than the standard path—and serves them faster than Ollama. No cloud, no API keys, fully offline.
Install Free
GitHub
Free for personal use
macOS M1–M5
Runs fully offline
squish
The local AI agent runtime
terminal — squish
Install once
brew install konjoai/squish/squish
✓ squish 9.34.8 installed
One command does everything
squish run qwen2.5:7b
↓ Pulling qwen2.5:7b 4.0 GB
✓ Model ready 0.43s
✓ Chat open at http://localhost:11435
150ms<br>TTFT (10k Context Loop)
100%<br>Perfect JSON (FSM Masking)
73%<br>Smaller Model Disk Size
4x<br>More Context via INT4 Cache
Drop-in<br>OpenAI & Ollama API
Getting started
Up and running in two steps
Install once. Then squish run handles pull, compress, serve, and opens your chat UI automatically.
Install Squish
One Homebrew command. No Docker, no CUDA, no virtual environment setup.
brew install konjoai/squish/squish
→
Run a Model
Downloads the pre-optimised model if needed, loads in milliseconds, opens your chat UI in the browser.
squish run qwen2.5:7b
squish serve is an alias for squish run — use whichever feels right.
Your data never<br>leaves your Mac
Every inference runs on your hardware, in your memory. No telemetry on conversations, no API quotas, no usage bills. Fast, private AI you own outright.
Runs 100% locally<br> Works fully offline<br> Zero conversation logging<br> No API keys needed
No network round-trip
Everything runs on-device — no API rate limits, no per-token billing, no data leaving your Mac.
73% smaller on disk
INT4 compression turns a 16 GB BF16 8B model into 4.4 GB. Run two models where you used to fit one.
Accuracy within measurement noise
Calibrated quantisation holds benchmark accuracy to ≤1.5 pp across ARC-Easy, HellaSwag, WinoGrande, and PIQA at the tested sample size.
Gets better every release
Squish ships 100+ composable optimisation modules. Each release improves TTFT and decode throughput, applied automatically.
Features
Built for speed at every layer
From storage format to HTTP serving, every decision is optimised for Apple Silicon unified memory.
Sub-second cold start.
Memory-mapped INT4 tensors load directly into Metal unified memory with zero dtype conversion. A 1.5B model is ready in 0.33–0.53 s — versus 28.8 s for the standard loader, on 160 MB of RAM.
54× faster load · M3
Drop-in for OpenAI. Any OpenAI SDK.
Zero code changes. LangChain, LlamaIndex, OpenAI SDK, Cursor, and any tool that speaks /v1/chat/completions works out of the box.
/v1/chat/completions
10x faster on repeat prompts.
Agents resend the same long system prompt every turn. Squish's two-cache architecture reuses the prefill instead of re-running it—so a repeated prompt skips straight to decode.
4–11 ms TTFT on a cache hit
Zero broken JSON.
Small models hallucinate syntax. Squish uses engine-level Finite State Machine (FSM) masking to constrain every token to valid JSON matching your schema. Agents never crash a parser again.
Zero JSONDecodeErrors
4x more context. Same RAM.
A 32k context window normally pushes a 16 GB Mac into swap. Squish's Asymmetric INT4 KV Cache shrinks the KV footprint by 75%, keeping all context hot in unified memory.
4x Context Capacity
Batch inference. One request.
Process multiple prompts in a single request. Essential for evals, data pipelines, and bulk generation—a capability Ollama and LM Studio don't offer.
"batch": [req1, req2, …]
Comparison
Why Squish beats the rest
Real measurements, same hardware. Apple M3 MacBook Pro, 16 GB — thermally controlled.
Metric<br>Ollama<br>LM Studio<br>Squish ✶
Cold start — load + first token<br>20–30 s<br>~18–28 s<br>0.5 s ✶
Decode throughput — 7B<br>20.3 tok/s<br>24.0 tok/s ✶
Inter-token tail latency (p95)<br>52 ms<br>43 ms ✶
Full response — 4000-token prompt<br>37.5 s<br>3.8 s 9.8× ✶
Peak RAM — serving<br>5.1 GB<br>3.5 GB ✶
Disk size — 7B INT4<br>4.4 GB (GGUF Q4)<br>4.7 GB (GGUF Q4)<br>4.0 GB INT4 ✶
OpenAI API
Batch requests
Pre-optimised weights (HuggingFace)<br>✓ 9 prebuilt
Auto-open chat UI
Zero-copy mmap Metal load
Repeat-prompt TTFT (KV cache hit)<br>~160 ms<br>4–11 ms ✶
Guaranteed JSON Syntax (FSM)<br>✓ 100% Reliable
Context Window Compression<br>FP16 Only (High VRAM)<br>FP16 Only<br>INT4 (75% Less VRAM)
✶ M3 16 GB, thermally controlled. Cold start: Qwen2.5-1.5B. Serving (decode, tail,<br>E2E, RAM): Qwen2.5-7B INT3 vs Ollama 0.30.7. Squish v9.34.8. On a loaded model, single-token<br>TTFT is comparable (Ollama 167 ms / Squish 192 ms) — Squish’s edge is everywhere else.
Quick Start
Everything you need, right here
Install
Run a model
Chat UI
...