An LLM on a Sony PSP - Granda | Ideas & Code<br>2026-05-16 · 6 min read<br>A Sony PSP-2000 is a 333 MHz MIPS handheld from 2007 with 64 MB of RAM. Mine, this week, runs a 15-million-parameter Transformer and streams English text onto the LCD at one to two tokens per second.
The model is Karpathy’s stories15M (a TinyStories checkpoint), int8-quantized to about 17 MB. The runtime is ~1100 lines of pure C cross-compiled with the pspdev/pspdev Docker image. There is no Python, no libtorch, no helpful runtime on the device: the PSP is a single-process box that loads one EBOOT.PBP from the memory stick and gives you sceIo*, a framebuffer, and a VFPU. Everything else you build.<br>This post is the budget: where every byte goes, what the kernels look like, and what’s left on the table.<br>The hardware<br>CPU: MIPS Allegrex @ 333 MHz, in-order<br>FPU: scalar fp32 + a 4×4 VFPU (vector) coprocessor<br>RAM: 64 MB (PSP-2000/3000); 32 MB on the original PSP-1000<br>OS: XMB, no virtual memory, no mmap, no swap<br>Output: 480×272 LCD, no stdout the host can read<br>The “no mmap” line is the one that bites. On a Linux box you’d mmap the weights file and let the page cache handle it. On the PSP you have sceIoLseek + sceIoRead and a single malloc’d arena. You read all 17 MB into RAM before forward pass #1, or you stream from the memory stick at roughly the speed of a USB 1.1 thumb drive and watch your throughput collapse.<br>The PSP-1000’s 32 MB is not enough to leave heap room for the weights plus KV cache plus working buffers. The 2000 and 3000 ship with 64 MB. We need the 64.<br>The model<br>stories15M is one of the smallest of Karpathy’s TinyStories checkpoints: 6 transformer layers, hidden size 288, 6 attention heads, vocab 32000. About fifteen million parameters total. At fp32, ~57 MB. At int8 q80 (symmetric per-group quantization, group size 64, one fp32 scale per group), ~17 MB.<br>Architecture: Llama-style decoder, RoPE, SwiGLU FFN<br>Layers: 6<br>Hidden: 288<br>Heads: 6 (head_dim 48)<br>Vocab: 32000<br>Context: 256 tokens<br>Quantization: int8 q80 (group=64, symmetric)<br>On-disk size: 17 MB<br>The model prep is its own Docker image: python:3.11-slim + cpu-only torch + a pinned commit of karpathy/llama2.c. 
It downloads stories15M.pt, runs export.py --version 2 to produce the q80 model.bin, builds the BPE tokenizer.bin, and — important — also builds Karpathy’s runq.c reference with -ffp-contract=off -fno-fast-math and runs it on a fixed prompt to produce tests/expected.txt. That file is the byte-exact x86 reference the PSP gets diffed against. More on this in a minute.<br>The memory budget<br>24 MB of heap, declared once at module load:<br>PSP_HEAP_SIZE_KB(24576);
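As a rough sanity check on that figure, the sketch below redoes the arithmetic in plain C from the dimensions quoted earlier (6 layers, 256-token context, hidden size 288, vocab 32000, q80 groups of 64). The function names are made up for illustration and are not taken from the project's source; the only assumption baked in is a 4-byte float.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of the PSP runtime. */

/* Heap size in bytes for a value passed to PSP_HEAP_SIZE_KB. */
size_t heap_bytes_from_kb(size_t kb) {
    return kb * 1024;
}

/* KV cache: one fp32 vector of `hidden` floats for K and one for V,
 * per layer, per context position. Assumes sizeof(float) == 4. */
size_t kv_cache_bytes(size_t layers, size_t ctx, size_t hidden) {
    return layers * ctx * hidden * 2 * sizeof(float);
}

/* q80 tensor: one int8 per weight plus one fp32 scale per group. */
size_t q80_bytes(size_t n_weights, size_t group_size) {
    assert(group_size > 0 && n_weights % group_size == 0);
    return n_weights * sizeof(signed char)
         + (n_weights / group_size) * sizeof(float);
}

/* A table of vocab x hidden values held entirely in fp32. */
size_t fp32_table_bytes(size_t vocab, size_t hidden) {
    return vocab * hidden * sizeof(float);
}
```

With those dimensions, kv_cache_bytes(6, 256, 288) comes to 3,538,944 bytes, the q80 token embedding table to 9,792,000 bytes, and a full fp32 copy of that table to 36,864,000 bytes, which on its own is more than the 25,165,824-byte heap.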
The 24 MB is spent as:<br>Region | Size | Notes<br>Weights (int8 quantized) | ~17 MB | single malloc’d arena, slurped via sceIoLseek + sceIoRead chunks<br>KV cache | ~3.5 MB | 6 layers × 256 ctx × 288 hidden × 2 (K+V) × fp32<br>RunState working buffers | ~1 MB | activations, attention scores, sampled logits<br>Stack, libc, framework | ~2 MB | the PSPSDK’s overhead<br>Slack | ~0.5 MB |<br>The trick worth calling out: the token embedding table stays quantized in the arena. The naive port dequantizes it once at load time, which costs ~36 MB and immediately OOMs. Instead, on each forward pass we dequantize a single row, the one for the current token, into a small fp32 buffer. The cost is one extra dequant per forward pass; the win is ~36 MB we don’t have.<br>The kernels<br>transformer.c is the usual suspects: rmsnorm, softmax, quantize/dequantize, matmul, RoPE, attention, SwiGLU, sampler. Each one is the textbook version with -ffp-contract=off forced, so the compiler never fuses a multiply and an add into one operation and the floating-point results match runq.c on x86. That matters for the test surface (see below).<br>The matmul today is scalar fp32: three nested loops, one fp32 multiply-add at a time. On real hardware it gets ~1–2 tok/s. That’s slow enough that a 64-token completion takes between half a minute and a minute.<br>The matmul is also factored out behind a swappable function pointer. The v1 plan is a VFPU kernel that uses the 4×4 vector ops, which should hit ~5–15 tok/s on the same hardware. (The VFPU is the one piece of PSP hardware that ages well: a vector coprocessor with 128 registers addressable as eight 4×4 matrices, capable of dispatching a 4×4 matrix multiply in a single instruction.) That’s a one-file change to drop in.<br>The UI<br>The PSP has a system on-screen keyboard you invoke via sceUtilityOsk*. It returns text as UTF-16LE; you convert it to UTF-8 (BMP only, since the PSP’s OSK doesn’t reach into surrogate pairs) and feed it to the BPE tokenizer.<br>The chat UI is pspDebugScreen, the PSP’s built-in debug font on the framebuffer. Monospace, 8×8 pixels, 60 columns × 34 rows on the 480×272 display. 
Two-color layout: the prompt at the top, generated tokens streaming below it character by character. When the buffer hits the bottom of the screen the rendering wraps. It’s not pretty, but it’s legible, and every character on the screen is something the model actually emitted.<br>The...