An LLM on a Sony PSP


Granda | Ideas & Code
2026-05-16 · 6 min read

A Sony PSP-2000 is a 333 MHz MIPS handheld from 2007 with 64 MB of RAM. Mine, this week, runs a 15-million-parameter Transformer and streams English text onto the LCD at one to two tokens per second.

The model is Karpathy’s stories15M (a TinyStories checkpoint), int8-quantized to about 17 MB. The runtime is ~1100 lines of pure C cross-compiled with pspdev/pspdev in Docker. There is no Python, no libtorch, no helpful runtime on the device: the PSP is a single-process box that loads one EBOOT.PBP from the memory stick and gives you sceIo*, a framebuffer, and a VFPU. Everything else you build.

This post is the budget: where every byte goes, what the kernels look like, and what’s left on the table.

The hardware

CPU: MIPS Allegrex @ 333 MHz, in-order
FPU: scalar fp32 + a 4×4 VFPU (vector) coprocessor
RAM: 64 MB (PSP-2000/3000); 32 MB on the original PSP-1000
OS: XMB, no virtual memory, no mmap, no swap
Output: 480×272 LCD, no stdout the host can read

The “no mmap” line is the one that bites. On a Linux box you’d mmap the weights file and let the page cache handle it. On the PSP you have sceIoLseek + sceIoRead and a single malloc’d arena. You read all 17 MB into RAM before forward pass #1, or you stream from the memory stick at roughly the speed of a USB 1.1 thumb drive and watch your throughput collapse.

The PSP-1000’s 32 MB is not enough to leave heap room for the weights plus KV cache plus working buffers. The 2000 and 3000 ship with 64 MB. We need the 64.

The model

stories15M is the smallest of Karpathy’s TinyStories checkpoints: 6 transformer layers, hidden size 288, 6 attention heads, vocab 32000. About fifteen million parameters total. At fp32, ~57 MB. At int8 q80 (symmetric per-group quantization, group size 64, one fp32 scale per group), ~17 MB.

Architecture: Llama-style decoder, RoPE, SwiGLU FFN
Layers: 6
Hidden: 288
Heads: 6 (head_dim 48)
Vocab: 32000
Context: 256 tokens
Quantization: int8 q80 (group=64, symmetric)
On-disk size: 17 MB

The model prep is its own Docker image: python:3.11-slim + cpu-only torch + a pinned commit of karpathy/llama2.c. It downloads stories15M.pt, runs export.py --version 2 to produce the q80 model.bin, builds the BPE tokenizer.bin, and, importantly, also builds Karpathy’s runq.c reference with -ffp-contract=off -fno-fast-math and runs it on a fixed prompt to produce tests/expected.txt. That file is the byte-exact x86 reference the PSP gets diffed against. More on this in a minute.

The memory budget

24 MB of heap, declared once at module load:

PSP_HEAP_SIZE_KB(24576);

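The headline sizes can be checked with a few lines of arithmetic. This sketch uses only the dimensions quoted in this post; the helper names are hypothetical, and it assumes a 4-byte `float`.

```c
#include <stddef.h>

/* Dimensions quoted in the post for stories15M. */
enum { N_LAYERS = 6, DIM = 288, SEQ_LEN = 256, VOCAB = 32000 };

/* KV cache: layers x context x hidden x 2 (K and V) x sizeof(fp32). */
size_t kv_cache_bytes(void) {
    return (size_t)N_LAYERS * SEQ_LEN * DIM * 2 * sizeof(float);
}

/* Token embedding table if it were held fully dequantized in fp32. */
size_t embedding_fp32_bytes(void) {
    return (size_t)VOCAB * DIM * sizeof(float);
}

/* Heap requested by PSP_HEAP_SIZE_KB(24576). */
size_t heap_bytes(void) {
    return (size_t)24576 * 1024;
}
```

The KV cache comes to 3,538,944 bytes (about 3.5 MB). An fp32 embedding table would be 36,864,000 bytes, which on its own exceeds the 25,165,824-byte heap.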
The 24 MB of heap is spent as:

| Region | Size | Notes |
|---|---|---|
| Weights (int8 quantized) | ~17 MB | single malloc’d arena, slurped via sceIoLseek + sceIoRead chunks |
| KV cache | ~3.5 MB | 6 layers × 256 ctx × 288 hidden × 2 (K+V) × fp32 |
| RunState working buffers | ~1 MB | activations, attention scores, sampled logits |
| Stack, libc, framework | ~2 MB | the PSPSDK’s overhead |
| Slack | ~0.5 MB | |

The trick worth calling out: the token embedding table stays quantized in the arena. The naive port dequantizes it once at load time, which costs ~36 MB and immediately OOMs. Instead, on each forward pass we dequantize a single row, the row for the current token, into a small fp32 buffer. The cost is one extra dequant per forward pass; the win is ~36 MB we don’t have.

The kernels

transformer.c is the usual suspects: rmsnorm, softmax, quantize/dequantize, matmul, RoPE, attention, SwiGLU, sampler. Each one is the textbook version with -ffp-contract=off forced, so that multiplies and adds are never fused and the arithmetic matches runq.c on x86 operation for operation. That matters for the test surface (see below).

The matmul today is scalar fp32: three nested loops, one fp32 multiply-add at a time. On real hardware it gets ~1–2 tok/s. That’s slow enough that a 64-token completion takes about a minute.

The matmul is also factored as a swappable function pointer. The v1 plan is a VFPU kernel that uses the 4×4 vector ops, which should hit ~5–15 tok/s on the same hardware. That’s a one-file change to drop in. (The VFPU is the one piece of PSP hardware that ages well: a vector coprocessor with 128 registers addressable as eight 4×4 matrices, capable of dispatching a 4×4 matrix multiply in a single instruction.)

The UI

The PSP has a system on-screen keyboard you invoke via sceUtilityOsk*. It returns text as UTF-16LE; you convert it to UTF-8 (BMP only; the PSP’s OSK doesn’t reach into surrogate pairs) and feed it to the BPE tokenizer.

The chat UI is pspDebugScreen, the PSP’s built-in debug font on the framebuffer. Monospace, 8×8 pixels, 60 columns × 34 rows on the 480×272 display. Two-color layout: the prompt at the top, generated tokens streaming below it character by character. When the buffer hits the bottom of the screen, the rendering wraps. It’s not pretty, but it’s legible, and every character on the screen is something the model actually emitted.

The...
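
The UTF-16 to UTF-8 step described in the UI section can be sketched as below. This is an illustrative version, not the post's code: the function name `utf16_bmp_to_utf8` is hypothetical, the input is assumed to be NUL-terminated 16-bit code units already in host byte order (the PSP and x86 are both little-endian), and stray surrogate code units are simply skipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert NUL-terminated UTF-16 code units (BMP only) to UTF-8.
 * Returns the number of bytes written, excluding the terminating NUL.
 * Surrogate code units (0xD800..0xDFFF) are skipped. Output is
 * truncated at a character boundary if the buffer is too small. */
size_t utf16_bmp_to_utf8(const uint16_t *in, char *out, size_t cap) {
    size_t n = 0;
    if (cap == 0) return 0;
    for (; *in != 0; in++) {
        uint16_t c = *in;
        if (c >= 0xD800 && c <= 0xDFFF) continue;
        size_t need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
        if (n + need >= cap) break;  /* keep room for the NUL */
        if (c < 0x80) {
            out[n++] = (char)c;
        } else if (c < 0x800) {
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (char)(0xE0 | (c >> 12));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Because BMP code points never need more than three UTF-8 bytes, an output buffer of three bytes per input code unit plus one for the NUL is always large enough.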

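The swappable matmul mentioned in the kernels section might look roughly like this. It is a sketch under assumptions: the names `matmul_fn`, `matmul_scalar`, and `matmul` are hypothetical, and the reference kernel works on plain fp32 weights for brevity, whereas the post's kernel operates on the quantized format.

```c
#include <stddef.h>

/* Matrix-vector product: out[d] = W[d x n] * x[n], W stored row-major. */
typedef void (*matmul_fn)(float *out, const float *x, const float *w,
                          int n, int d);

/* Scalar fp32 reference: one multiply-add at a time, accumulated in a
 * fixed left-to-right order so results are reproducible. */
static void matmul_scalar(float *out, const float *x, const float *w,
                          int n, int d) {
    for (int i = 0; i < d; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++) acc += w[(size_t)i * n + j] * x[j];
        out[i] = acc;
    }
}

/* The active kernel. A faster implementation can be installed by
 * assigning another function with the same signature. */
matmul_fn matmul = matmul_scalar;
```

Every call site goes through `matmul`, so replacing the kernel is a single assignment and the scalar version stays available as a reference to diff against.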
