Snapcompact: SoTA Compaction — Instant, Local, Free. Pick 3 | Can.ac
Eval harness, font renderer, per-question records, white-box probes: omp. Running uv run final.py reproduces the API grid (~$35 cold, free from cache after); the representation runs need a local GPU with Qwen2.5-VL-7B.

Let’s start with the obligatory benchmark:

I am not a big fan of compaction. In every single harness, including my own, I’ve always felt like it “crippled” the model to the point where you would have been better off with a completely new session.

Eliding tool results is an okay alternative (instant, deterministic), but it is not always sufficient. It also occasionally confuses the model about tool calling. LLMs complete stories; if half of your story is [elided...], how confident do you think the model will be about using those tools?

Handoffs are as good as it gets, but unlike a plan, you don’t usually steer them, and when you don’t, agents waste precious context writing an unnecessarily detailed diary, followed by a TODO list that practically begs the next agent to declare the goal impossible and ship an “MVP” instead.

My thinking was essentially that if you need compaction often, you’re doing something wrong: the plan either has scope creep, or should have been explicitly orchestrated via subagents so that the main agent could stay responsible for the entire scope.

However, spoiled by the 1M context window, these days I often hit the 500k mark by the end of a session, which was a mortal sin in my book a few months ago. But long-horizon tasks do better when one coherent agent drives the plan uninterrupted, and that easily reaches those levels even with aggressive delegation.

So there I was, staring at the 5h usage limit bar going red while this thing grinned back at me, thinking: maybe I should compact regularly…

Enter: snapcompact. It turns out the adage “A picture is worth a thousand words” is quite literally true.

A 1568×1568 PNG fits about 40,000 characters of text in a 6×10 pixel font.
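
As a rough check on that capacity figure, here is a minimal sketch of the arithmetic. It assumes characters sit on a fixed grid with no margins, and that image cost follows Anthropic's documented estimate of width × height / 750 tokens, rounded up here. The function names and the 4-characters-per-token rule of thumb are illustrative assumptions, not part of the harness.

```python
import math

IMAGE_PX = 1568          # square image edge in pixels, from the text
CELL_W, CELL_H = 6, 10   # font cell size in pixels, from the text


def grid_capacity(image_px: int, cell_w: int, cell_h: int) -> int:
    """Characters that fit on a fixed grid with no margins (assumption)."""
    cols = image_px // cell_w   # whole character cells per row
    rows = image_px // cell_h   # whole text rows per image
    return cols * rows


def image_tokens(width_px: int, height_px: int) -> int:
    """Estimated image cost: pixels / 750. Rounding up is an assumption."""
    return math.ceil(width_px * height_px / 750)


def text_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rule-of-thumb text cost; the 4 chars/token ratio is an assumption."""
    return round(n_chars / chars_per_token)


chars = grid_capacity(IMAGE_PX, CELL_W, CELL_H)
print(chars)                              # 40716
print(image_tokens(IMAGE_PX, IMAGE_PX))   # 3279
print(text_tokens(chars))                 # 10179
```

Under those assumptions the 6×10 grid works out to 261 columns by 156 rows, or 40,716 characters.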
That’s ~10,000 tokens’ worth of text, billed by Anthropic’s pixel formula as 3,279. Do you see where I’m going with this?

This started as a joke (“free token glitch lol”). Then I benchmarked it, identified where it went wrong, cracked open Qwen’s attention layer, fixed the issues, and benchmarked it again. Now here I am writing it up, because it generalized remarkably well to frontier models.

0x0: A Stupid Experiment

It began with a 328 KB session log and a simple question: what if I just printed this thing out and started the session with it?

Attempt one was maximally greedy: Tom Thumb, a 3×5 pixel font, 122,696 characters in a single image.

I sent it to a fresh agent session, zero explanation, and got back:

The image appears to be pure noise with random pixels, which suggests it might be corrupted or a file that’s been misnamed as PNG.
Fair. Attempt two used the X11 6x10 font (glyphs actually designed for that cell size), 40,716 characters, with each text row cycling through six colors. Same model, and there it was:

It identified the session’s topic and quoted me back verbatim.

It named 18 identifiers from the log with 100% recall.

Asked about a single assignment in the bottom-most row of the image, where the log cuts off, it hedged (“I’d be guessing — possibly 0”) and still guessed the state right.

10k tokens of text, carried by 3,279 image tokens, recalled with near-perfect precision. Okay. Now I’m invested.

0x1: Optimizing the Fonts

How small can the font go? I swept some font configurations and asked the model to transcribe fixed regions, scoring edit similarity against ground truth:

| font | px²/char | chars/image | transcription | identifiers read |
|------|----------|-------------|---------------|------------------|
| 8×13 | 104 | 23,520 | 1.00 | 20/20 |
| 6×10 | 60 | 40,716 | 0.79 | 20/20 |
| 5×8 | 40 | 61,348 | 0.37 | 17/19 |
| 5×7 | 35 | 70,112 | 0.30 | 10/20 |
| 4×6 | 24 | 102,312 | 0.02 | 9/20 |

The cliff is sharp and it sits around 35–40 px² per character. Above it, exact transcription degrades but identifier-level recall stays weirdly strong: the model can’t reproduce every byte, but it reads the names. Below it, nothing.

The funny thing is, this section was worse than useless: this exact optimization comes back to bite us in a bit.

0x2: Thinking…

Anecdotes about my own log don’t generalize, so let’s get a proper benchmark: SQuAD v1.1, extractive questions with gold answers. The harness packs passages into chunks sized to each technique’s carrying capacity, samples 30 questions per chunk spread evenly (so answers land at every image row, top to bottom), and runs every technique over the same corpus: text (passed verbatim), handoff (a simple handoff prompt), compact (provider-side compaction where available, a summarization call otherwise), and img-{font}-{variant}, where the variant is bw (plain black-on-white) or sent (glyph ink...