I feed my coding agent JSON instead of screenshots


by @bickov, May 29, 2026 (5 min read)

Claude Code can read images. So can Cursor. So can ChatGPT. I built SlimSnap anyway, and the reason is boring: the image is the wrong shape for the job.

Here is the boring version. The job is "show the agent what is on my screen and have it act." For that job a retina screenshot pasted into a coding session is somewhere between expensive and lossy, depending on how much you care about your token budget, your context window, and the agent acting on the right thing.

The token math

A screenshot pasted to Claude as a vision input is downscaled and billed at the API's per-image cap: about 1,568 tokens on Sonnet and Haiku (the models Claude Code uses by default), up to 4,784 tokens on Opus 4.7 and 4.8. Pasted to Codex CLI (which runs on OpenAI's GPT-4o), a typical 1440x900 screenshot in high detail mode runs about 1,105 tokens. Pasted to Gemini CLI on Gemini 2.5, the same image is about 1,548 tokens. The same screen, turned into a SlimSnap JSON document, runs about 700 tokens. That JSON contains the elements, their normalized bounding boxes, their extracted colors, and the OCR text for each.

That is about 55 percent fewer tokens than Sonnet or Gemini, 37 percent fewer than Codex, and up to 85 percent fewer than Opus, per turn. It is also the only representation of the four with structured intent the agent can act on.
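As a quick check on those percentages, here is a small sketch that uses only the per-turn figures quoted above. The dictionary keys and the `percent_fewer` helper are names made up for this example, and the 700-token JSON figure is the approximate one from this post.

```python
# Per-turn token figures as quoted in this post (approximate).
SCREENSHOT_TOKENS = {
    "Opus 4.7 / 4.8": 4784,
    "Sonnet / Haiku": 1568,
    "Gemini 2.5": 1548,
    "Codex CLI": 1105,
}
SLIMSNAP_JSON_TOKENS = 700


def percent_fewer(baseline: int, candidate: int) -> float:
    """Percentage by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline * 100


# Round to whole percentages, as the prose does.
savings = {
    name: round(percent_fewer(tokens, SLIMSNAP_JSON_TOKENS))
    for name, tokens in SCREENSHOT_TOKENS.items()
}

for name, pct in savings.items():
    print(f"{name}: about {pct}% fewer tokens per turn")
```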

Screenshot on Opus 4.7 / 4.8 (max billed): ~4,784 tokens
Screenshot on Sonnet / Haiku (max billed): ~1,568 tokens
Screenshot on Gemini 2.5 Pro / Flash (1440x900): ~1,548 tokens
Screenshot on Codex CLI / GPT-4o (1440x900, high detail): ~1,105 tokens
Same screen as SlimSnap JSON: ~700 tokens

Per-turn billed tokens. Anthropic caps single images at 1,568 tokens on Sonnet and Haiku, 4,784 on Opus 4.7+. OpenAI's high-detail formula first scales a 1440x900 screenshot so its short side is 768 px, which takes 3 x 2 = 6 tiles of 512 px, so the cost is 85 + 6 tiles x 170 = 1,105 tokens. Gemini 2.5 tiles with a crop unit of floor(min(w,h)/1.5), which is 600 px here, so a 1440x900 image is 3 x 2 = 6 tiles x 258 = 1,548 tokens. Cursor, Aider, GitHub Copilot, Cline, and Continue inherit whichever underlying model they're configured against (Claude or GPT-4o), so per-image costs match one of the bars above. SlimSnap JSON is the smallest single-screen representation across all of them, and the only one with structured intent.
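A minimal sketch of the arithmetic in that caption, for anyone who wants to plug in their own screen size. Both function names are invented for this example. The OpenAI steps follow the published high-detail rule as I understand it (fit within 2048x2048, shrink the short side to 768 px, count 512 px tiles). The Gemini version uses the crop-unit formula quoted above and assumes partial tiles round up, which is what yields 6 tiles for 1440x900. Treat both as estimates, not as billing logic.

```python
import math


def openai_high_detail_tokens(width: int, height: int) -> int:
    """Estimate high-detail image tokens: 85 base + 170 per 512 px tile."""
    # Step 1: shrink to fit inside a 2048 x 2048 square (no upscaling).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the short side is at most 768 px.
    # Assumption: images already smaller than that are left alone.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512 px tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def gemini_tile_tokens(width: int, height: int) -> int:
    """Estimate Gemini image tokens at 258 tokens per tile."""
    # Small images (both sides <= 384 px) count as a single 258-token unit.
    if width <= 384 and height <= 384:
        return 258
    # Crop unit as quoted in the caption: floor(min(w, h) / 1.5).
    crop_unit = math.floor(min(width, height) / 1.5)
    # Assumption: partial tiles are rounded up in each dimension.
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return 258 * tiles


print(openai_high_detail_tokens(1440, 900))
print(gemini_tile_tokens(1440, 900))
```

For a 1440x900 capture, both functions land on 6 tiles, which matches the figures in the caption.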

For a one-shot question that is a curiosity. For the way I actually use a terminal coding agent, which is a long iterative session where I show it the page state every few prompts, it stops being a curiosity. Twenty turns of screenshots on Sonnet burn about 31k tokens of vision input before you've said anything. Twenty turns on Codex CLI come to about 22k. On Opus 4.7+ it is about 96k. Twenty turns of SlimSnap JSON come to about 14k. On a 200k context window, that is the difference between finishing the refactor and getting compacted out mid-session.
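The session totals are plain multiplication, sketched here so you can swap in your own turn count. The per-turn numbers are the ones quoted in this post. The sketch assumes every capture stays in context for the whole session, with nothing compacted or evicted along the way.

```python
TURNS = 20
CONTEXT_WINDOW = 200_000  # tokens, the window size used in this post

# Per-turn figures as quoted above (approximate).
PER_TURN_TOKENS = {
    "Opus 4.7+ screenshot": 4784,
    "Sonnet / Haiku screenshot": 1568,
    "Codex CLI screenshot": 1105,
    "SlimSnap JSON": 700,
}

# Assumption: every capture remains in context, so costs simply add up.
totals = {name: per_turn * TURNS for name, per_turn in PER_TURN_TOKENS.items()}

for name, total in totals.items():
    share = total / CONTEXT_WINDOW * 100
    print(f"{name}: {total:,} tokens over {TURNS} turns ({share:.0f}% of the window)")
```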

If you are running an agent all day, the bill matters. If you are running it on a tough refactor, the context matters more.

Structure beats pixels

The token math is the part that wins HN comments. The part that actually matters to whether the agent is helpful is structure.

When you paste a screenshot, the agent has to look at pixels and infer everything: what is a button, what is text, what color is what, what label belongs to what input, where the user is pointing. It does this every turn, because raw pixels are not persistent reasoning state. If you ask a follow-up six prompts later, the agent goes back to the pixels. (Why it sees rather than reads: Claude doesn't OCR your screenshot, it interprets it.)

When you paste structured JSON, the agent reads facts. Element e4 is a button, bbox [0.34, 0.60, 0.32, 0.07] normalized, color #3B82F6, OCR text "Sign up". The next turn it does not re-interpret pixels, it references e4. The reasoning is grounded in the same primitives the next turn will use.

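[Figure: mockup of the signup form ("Create your account", an Email field with placeholder you@company.com, a Password field), with its elements labeled e1 through e4.]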

signup.json (~700 tokens)

{
  "schema_version": "1.0",
  "captured_at": "2026-05-19T18:17:46Z",
  "screen": { "title": "Create your account", "app": "Safari" },
  "image": { "width_px": 1440, "height_px": 900, "file": "signup.png" },
  "elements": [
    { "id": "e1", "type": "label", "value": "Create your account", "bbox": [0.34, 0.18, 0.32, 0.06] },
    { "id": "e2", "type": "input", "value": "Email", "bbox": [0.34, 0.34, 0.32, 0.07] },
    { "id": "e3", "type": "input", "value": "Password", "bbox": [0.34, 0.46, 0.32, 0.07] },
    { "id": "e4", "type": "button", "value": "Sign up", "bbox": [0.34, 0.60, 0.32, 0.07], "color": "#3B82F6" }
  ],
  "estimated_tokens": 712
}

Left: what the user sees. Right: what the agent reads. Same screen, but the agent can reference e4 instead of re-interpreting pixels every turn.
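To make "reference e4" concrete, here is a rough sketch of what a script (or an agent's tool) could do with that document. It assumes the bbox order is [x, y, width, height] as fractions of the image size, which fits the numbers in the example but is not spelled out here. The `bbox_to_pixels` helper is a hypothetical name, not part of SlimSnap.

```python
import json

# The signup.json example from this post, embedded so the sketch is self-contained.
SNAPSHOT = """
{
  "schema_version": "1.0",
  "image": { "width_px": 1440, "height_px": 900, "file": "signup.png" },
  "elements": [
    { "id": "e1", "type": "label", "value": "Create your account", "bbox": [0.34, 0.18, 0.32, 0.06] },
    { "id": "e2", "type": "input", "value": "Email", "bbox": [0.34, 0.34, 0.32, 0.07] },
    { "id": "e3", "type": "input", "value": "Password", "bbox": [0.34, 0.46, 0.32, 0.07] },
    { "id": "e4", "type": "button", "value": "Sign up", "bbox": [0.34, 0.60, 0.32, 0.07], "color": "#3B82F6" }
  ]
}
"""


def bbox_to_pixels(bbox, width_px, height_px):
    """Convert a normalized bbox to whole pixels.

    Assumption: bbox is [x, y, width, height], each a fraction of the image size.
    """
    x, y, w, h = bbox
    return (
        round(x * width_px),
        round(y * height_px),
        round(w * width_px),
        round(h * height_px),
    )


doc = json.loads(SNAPSHOT)

# Index elements by id so later turns can look up "e4" directly.
elements = {el["id"]: el for el in doc["elements"]}

button = elements["e4"]
pixels = bbox_to_pixels(
    button["bbox"], doc["image"]["width_px"], doc["image"]["height_px"]
)
print(button["type"], button["value"], button["color"], pixels)
```

The lookup by id is the point: a follow-up prompt six turns later can name e4 and get the same facts back without anyone re-reading pixels.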

Annotations carry the same property. A red rectangle in a PNG is a red rectangle. A red rectangle in SlimSnap JSON has an intent field, a target_ref pointing at the element it...
