DeepSeek V4 Pro at 5% the cost of Claude — what it takes to close the gap
Howard’s Newsletter
Hash-anchored edits, a sticky prefix cache, and the autonomous loops we run on production code
Howard Chen<br>Jun 16, 2026
We’ve been using DeepSeek V4 Pro as our daily-driver coding model for a few months now, through a Go-based terminal harness we built called cwcode. Not for benchmarks. For actual work: training dose-prediction models for radiotherapy, building a financial research agent, writing the harness’s own code.<br>DeepSeek V4 Pro charges $0.435 per million input tokens, $0.003625 per million on cache hits, and $0.87 per million on output. Claude Sonnet 4 sits around $3 / $0.30 / $15 for the same three. At those list prices, that is roughly 7× cheaper on input and 17× cheaper on output, with the cache-hit spread wider still at around 80×. If you trust the headline numbers on coding benchmarks, V4 Pro lands somewhere around 80–85% of Claude on long tasks. We’d put it at 90% in our actual workflow, but we had to build a lot of the gap-closing into the harness ourselves. None of the off-the-shelf agents we tried get there on V4 Pro.<br>This is the writeup of what it took: where the gap is real, what closed it, and what we still can’t fix without a better model.<br>Where the gap is real
There are a few things V4 Pro just doesn’t do as well as Claude, and that we don’t think harness work can fully fix:<br>- Long-horizon planning over unfamiliar code. Drop V4 Pro into a 50k-line codebase it’s never seen and ask it to refactor an architecture, and it’ll happily make four reasonable-looking edits that collectively don’t compile. Claude is noticeably better at holding the whole picture. We work around this with an explicit Plan mode and by keeping turns short.<br>- Reading sloppy code. Claude is more forgiving of weird naming, dead branches, and undocumented invariants. V4 Pro wants the code to make sense, and when it doesn’t, the model invents a sensible-looking version of what should be there.<br>- First-shot UI work. Claude’s first attempt at a React component is usually closer to “ready to ship.” V4 Pro is closer to “ready to iterate on.” For us, that’s fine; for someone using an agent to scaffold consumer apps, probably not.<br>Where V4 Pro is equal or better in practice:
- Following a precise spec. Give it `here’s the file, change line 47 to do X` and it does X. It is faster than Claude, and at 5% the cost we don’t care about a few retries.<br>- Numerical and scientific code. This one surprised us. On our PyTorch training loops and Monte Carlo simulation glue, V4 Pro’s first attempts are noticeably more correct than Claude’s. We suspect the training mix.<br>- Bash and ops glue. Dead even.<br>So the harness’s job is to lean into V4 Pro’s strengths (precise execution against a clear spec) and structurally compensate for its weaknesses (planning, ambiguity tolerance). Most of what follows is one of those two things.<br>The single biggest harness change: hash-anchored editing
In February, Can Akay published [a post on coding-agent edit-tool design](https://blog.can.ac/2026/02/12/the-harness-problem/) that we think is the most important piece of writing on coding agents this year. Akay’s claim: most agent failures aren’t model failures, they’re harness failures, and specifically failures of the harness’s edit tool. Asking a model to reproduce file content character-perfect, which is what `old_string` / `new_string` and patch formats both require, burns tokens on retries and makes weaker models look much worse than they are.<br>His proposed fix is what he calls “hashlines.” Annotate every line you show the model with a short content hash, and let the model edit by reference, not reproduction. Akay showed Grok Code Fast jumping from 6.7% to 68.3% on SWE-bench Verified just from this format change. Output tokens dropped 61% because the model wasn’t generating the same `old_string` block three times to land one edit.<br>We implemented this two weeks ago. Our `read_file` tool now returns:<br>1:5c2| package tools<br>2:a1f|<br>3:0eb| import "os"<br>4:dbd| import "strings"
Each line carries three hex characters: a hash of its content with trailing whitespace trimmed. The new `edit_lines` tool takes line ranges with the expected hash at each endpoint:<br>{<br>"path": "internal/tools/read.go",<br>"edits": [{<br>"from": 3, "from_hash": "0eb",<br>"to": 3, "to_hash": "0eb",<br>"new_text": "import \"fmt\""<br>}]<br>}
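For readers who want to see the read side concretely, here is a minimal sketch of how such an annotation could be produced. It is not cwcode's actual code: the names `lineHash` and `annotate` are hypothetical, and the choice of SHA-256 truncated to three hex characters is an assumption, since the hash function itself isn't specified here. The hashes it prints will therefore not match the sample values shown earlier.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// lineHash returns three hex characters derived from the line's content
// with trailing whitespace trimmed. SHA-256 is an assumption made for this
// sketch; any stable content hash would serve the same purpose.
func lineHash(line string) string {
	trimmed := strings.TrimRight(line, " \t\r")
	sum := sha256.Sum256([]byte(trimmed))
	return hex.EncodeToString(sum[:])[:3]
}

// annotate renders file content as "N:hhh| text", one entry per line,
// with 1-indexed line numbers.
func annotate(content string) string {
	lines := strings.Split(strings.TrimSuffix(content, "\n"), "\n")
	var b strings.Builder
	for i, line := range lines {
		fmt.Fprintf(&b, "%d:%s|", i+1, lineHash(line))
		if line != "" {
			b.WriteString(" " + line)
		}
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	fmt.Print(annotate("package tools\n\nimport \"os\"\nimport \"strings\"\n"))
}
```

Three hex characters give only 4,096 possible values, so collisions between unrelated lines are expected. That is tolerable in a scheme like this because the hash is always paired with a line number: it only has to detect that a specific line changed, not identify the line globally.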
The harness reads the file fresh, recomputes hashes, and rejects the entire batch on any mismatch with a precise error:<br>edit_lines: line 3 hash mismatch — claimed "0eb", actual "dbd".<br>Current line: "import \"strings\""
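The validate-then-apply behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the harness's real implementation: the `Edit` struct mirrors the request fields shown earlier, while `lineHash`, `checkEndpoint`, and `applyEdits` are hypothetical names, SHA-256 is an assumed hash, and the overlap check is an extra safeguard added for this sketch that the post does not mention.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Edit replaces lines From..To (1-indexed, inclusive) with NewText.
type Edit struct {
	From     int
	FromHash string
	To       int
	ToHash   string
	NewText  string
}

// lineHash: three hex chars of SHA-256 over the right-trimmed line (assumed hash).
func lineHash(line string) string {
	sum := sha256.Sum256([]byte(strings.TrimRight(line, " \t\r")))
	return hex.EncodeToString(sum[:])[:3]
}

// checkEndpoint compares one claimed hash against the current content.
func checkEndpoint(lines []string, n int, claimed string) error {
	if n < 1 || n > len(lines) {
		return fmt.Errorf("edit_lines: line %d out of range (file has %d lines)", n, len(lines))
	}
	if actual := lineHash(lines[n-1]); actual != claimed {
		return fmt.Errorf("edit_lines: line %d hash mismatch: claimed %q, actual %q. Current line: %q",
			n, claimed, actual, lines[n-1])
	}
	return nil
}

// applyEdits validates every edit before changing anything, so a single
// stale hash rejects the whole batch. The input slice is never modified.
func applyEdits(lines []string, edits []Edit) ([]string, error) {
	// Work bottom-up so applying one edit never shifts another's line numbers.
	sorted := append([]Edit(nil), edits...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].From > sorted[j].From })

	// Pass 1: validate all endpoints and reject overlapping ranges.
	for i, e := range sorted {
		if e.From > e.To {
			return nil, fmt.Errorf("edit_lines: invalid range %d-%d", e.From, e.To)
		}
		if err := checkEndpoint(lines, e.From, e.FromHash); err != nil {
			return nil, err
		}
		if err := checkEndpoint(lines, e.To, e.ToHash); err != nil {
			return nil, err
		}
		if i > 0 && e.To >= sorted[i-1].From {
			return nil, fmt.Errorf("edit_lines: range %d-%d overlaps another edit", e.From, e.To)
		}
	}

	// Pass 2: apply to a copy of the content.
	out := append([]string(nil), lines...)
	for _, e := range sorted {
		tail := append([]string(nil), out[e.To:]...)
		out = append(out[:e.From-1], strings.Split(e.NewText, "\n")...)
		out = append(out, tail...)
	}
	return out, nil
}

func main() {
	file := []string{"package tools", "", `import "os"`, `import "strings"`}
	h := lineHash(file[2])

	// A fresh hash is accepted and line 3 is replaced.
	out, err := applyEdits(file, []Edit{{From: 3, FromHash: h, To: 3, ToHash: h, NewText: `import "fmt"`}})
	fmt.Println(strings.Join(out, "\n"), err)

	// "zzz" can never be a hex digest, so this edit is rejected.
	_, err = applyEdits(file, []Edit{{From: 3, FromHash: "zzz", To: 3, ToHash: h, NewText: `import "fmt"`}})
	fmt.Println(err)
}
```

Validating the full batch before applying any of it means the model either sees all of its edits land or gets one error describing the current state of the offending line, which it can use to retry without a fresh read.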
We didn’t deprecate the old exact-string `edit_file`. Sometimes the model wants to edit without doing a fresh read, and that’s fine. But `edit_lines` is the preferred path now, and the first-attempt failure rate dropped sharply. On V4 Pro our internal numbers show roughly half the retries per task and 30–40% fewer output tokens per session. Not a Grok-sized jump (V4 Pro was already better at exact-string matching than Grok), but enough that an autonomous...