DeepSeek V4 Pro and Flash vs. Claude Opus 4.7 and Kimi K2.6

We Tested DeepSeek V4 Pro and Flash Against Claude Opus 4.7 and Kimi K2.6

Kilo Blog

Darko May 13, 2026

DeepSeek V4 Pro and DeepSeek V4 Flash launched together on April 24, 2026 under the MIT license. They are DeepSeek’s first new architecture since V3, and its first open-weight lineup with two tiers (Pro as the flagship, Flash as the lightweight model).

We ran both through the same FlowGraph spec we used for Claude Opus 4.7 vs Kimi K2.6, with the same prompt and the same scoring rubric. TL;DR: DeepSeek V4 Pro scored 77/100 for $2.25 and lands between Opus 4.7 (91) and Kimi K2.6 (68) on performance. DeepSeek V4 Flash scored 60/100 for $0.02, a price point we have not seen on this test before, but its build failed and the output is missing some key pieces.

The Four Models We Compared

DeepSeek V4 Flash is the cheapest model in the comparison by a wide margin. Output tokens cost less than 1/14th of Kimi K2.6 and roughly 1/89th of Claude Opus 4.7. DeepSeek is also running a 75% off promotion on DeepSeek V4 Pro through May 31, 2026. Under the discount, DeepSeek V4 Pro input drops to roughly $0.036/M and output drops to $0.87/M, putting it below Kimi K2.6 on both axes. DeepSeek separately cut input cache pricing across the lineup to one-tenth of previous levels as a permanent change.

The Test

This is the same FlowGraph spec we used in the Opus 4.7 vs Kimi K2.6 run: a workflow orchestration backend with 20 endpoints, persistent state, lease management, retries, and event streaming. It is a heavier infrastructure test than our usual coding benchmarks, chosen to push the models to their limits. We ran DeepSeek V4 Pro and DeepSeek V4 Flash through the same setup to see where the new DeepSeek lineup lands on cost and first-pass quality next to Claude Opus 4.7 and Kimi K2.6.

The Prompt

We ran both DeepSeek models in Kilo CLI with the same prompt we used for Opus 4.7 and Kimi K2.6: “Read @SPEC.md and build the project in the current directory. Treat @SPEC.md as the source of truth. Do not simplify this into a mock, toy app, or basic CRUD scaffold. Create all code, configuration, Prisma schema, tests, and README needed for a runnable project.…”

Both DeepSeek models ran in thinking mode, each in its own empty directory with no shared state.

What Each Model Produced

DeepSeek V4 Pro passed its own test suite but the TypeScript build failed. DeepSeek V4 Flash’s test suite never ran because its setup script tried to force-reset the database in a way that errored out before the first test executed. If we had stopped at the model summaries, both DeepSeek implementations would look closer to Claude Opus 4.7 than they actually were. A direct code review plus targeted reproductions against isolated SQLite databases revealed the problems in both model outputs.

DeepSeek V4 Pro

DeepSeek V4 Pro got the broad shape of the system right. The endpoints are wired up, the test suite passes, and the project layout is reasonable. The issues we found are concentrated in the same places as with Kimi K2.6: lease expiry handling, scheduling, validation, and build integrity.

Timed-out workers can still complete steps

When a worker claims a step, the system gives it a lease that expires after a set timeout. If the worker stalls or crashes, the lease should expire and another worker should be free to pick up the step. Once the lease has expired, the original worker is no longer the owner of that step and shouldn’t be able to mark it as done.

DeepSeek V4 Pro enforces this on heartbeats but not on completions. We claimed a step, pushed its lease expiry into the past, then asked the API to mark the step as successfully completed. The API returned 200 and recorded the step as succeeded. The original worker effectively reached past its expired lease and finalized work it no longer owned. DeepSeek V4 Pro’s own README says workers cannot complete after their lease expires, but the implementation does not enforce that.

A full workflow blocks unrelated work

A workflow run can declare a maximum number of steps it is allowed to run in parallel. When that cap is reached, the saturated run shouldn’t accept more work, but other runs sharing the same queue should keep moving. DeepSeek V4 Pro’s claim logic checks one candidate at a time. If that candidate happens to belong to a run that is already at its parallel cap, the function gives up and returns nothing, instead of moving on to the next candidate.

We reproduced this with two active runs sharing a queue. Run A was at its parallel limit. Run B had capacity and a higher-priority step ready to go. The next claim request came back empty. In production this would look like workers idling while there is real work to do, just because the first run on the queue happens to be saturated.

The project does not build

npm test passes but npm run build does not. Even after the build errors are fixed, the project...
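To make the lease-expiry and claim-scheduling findings described earlier concrete, here is a minimal in-memory sketch of the behavior we expected. It is not code from SPEC.md or from any model's output: `StepQueue`, `Step`, `Run`, and every field name are hypothetical, and a real implementation would do these checks inside database transactions rather than on an array. It shows only the two rules: `claim` skips candidates whose run is at its parallel cap instead of returning empty, and `complete` rejects a worker whose lease has expired.

```typescript
// Hypothetical in-memory model. Names and shapes are illustrative assumptions.
type StepStatus = "queued" | "running" | "succeeded";

interface Step {
  id: string;
  runId: string;
  priority: number; // higher value is claimed first
  status: StepStatus;
  workerId?: string;
  leaseExpiresAt?: number; // epoch milliseconds
}

interface Run {
  id: string;
  maxParallel: number; // cap on concurrently running steps for this run
}

class StepQueue {
  private runs: Map<string, Run>;
  private steps: Step[];
  private leaseMs: number;

  constructor(runs: Map<string, Run>, steps: Step[], leaseMs: number) {
    this.runs = runs;
    this.steps = steps;
    this.leaseMs = leaseMs;
  }

  private runningCount(runId: string): number {
    return this.steps.filter((s) => s.runId === runId && s.status === "running").length;
  }

  // Walk every queued candidate in priority order. A saturated run is
  // skipped with `continue`; it does not end the search.
  claim(workerId: string, now: number): Step | null {
    const candidates = this.steps
      .filter((s) => s.status === "queued")
      .sort((a, b) => b.priority - a.priority);
    for (const step of candidates) {
      const run = this.runs.get(step.runId);
      if (!run || this.runningCount(run.id) >= run.maxParallel) continue;
      step.status = "running";
      step.workerId = workerId;
      step.leaseExpiresAt = now + this.leaseMs;
      return step;
    }
    return null;
  }

  // Completion succeeds only if the caller owns the step and its lease
  // has not expired. Requeueing expired steps is out of scope here.
  complete(stepId: string, workerId: string, now: number): boolean {
    const step = this.steps.find((s) => s.id === stepId);
    if (!step || step.status !== "running") return false;
    if (step.workerId !== workerId) return false;
    if (step.leaseExpiresAt === undefined || step.leaseExpiresAt <= now) return false;
    step.status = "succeeded";
    return true;
  }
}

// Usage: run A allows one parallel step, so its second step must not block run B.
const runs = new Map<string, Run>([
  ["A", { id: "A", maxParallel: 1 }],
  ["B", { id: "B", maxParallel: 1 }],
]);
const steps: Step[] = [
  { id: "a1", runId: "A", priority: 5, status: "queued" },
  { id: "a2", runId: "A", priority: 9, status: "queued" },
  { id: "b1", runId: "B", priority: 1, status: "queued" },
];
const queue = new StepQueue(runs, steps, 30000);

const first = queue.claim("w1", 0); // a2: highest priority, run A has capacity
const second = queue.claim("w2", 0); // b1: a1 is skipped because run A is at its cap
const late = queue.complete("a2", "w1", 30001); // false: the lease ended at 30000
const onTime = queue.complete("b1", "w2", 10000); // true: lease still valid
console.log(first?.id, second?.id, late, onTime);
```

In this sketch the saturated run's queued step has the higher priority, so the second claim only succeeds because the loop moves on to the next candidate. That is the step the DeepSeek V4 Pro claim logic skipped in our reproduction.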
