Optimizing Models to Be Fast at Codegen

An edit is mostly a copy of the file it edits. The agent rereads the same repo every turn. Its context this turn is mostly its context from last turn. A general inference stack throws all of that away and decodes every token like it has never seen anything before.

That waste is the opportunity. The weights are a free download. The speed is the product.

We serve open models (Qwen, GLM, DeepSeek, MiniMax) for one workload: the coding agent. Making them fast comes down to three things the open stack won't do for you.

Train the speculator. A draft trained on the model's own coding output, not the internet. Generic draft: 1.93x. Trained on the target: 3.07x.

Autoresearch the kernels. A kernel is correct or it isn't, so we search them automatically, on the cheap GPUs nobody else tunes for. 97 to 162 tok/s on a $7K card.

Write the interconnect. All-reduce over PCIe, and a prefix cache that crosses NVLink-denied boxes over plain TCP.

Each is a place the general stack stopped and we kept going.

1. We train the speculator. The open stack ships you an empty socket.

Speculative decoding: a small draft model guesses the next few tokens, the target checks them in one pass, and you keep the run until the first miss. One number decides everything: acceptance rate, how often the target keeps the draft's guess.
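The loop is easier to see in code. What follows is a minimal sketch of one greedy draft-then-verify step, not our serving code. The `NextToken` interface, `speculative_step`, and the two toy models are hypothetical names made up for illustration. A real server scores all drafted positions in one batched target pass; the sketch calls the target once per position to keep the logic readable.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed token in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First miss: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every guess was kept, so the target pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


# Toy models (assumptions for the demo): the target counts upward mod 10,
# and the draft agrees except that it guesses 0 after seeing a 3.
def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [1], k=4))  # [2, 3, 4]
```

The output is always what the target alone would have produced under greedy decoding. The draft changes only how many target passes it takes to get there.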

A generic draft is a bad guesser. On Vicuna-13B an off-the-shelf 68M draft gets 1.93x; a draft trained on the target's own output gets 3.07x, same target, same setup. That gap is the section.

[Figure: Speculative decoding. The draft proposes, the target verifies in one pass, and the run is kept until the first miss. Generic draft (68M, off-the-shelf): verification stops after "const x = fetch(", 1.93×. Draft trained on code (drafts diffs it has seen): verification stops after "const user = await db.get(", 3.07×.]

More accepted tokens per step means fewer target passes. A draft trained on the model's own coding output keeps a longer run than a generic one on the same target.
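To put a rough number on "more accepted tokens per step", here is a back-of-the-envelope sketch. It assumes each drafted token is accepted independently with the same probability, which real workloads do not satisfy, and it ignores the cost of running the draft. The function name is hypothetical. Under that assumption, the expected tokens per target pass with draft length k is the geometric sum 1 + a + a^2 + ... + a^k.

```python
def expected_tokens_per_pass(acceptance: float, k: int) -> float:
    """Expected tokens emitted per target pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability `acceptance`. Counts the one token the target always
    contributes (the correction on a miss, or the extra token on a full run).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if acceptance == 1.0:
        return float(k + 1)
    return (1.0 - acceptance ** (k + 1)) / (1.0 - acceptance)


# Compare a weaker and a stronger draft at the same draft length.
for a in (0.6, 0.8):
    print(a, round(expected_tokens_per_pass(a, k=5), 2))
```

The tokens-per-pass figure is an upper bound on speedup, not the speedup itself, because the draft's own forward passes are not free.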

The architectures are public and good. EAGLE-3 lets the draft train on raw data instead of copying the target's features, and acceptance length climbs from 3.96 to 6.62. DFlash, SGLang's Spec V2 since June 2026, drafts a whole block in one pass: over 6x lossless, 3.2x on HumanEval where EAGLE-3 gets 2.2x.

But an architecture is an empty socket. Nobody hands you a drafter trained on your target, for your workload. You train it, or you run the generic one and eat the 1.93x.

Training a good drafter is small-model training, and that is the part we are good at. Fast Apply and Compact made us one of the best teams in the world at it. The thing you learn under 30B: the frontier scaling laws stop applying. Chinchilla says ~20 tokens per parameter is compute-optimal, but that assumes training is the cost. For a model you train once and serve billions of times, inference is the cost, and the optimum slides hard toward small and overtrained.

Llama 3: still improving at 15T tokens, two orders of magnitude past its Chinchilla point.

SmolLM2: a 1.7B model trained to 11T, near 6,500 tokens per parameter.

Sardana et al.: 47 models trained to 10,000 tokens per parameter, quality still climbing.

A speculator lives exactly there. Small, overtrained, shaped to one distribution.
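The tokens-per-parameter arithmetic behind those examples is short enough to check by hand. The sketch below uses the ~20 tokens-per-parameter rule of thumb quoted above. The helper names are hypothetical, and the 8B parameter count for Llama 3 is an assumption, since the figure above names the model family without a size.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # rule of thumb quoted above, approximate


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(tokens: float, params: float) -> float:
    """How many times past the ~20 tokens/param point a run went."""
    return tokens_per_param(tokens, params) / CHINCHILLA_TOKENS_PER_PARAM


# SmolLM2: 1.7B parameters, 11T tokens (both figures from the text above).
print(round(tokens_per_param(11e12, 1.7e9)))   # 6471
print(round(overtraining_factor(11e12, 1.7e9)))  # 324

# Llama 3 at 15T tokens. ASSUMPTION: the 8B variant; the size is not stated above.
print(round(overtraining_factor(15e12, 8e9)))  # 94
```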

So we train one per open model, on coding output instead of web text. Generated code reuses templates and the symbols already on screen, and an edit is mostly a copy of the file it edits. A draft that has read a million diffs predicts those tokens. One that read the internet doesn't, which is why code is the highest-speedup task for every speculation method. For Fast Apply we draft 64 tokens a step straight off input-output similarity: apply runs at 10,500 tok/s, compaction at 33,000. Same Qwen weights you can download. Ours is faster because the speculator riding it was trained, by us, on the work.
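Drafting "straight off input-output similarity" is not spelled out further here, so the following is only an illustration of the general idea, in the style of n-gram lookup drafting. It is an assumed technique, not a description of Fast Apply. When the output is mostly a copy of the input file, the last few generated tokens can be found in the source and the tokens that follow them proposed as the draft. `copy_draft` and its parameters are hypothetical.

```python
from typing import List, Sequence


def copy_draft(source: Sequence[int], generated: Sequence[int],
               ngram: int = 3, max_draft: int = 64) -> List[int]:
    """Propose up to max_draft tokens copied from `source`.

    Finds the last `ngram` generated tokens inside `source` (searching from
    the end) and returns the tokens that follow the match. Returns an empty
    list when there is nothing to copy, in which case the caller would fall
    back to a model-based draft or plain decoding.
    """
    if ngram <= 0 or len(generated) < ngram:
        return []
    key = list(generated[-ngram:])
    for start in range(len(source) - ngram, -1, -1):
        if list(source[start:start + ngram]) == key:
            follow = start + ngram
            continuation = list(source[follow:follow + max_draft])
            if continuation:  # skip a match at the very end of the source
                return continuation
    return []


# Toy token ids: the output has just re-emitted tokens 11, 12, 13 of the file.
source_file = [10, 11, 12, 13, 14, 15, 16]
so_far = [99, 11, 12, 13]
print(copy_draft(source_file, so_far))  # [14, 15, 16]
```

The proposal still goes through the target's verification step, so a bad copy costs only a wasted guess, never a wrong token.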

2. We autoresearch the kernels. Everyone else hand-tunes for H100s.

The agent's prompt barely changes between turns. Same system prompt, same tools, same repo, the same files read again. Across real workloads, programming traffic shares 97% of its prefix tokens, with prompts 37x to 2,494x longer than the outputs. Cache the prefix and the next request pays only for the new tokens. Hit rate is the cost.

The cache abstraction is open and we use it. RadixAttention holds prefixes in a tree. A cache-aware router takes hit rate from 20% to 75%. HiCache spills the tree to host RAM and remote storage and, on Qwen3-Coder-480B, moves hit rate from 40% to 80% and doubles throughput.
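The prefix-tree idea can be sketched in a few lines. This is a toy token-level trie, not RadixAttention itself. A real radix tree compresses runs of tokens into single edges, stores KV-cache blocks on the nodes, and evicts under memory pressure. The toy version only records which prefixes have been seen and how many tokens of each request would still need prefill. All class and method names are hypothetical.

```python
from typing import Dict, Sequence


class _Node:
    __slots__ = ("children",)

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class PrefixCache:
    """Toy prefix cache: tracks seen token prefixes and a token-level hit rate."""

    def __init__(self) -> None:
        self.root = _Node()
        self.hit_tokens = 0
        self.total_tokens = 0

    def match_len(self, tokens: Sequence[int]) -> int:
        """Length of the longest already-cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node, n = nxt, n + 1
        return n

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for t in tokens:
            if t not in node.children:
                node.children[t] = _Node()
            node = node.children[t]

    def serve(self, tokens: Sequence[int]) -> int:
        """Record a request; return how many tokens still need prefill."""
        cached = self.match_len(tokens)
        self.hit_tokens += cached
        self.total_tokens += len(tokens)
        self.insert(tokens)
        return len(tokens) - cached

    @property
    def hit_rate(self) -> float:
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0


cache = PrefixCache()
print(cache.serve([1, 2, 3, 4]))        # 4: cold cache, everything is new
print(cache.serve([1, 2, 3, 4, 5, 6]))  # 2: only the two new tokens
print(cache.serve([1, 2, 9]))           # 1: shares the [1, 2] prefix
```

Each agent turn that extends the previous prompt pays only for its new suffix, which is the effect the hit-rate numbers above are measuring.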

None of that is the hard part. The hard part is kernels. A cache only pays if the lookup, the eviction, the copy, and the attention over the tree are all fast on the GPU you actually run, and default kernels are tuned for the cards frontier labs buy. Port one across architectures without retuning and it runs at 7% of optimal. Reaching state of the art on AMD's MI250 took rewriting 40% of a flash-attention...
