Kog Laneformer 2B: The Latency-First Model Behind Kog Inference Engine

Today Kog is releasing the weights and model code of Laneformer 2B, a 2.3B-parameter instruction-tuned coding model designed for high-speed decoding, on the Hugging Face Hub.

Most LLM research optimizes for benchmark quality first, and inference metrics like speed are often treated as a serving problem that comes later: train the model, then quantize it, shard it, batch inputs, cache inputs, and write better kernels.

Kog took a different route and treated speed as the first objective. What changes when a model is designed from the ground up to maximize decoding speed? Which architectural choices does that rule out, and which ones still preserve strong model performance?

This blog post is the story of how Kog trained Laneformer 2B from scratch into a capable coding model while respecting the hardware constraints required by our Kog Inference Engine and the budget constraints of a startup.

About Kog

Kog is a Paris-based AI infrastructure startup building a real-time inference engine for AI agents with innovative low-level GPU engineering and LLM architecture research.

For more background, see Kog's website and introductory blog post:

Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)

TL;DR

Kog designed a lane-structured Transformer architecture for high-speed single-request decoding on our inference stack.

Kog validated the custom architectural changes at small scale, then trained the final 2.3B model from scratch on ~4T pre-training tokens, continued on ~2T code/reasoning-heavy tokens, and instruction-tuned on ~210M tokens.

Kog shows that, even with moderate resources, it is possible to build and deploy a custom small language model with competitive coding benchmark results in its size range.

Laneformer 2B reaches 45.1% HumanEval+ and 51.6% MBPP+ in greedy decoding.

Kog releases the weights, Hugging Face model code and documentation as kogai‑laneformer‑2b‑it ↗

You can experience the accelerated version via our Kog Inference Engine on our playground ↗

The Laneformer 2B technical report is available on Hugging Face. Read the full report ↗

The idea

At low batch sizes, decode speed is not just a FLOPs problem. A lot of time goes into moving weights, synchronizing kernels, and paying communication costs layer after layer.
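To make the "not just a FLOPs problem" point concrete, here is a back-of-the-envelope sketch of the memory-bandwidth ceiling on batch-size-one decoding. It is not Kog's performance model: the function name, the 2 bytes per parameter (bf16) and the 3e12 bytes/s bandwidth are illustrative assumptions, and only the 2.3B parameter count comes from this post.

```python
def decode_speed_upper_bound(n_params, bytes_per_param, bandwidth_bytes_per_s, n_gpus=1):
    """Rough ceiling on batch-size-one decode speed, in tokens per second.

    Assumes every weight is read from GPU memory once per generated token and
    that weight reads are the only cost (no compute, KV cache, kernel launch
    or inter-GPU communication time), so real systems land below this number.
    """
    # With weights sharded evenly, each GPU only reads its own slice per token.
    bytes_per_gpu_per_token = n_params * bytes_per_param / n_gpus
    return bandwidth_bytes_per_s / bytes_per_gpu_per_token


# Assumed values: bf16 weights (2 bytes) and 3e12 bytes/s of memory bandwidth per GPU.
single = decode_speed_upper_bound(2.3e9, 2.0, 3.0e12, n_gpus=1)
four = decode_speed_upper_bound(2.3e9, 2.0, 3.0e12, n_gpus=4)
print(f"1 GPU : {single:.0f} tokens/s ceiling")
print(f"4 GPUs: {four:.0f} tokens/s ceiling, before any communication cost")
```

Under these assumptions, sharding the weights raises the ceiling linearly with the number of GPUs, which is exactly why the synchronization and communication overhead ignored here becomes the part worth attacking.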

This overhead increases even more in multi-GPU setups, where inter-GPU communication is introduced. At the model architecture level, Tensor Parallelism (TP) is a well-known way to split work across GPUs, but each layer forces the devices to stop and exchange results before moving on to the next layer.

This led us to a simple question: can we hide those communication costs instead of paying them at every layer?

Naive attempts to solve this problem can introduce ad hoc architectural changes that hurt model quality, and make the method difficult to apply to an existing pre-trained architecture without leaving performance on the table.

Fast inference does not require training a new model from scratch, and Kog's inference engine already achieves very high decoding speeds on standard pre-trained architectures through low-level GPU optimization. But to go further, the runtime can no longer be treated as a separate serving layer: the model architecture itself has to expose the right structure for the engine to exploit.

Those observations left us with a single conclusion: for the fastest single-request inference, architecture and runtime should be designed together. Laneformer is our first model trained from scratch to explore that co-design point.

As a small startup, we could not yet solve this through sheer scale. The target had to be deliberately constrained: design and train a small-scale model with strong coding capabilities and extreme decoding speed.

The story

Hiding overhead

Tensor Parallelism (TP) is effective because it splits large matrix operations across GPUs, but it pays for this with inter-GPU synchronization. At batch-size-one decoding, this cost is especially painful.
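For readers new to TP, the following is a minimal single-process sketch of the standard scheme for one MLP block: the first weight matrix is split by columns, the second by rows, and the per-device partial outputs are summed. It illustrates ordinary tensor parallelism, not Kog's engine or the Laneformer architecture; the "devices" are just loop iterations, and all names (`matvec`, `tp_mlp`, `n_devices`) are hypothetical.

```python
def matvec(x, W):
    """Compute x @ W for a vector x of length n and a matrix W with n rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def relu(v):
    return [max(0, a) for a in v]


def mlp(x, W1, W2):
    """Reference MLP block on a single device."""
    return matvec(relu(matvec(x, W1)), W2)


def tp_mlp(x, W1, W2, n_devices):
    """Same MLP block with the hidden dimension sharded across simulated devices."""
    hidden = len(W1[0])
    assert hidden % n_devices == 0, "hidden size must divide evenly across devices"
    shard = hidden // n_devices
    partials = []
    for d in range(n_devices):
        cols = range(d * shard, (d + 1) * shard)
        W1_d = [[row[j] for j in cols] for row in W1]  # column shard of W1
        W2_d = [W2[j] for j in cols]                   # matching row shard of W2
        # Each device works independently up to this point.
        partials.append(matvec(relu(matvec(x, W1_d)), W2_d))
    # Synchronization point: on real hardware this sum is an all-reduce,
    # and every device must wait for it before the next layer can start.
    return [sum(vals) for vals in zip(*partials)]


x = [1, -2, 3]
W1 = [[1, -1, 2, 0], [0, 3, -1, 1], [2, 0, 1, -2]]
W2 = [[1, 0], [2, 1], [-1, 3], [0, -2]]
assert tp_mlp(x, W1, W2, n_devices=2) == mlp(x, W1, W2)
print(tp_mlp(x, W1, W2, n_devices=2))
```

The final sum is the cost in question: it produces exactly the same output as the unsharded block, but it happens once per block, so a deep model pays it many times for every generated token.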

The obvious idea is to delay the communication introduced by TP. In practice, doing this naively hurts the model: once hidden states are no longer synchronized at the usual boundaries, quality drops off sharply, and targeted architectural changes become necessary to keep training stable and preserve quality.

We spent this phase testing variants at small scale. Interestingly, many of our more complex ideas either degraded quality or made the implementation too brittle. The useful lesson was almost embarrassingly simple: try the obvious thing first! Understand why it fails, then fix it with the smallest architectural change that works.

That path led to the mechanism we now call Delayed Tensor Parallelism (DTP). For the full mechanism, see our DTP deep dive.

Designing the architecture

Once DTP had a viable shape, the rest of the...
