Real-time LLM Inference on Standard GPUs (3,000 tokens/s per request)
TL;DR: we show that AI inference on GPUs can be super-fast, reaching the speed regime of dedicated inference hardware cards when optimizing the whole software stack with architecture/engine/kernel co-design. Test the speed in our live coding playground: playground.kog.ai.
This post explains why optimizing for single-request LLM decoding speed is important for AI agents; why it's primarily a memory-bandwidth maximization problem, not a FLOPS one; why standard datacenter GPU hardware has a much higher decoding-speed ceiling than current inference stacks expose, because of software bottlenecks; and how that ceiling can be reached (even on large MoE models) by co-designing the model architecture, runtime, and low-level GPU code as a single latency-optimized pipeline.
Our public tech preview is about proving that extremely fast single-request decoding is possible on the standard datacenter GPUs enterprises already own — including AI labs and sovereign-AI buyers. The limiting factor has been that existing inference software stacks are not optimized for this type of workload. Opening the GPU path could deliver that speed without the lock-in of proprietary silicon.
You can test the speed of our 2B coding model today. It's small and not a frontier model (we've been focused on speed rather than scale), though still quite capable when fine-tuned for specific software engineering tasks.
What autonomous agents change: single-request decode speed is now the metric that matters
Inference benchmarks typically conflate three quantities. Aggregate throughput (total tokens generated per second across all users) measures server utilization and rewards large batches. Time to first token measures prefill latency. Decode speed per request measures token generation speed and defines how long one user waits before receiving the full response. That last one governs every long serial interaction, and it's what AI agents are bottlenecked on.
Agentic software engineering is a sequential loop: inspect, plan, edit, test, revise. Each step depends on the previous one. Tool time sometimes dominates, as tests have to run and web pages have to load, but the generation-heavy steps (planning, code writing, trace analysis, debugging, refactoring) set the loop rate. Reasoning tokens add further generation time on top.
The numbers translate directly into product and user experience. If an agent needs to generate 50,000 tokens in a workflow, 100 tokens/s is roughly eight minutes; 3,000 tokens/s is under twenty seconds. The difference changes the product that can be built.
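The arithmetic behind those two figures can be checked with a minimal sketch. It assumes a constant decode speed and ignores prefill and tool time; the function name `generation_seconds` is illustrative, not part of any Kog API.

```python
def generation_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to decode `total_tokens` serially at a constant speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second


workflow_tokens = 50_000  # token budget used in the text

slow = generation_seconds(workflow_tokens, 100)    # 100 tokens/s
fast = generation_seconds(workflow_tokens, 3_000)  # 3,000 tokens/s

print(f"100 tokens/s:   {slow / 60:.1f} min")  # 8.3 min
print(f"3,000 tokens/s: {fast:.1f} s")         # 16.7 s
```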
As agents become more autonomous, the productivity frontier shifts from intelligence alone to intelligence × iteration speed. The best agents will generate more useful tokens, reason more, and perform more tool calls, tests, and revisions inside the same wall-clock budget.
This is why Kog optimizes single-request latency first, and why this preview runs at batch size 1. Large batches do matter and we will support them in production, but they answer a different question.
But what is limiting decode speed on GPUs?
Memory bandwidth is the primary bottleneck for fast token generation (and GPU nodes have plenty)
At batch size 1, autoregressive decoding is dominated by matrix-vector work. For each generated token, all the active weights of the model must move through the memory hierarchy inside the GPU, from HBM to compute processors. Thus, a first-order bound is:
tokens/s ≤ effective_memory_bandwidth / (β × active_weight_bytes + kv_cache_bytes)
where β ≥ 1 accounts for weight bytes being read more than once per token: it exceeds one when tiles are reloaded or cache reuse is imperfect.
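As a rough illustration, the bound can be written as a small function. This is a sketch under stated assumptions: the function and parameter names are ours, and the inputs in the example (1 TB/s of bandwidth, 10 GB of active weights, no KV cache traffic) are hypothetical round numbers chosen for easy arithmetic, not measurements.

```python
def decode_tokens_per_second_bound(
    effective_bandwidth_bytes_per_s: float,
    active_weight_bytes: float,
    kv_cache_bytes: float = 0.0,
    beta: float = 1.0,
) -> float:
    """First-order upper bound on batch-size-1 decode speed.

    Every generated token streams beta * active_weight_bytes plus the
    KV cache bytes through the memory system, so bandwidth divided by
    bytes per token caps the token rate.
    """
    if beta < 1.0:
        raise ValueError("beta must be >= 1 (weights are read at least once)")
    bytes_per_token = beta * active_weight_bytes + kv_cache_bytes
    return effective_bandwidth_bytes_per_s / bytes_per_token


# Hypothetical inputs, for illustration only.
ideal = decode_tokens_per_second_bound(1e12, 10e9)               # beta = 1
reloaded = decode_tokens_per_second_bound(1e12, 10e9, beta=1.25)

print(f"ideal reuse:   {ideal:.0f} tokens/s")     # 100 tokens/s
print(f"25% re-reads:  {reloaded:.0f} tokens/s")  # 80 tokens/s
```

The second call shows why β matters: re-reading a quarter of the weights lowers the ceiling by a fifth even though the hardware is unchanged.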
The key fact is that low-batch decode has very low arithmetic intensity. In FP16, a model weight occupies two bytes and contributes roughly one multiply-add (two FLOPs) per generated token, which is about 1 FLOP/byte. FP8 raises it to ~2 FLOPs/byte; FP4 to ~4. However, modern AI GPUs expose hundreds of peak FLOPs per byte of HBM bandwidth. NVIDIA's H200, for example, claims a peak balance of roughly 400 FLOPs/byte. Thus, token generation speed is capped by memory bandwidth long before it is limited by FLOPS.
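The gap between those two ratios can be made concrete with a short sketch. It takes the ~400 FLOPs/byte balance quoted above as given and, as a simplification, applies that single figure to every precision (in practice peak FLOPS also varies with the number format). The helper names are illustrative.

```python
FLOPS_PER_WEIGHT = 2.0  # one multiply-add per weight per generated token


def decode_arithmetic_intensity(bytes_per_weight: float) -> float:
    """FLOPs performed per byte of weights read at batch size 1."""
    return FLOPS_PER_WEIGHT / bytes_per_weight


def compute_fraction_needed(intensity: float, hardware_balance: float) -> float:
    """Share of peak compute a bandwidth-saturating decode can keep busy."""
    return intensity / hardware_balance


H200_BALANCE = 400.0  # FLOPs per byte, the rough figure quoted in the text

for name, bytes_per_weight in (("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)):
    intensity = decode_arithmetic_intensity(bytes_per_weight)
    share = compute_fraction_needed(intensity, H200_BALANCE)
    print(f"{name}: {intensity:.0f} FLOP/byte, {share:.2%} of peak compute")
```

Even at FP4, a decode that fully saturates memory bandwidth would use about 1% of peak compute under these assumptions, which is why the memory system sets the ceiling.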
This is why Memory Bandwidth Utilization (MBU) is the central metric for single-request speed, not Model FLOPs Utilization (MFU). MFU can still be improved by batching several requests together, but batching can increase the latency each user experiences, because more KV cache data has to be streamed through the GPU.
For batch-size-1 decode, more memory bandwidth equals more tokens generated per second. The good news is that memory bandwidth of GPUs is already very high. An 8× NVIDIA H200 node exposes roughly 30.7 TB/s of effective aggregate memory bandwidth (taking 80% of the 4.8 TB/s theoretical per GPU as a realistic ceiling). An 8× AMD MI300X node reaches about 33.6 TB/s in practice (assuming 4.2 TB/s achievable per GPU).
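The two node-level figures follow from simple multiplication, sketched below. The per-GPU numbers and the 80% factor are the ones stated above; the function name is illustrative, and the sketch assumes all eight GPUs contribute their bandwidth in parallel with no interconnect overhead.

```python
def node_effective_bandwidth_tbps(
    num_gpus: int, per_gpu_tbps: float, efficiency: float = 1.0
) -> float:
    """Aggregate memory bandwidth of a node in TB/s.

    `efficiency` scales a theoretical per-GPU peak down to a realistic
    ceiling; leave it at 1.0 when `per_gpu_tbps` is already an
    achievable figure.
    """
    return num_gpus * per_gpu_tbps * efficiency


# 8x H200: 4.8 TB/s theoretical per GPU, 80% taken as realistic.
h200_node = node_effective_bandwidth_tbps(8, 4.8, efficiency=0.8)

# 8x MI300X: 4.2 TB/s assumed achievable per GPU.
mi300x_node = node_effective_bandwidth_tbps(8, 4.2)

print(f"8x H200:   {h200_node:.1f} TB/s")    # 30.7 TB/s
print(f"8x MI300X: {mi300x_node:.1f} TB/s")  # 33.6 TB/s
```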
Let's take a 2B-parameter dense model in FP16 as an example. It has roughly 4 GB...