Two Leaps to 1000 Tokens/s on a 1T-Parameter Model — TileRT
In recent years, the optimization of Large Language Model (LLM) inference systems has revolved almost entirely around three pillars: kernels, operators, and scheduling. A collective path dependency has formed across the industry—the belief that as long as we squeeze every ounce of performance out of a kernel and push FLOPs utilization to its absolute limit, the system will naturally approach the theoretical boundaries of the hardware.
For a long time, this logic was ironclad. The baseline bottleneck of system performance was dominated by “pure, brute-force computation.”
Execution overheads stemming from host launches, hardware synchronizations, runtime scheduling, and global memory transactions certainly existed, but they were naturally masked by dense, heavy compute blocks.
However, as inference enters the battlefield of ultra-low latency, the rules of the game have changed.
The proliferation of autonomous Agents, real-time code completion, voice interaction, and Test-Time Scaling is aggressively driving the infrastructure toward a latency-first era. Users are no longer just looking at aggregate throughput; they are counting every millisecond of responsiveness. When inference speed directly dictates the rollout budget, search depth, and overall quality of agentic interactions, those underlying execution overheads—once safely hidden behind massive compute cycles—are suddenly exposed on the critical path.
Against this backdrop, scaling the performance of flagship models from dozens of Tokens Per Second (TPS) to over 1000 TPS is not a mere linear improvement. It represents a fundamental departure from traditional optimization paradigms.
A system running at dozens of TPS and one running at 1000+ TPS operate in entirely different hardware regimes.
The First Leap: The Execution Model Revolution
When we profile down to the bare metal, we find that the true bottleneck throttling performance is not that any single kernel runs too slowly. Rather, it is that the entire execution stream is constantly fractured at the microsecond scale by disjointed operator boundaries.
In traditional inference frameworks, a model is decomposed into a vast collection of isolated operators. Every single kernel launch carries inherent costs: Host-side launch latency, hardware synchronization, and round-trips to Global Memory. Under old compute-bound scenarios, these fixed costs were marginal. But when the system pushes for ultra-low latency—squeezing every microsecond out of a single Token—these gaps, once hidden, manifest as a glaring Execution Gap.
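The arithmetic behind this shift can be sketched with a toy model. The snippet below is only an illustration, not TileRT code or measured data: the kernel count, the per-kernel compute time, and the 5 µs fixed cost are all hypothetical numbers, and `tokens_per_second` is a made-up helper.

```python
def tokens_per_second(num_kernels, compute_us, gap_us):
    """Decode rate when one token needs `num_kernels` sequential kernel launches.

    compute_us: useful work per kernel, in microseconds
    gap_us:     fixed per-kernel cost (launch, sync, global-memory round-trip)
    """
    per_token_us = num_kernels * (compute_us + gap_us)
    return 1e6 / per_token_us


# Hypothetical numbers, chosen only to make the effect visible.
NUM_KERNELS = 1000   # launches per decoded token
GAP_US = 5.0         # fixed overhead per launch

# Compute-heavy regime: 95 us of work hides a 5 us gap (5% of the time).
heavy = tokens_per_second(NUM_KERNELS, compute_us=95.0, gap_us=GAP_US)

# Low-latency regime: 1 us of work per kernel, same 5 us gap (~83% of the time).
light = tokens_per_second(NUM_KERNELS, compute_us=1.0, gap_us=GAP_US)

# Same low-latency work with the gap removed entirely.
no_gap = tokens_per_second(NUM_KERNELS, compute_us=1.0, gap_us=0.0)

print(f"heavy: {heavy:.1f} TPS, light: {light:.1f} TPS, no gap: {no_gap:.1f} TPS")
```

In this toy setting the fixed cost is the same in both regimes; what changes is the share of each token's time it occupies. Once the compute per kernel shrinks, removing the gap matters far more than making the kernel itself faster.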
TileRT was architected from day one to eliminate these execution gaps born from operator boundaries.
Execution Gap — the boundaries of the traditional execution model.
To solve this, we introduced a completely different execution model. Instead of relying on the traditional ‘operator-by-operator launch’ pattern on the GPU, the entire computational pipeline is consolidated into a single, cohesive Persistent Engine that runs continuously inside the hardware.
The core benefit of this persistent paradigm is that it unlocks an end-to-end continuous prefetching capability. While the current Tile is being processed inside the Compute Cores, subsequent data has already begun flowing ahead of time through the multi-level memory hierarchy: from Global Memory, through Shared Memory, and into the registers.
Our Tile-level pipelining further dissects data movement, tensor computation, and communication into finer-grained physical Tiles, achieving much deeper overlap within the silicon.
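The payoff of overlapping data movement with computation can be estimated with a simple two-stage pipeline model. This is a sketch under stated assumptions rather than a description of TileRT's scheduler: it assumes one load stage and one compute stage per Tile, uniform stage times, and perfect overlap; the function names and the timings are hypothetical.

```python
def serial_time(n_tiles, load_us, compute_us):
    """Each tile is loaded, then computed, with no overlap."""
    return n_tiles * (load_us + compute_us)


def pipelined_time(n_tiles, load_us, compute_us):
    """Tile i+1 is prefetched while tile i is being computed.

    Only the first load and the last compute are fully exposed; in steady
    state each tile costs as much as the slower of the two stages.
    """
    if n_tiles == 0:
        return 0.0
    return load_us + (n_tiles - 1) * max(load_us, compute_us) + compute_us


# Hypothetical per-tile timings in microseconds.
t_serial = serial_time(8, load_us=2.0, compute_us=3.0)
t_pipe = pipelined_time(8, load_us=2.0, compute_us=3.0)
print(t_serial, t_pipe)
```

Under these assumptions the pipelined time approaches `n_tiles * max(load_us, compute_us)` as the number of Tiles grows, so finer Tiles shrink the exposed prologue and epilogue relative to the steady state.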
Within this pipeline, Warp Specialization shatters the old rigid, serial execution pace, assigning dedicated Warp groups to distinct, coordinated roles. The introduction of Heterogeneous Workers then scales this specialization strategy beyond a single Streaming Multiprocessor (SM), extending it across the GPU’s entire execution domain.
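The division of roles can be illustrated with a CPU-side analogy. The sketch below is not GPU code and not TileRT's implementation: two Python threads stand in for a "load" Warp group and a "compute" Warp group, a bounded queue stands in for a small ring of Shared Memory slots, and `run_specialized` and its `depth` parameter are hypothetical names.

```python
import queue
import threading


def run_specialized(tiles, depth=2):
    """Producer 'loads' tiles; consumer 'computes' on them via a bounded buffer."""
    buf = queue.Queue(maxsize=depth)  # analogue of a few shared-memory slots
    results = []

    def producer():
        for tile in tiles:
            buf.put(list(tile))       # blocks while every slot is occupied
        buf.put(None)                 # sentinel: no more tiles

    def consumer():
        while True:
            tile = buf.get()          # blocks until a slot has been filled
            if tile is None:
                break
            results.append(sum(x * x for x in tile))  # stand-in for tensor math

    workers = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


print(run_specialized([[1, 2], [3, 4], [5]]))
```

The point of the analogy is the structure: each role runs its own loop at its own pace, and the only coupling between them is the bounded buffer, which lets the loader run ahead by up to `depth` Tiles.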
Consequently, the GPU evolves from a traditionally homogeneous parallel compute device into a continuously flowing, tightly orchestrated, heterogeneous execution system.
This is where the system transcends the boundaries of legacy execution abstractions. The leap from dozens of TPS to hundreds of TPS is not achieved by local optimizations of standalone kernels; it is powered by a fundamental paradigm shift in the execution model itself.
The Second Leap: Breaking the 1000 TPS Barrier
Microsecond-Scale Bottleneck Triage and Hardware-Software Co-Design
Reconstructing the execution model allows us to cross the first order-of-magnitude boundary, climbing from dozens to hundreds of TPS. For flagship models, this already means the system is rapidly encroaching upon the physical limits of the hardware.
But the real engineering cliff lies here: When a system is already sitting on top of the hardware ceiling, how do you double performance again to break 1000 TPS?
In this extreme performance regime,...