Xiaomi MiMo, Explore and Love
June 8, 2026
MiMo-V2.5-Pro-UltraSpeed: Pushing 1T-Parameter Model Generation Speed to 1000 TPS
1. Xiaomi MiMo-V2.5-Pro-UltraSpeed: Speed Is the Ultimate Edge
From the first roaring race cars of the combustion age to the sonic boom that shattered the sound barrier, humanity's hunger for speed is written into our very DNA. The speed of AI reasoning is no different: it defines the boundaries of intelligence itself. When a model is fast enough, it stops being a tool you wait on and becomes an extension of your own thinking, responding in real time, iterating in an instant, and collaborating without friction.
Today, together with TileRT, we are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed, breaking the 1000 tokens/s decode-speed barrier on a 1-trillion-parameter model for the first time!
MiMo-V2.5-Pro-UltraSpeed real-time generation speed comparison (up to ~1200 tokens/s)
2. Limited-Time Access · Application-Based
The MiMo-V2.5-Pro-UltraSpeed API launches at the same time, at a limited-time promotional price of 3× that of MiMo-V2.5-Pro while delivering approximately 10× the generation speed: 3× the price, 10× the output experience. (API only; the Token Plan is not supported.)
Because high-speed inference resources are limited, MiMo-V2.5-Pro-UltraSpeed is available through an application-based, limited-time window. Approved users can access the API only during the trial period, from June 9 until 23:59 on June 23, 2026, Beijing Time (UTC+8), which is 08:59 PDT on June 23.
How to Apply
API platform: platform.xiaomimimo.com/ultraspeed. Trial slots are limited, and submitting an application does not guarantee approval. We will prioritize enterprises and professional developers with genuine business needs. For standard model access, please use the MiMo-V2.5 model series. For in-depth business partnerships around the UltraSpeed model, contact business-mimo@xiaomi.com.
Chat Experience (Free During Trial)
Approved users will receive free Chat access, valid within the two-week trial window.
Entry point: ultraspeed.xiaomimimo.com
To ensure quality and fairness under limited resources, the following rules apply: each account may join the queue up to 10 times per day, each session is capped at 30 minutes, and sessions idle for more than 5 minutes are released automatically.
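Speeds in this announcement are quoted as decode tokens per second. A minimal way to express that metric for your own streamed responses is sketched below: it computes throughput from per-token arrival timestamps. This is an illustration under stated assumptions, not the measurement method behind the published figures; how timestamps are collected from the API is not shown, and excluding time-to-first-token (queueing plus prefill) is one common convention that the post does not confirm.

```python
from typing import Sequence


def decode_tps(token_times: Sequence[float]) -> float:
    """Decode throughput (tokens/s) from per-token arrival timestamps in seconds.

    Assumption: the clock starts at the first generated token, so time-to-first-token
    is excluded. N timestamps span N - 1 inter-token gaps.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("last timestamp must come after the first")
    return (len(token_times) - 1) / elapsed


if __name__ == "__main__":
    # Synthetic stream (hypothetical values): first token at 0.3 s, then one token every 1 ms.
    times = [0.3 + i * 0.001 for i in range(1001)]
    print(round(decode_tps(times)))
```

With the ~10× ratio quoted above, the same response finishes in roughly one tenth of the decode time; for example, 10,000 tokens at 1000 tokens/s take about 10 s to decode.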
3. 1000 tokens/s: Not Just Fast, But a Paradigm Shift
At the trillion-parameter (1T) scale, breaking 1000 tps is far more than a faster typewriter: it fundamentally disrupts the paradigms of AI applications.
First, speed itself begins to turn into intelligence. Previously, when facing a hard problem, you could only "wait for one answer and pray it's correct." Now, within the same wall-clock time, the model can explore dozens of reasoning paths in parallel (Best-of-N or tree search), verifying and self-correcting automatically in the background. Raw speed is converted into depth of thought, directly raising reasoning quality.
Second, it lifts the productivity ceiling of coding agents. Previously, having AI write code meant developers waiting in front of their screens, bottlenecked by inference latency. At 1000 tps, code generation, and with it development throughput, accelerates at a paradigm level.
Most importantly, trillion-parameter models can now enter real-time decision loops. With each token generated in roughly a millisecond, short "think-respond" cycles allow 1T flagship models to plug seamlessly into time-critical scenarios such as high-frequency quantitative trading signal generation, instant anti-fraud interception, intelligent bidding, and real-time interactive dialogue. And when this capability is applied to surgical assistance and medical imaging analysis in life-or-death situations, AI speed is no longer just a measure of efficiency; it becomes leverage in the race against death. On the operating table, every second AI saves in completing lesion analysis and risk prediction gives the surgeon one more degree of freedom. This deepens our conviction that the ultimate significance of speed lies not merely in boosting productivity, but in enabling technology to help humanity live better.
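The claim that speed turns into reasoning quality can be made concrete with a small best-of-N sketch. It assumes a single sequential decode stream (real systems batch candidates in parallel, which changes per-stream speed), a hypothetical 30-second budget, 2,000 tokens per candidate, and a toy verifier; none of these numbers come from the post. Under those assumptions, 100 tokens/s fits one candidate while 1000 tokens/s fits fifteen, and best-of-N keeps whichever candidate the verifier scores highest. Tree search would replace the flat sampling loop with branching and pruning, which is not shown.

```python
import math
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def candidates_in_budget(budget_s: float, tokens_per_candidate: int, tps: float) -> int:
    """Complete candidates a single decode stream can finish within budget_s seconds."""
    return math.floor(budget_s * tps / tokens_per_candidate)


def best_of_n(generate: Callable[[int], T], score: Callable[[T], float], n: int) -> Tuple[T, float]:
    """Draw n candidates and return the one the verifier scores highest, with its score."""
    if n < 1:
        raise ValueError("budget too small for even one candidate")
    scored = [(c, score(c)) for c in (generate(i) for i in range(n))]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy task (hypothetical): find x with x * x == 144.
    # Candidate i is simply the guess i; the verifier penalizes the error.
    def generate(i: int) -> int:
        return i

    def score(x: int) -> float:
        return -abs(x * x - 144)

    for tps in (100, 1000):
        n = candidates_in_budget(budget_s=30, tokens_per_candidate=2000, tps=tps)
        print(tps, n, best_of_n(generate, score, n))
```

The flat sample-then-verify loop is the simplest form of test-time scaling; the verifier here is exact, whereas a real one (unit tests, a reward model, self-consistency voting) is itself imperfect.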
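For the real-time decision loops above, a back-of-the-envelope latency model is: time to first token plus the remaining tokens divided by decode speed. At 1000 tokens/s each additional token costs about 1 ms, so a short structured decision stays in the low hundreds of milliseconds while a long free-form answer does not. The 100 ms time to first token and the 50-token output used below are hypothetical placeholders; the post does not report either.

```python
def response_latency_ms(ttft_ms: float, output_tokens: int, tps: float) -> float:
    """Estimated end-to-end latency in ms: the first token arrives after ttft_ms,
    and each remaining token takes 1000 / tps milliseconds to decode."""
    if output_tokens < 1 or tps <= 0:
        raise ValueError("need output_tokens >= 1 and tps > 0")
    return ttft_ms + 1000.0 * (output_tokens - 1) / tps


if __name__ == "__main__":
    # Hypothetical 100 ms time-to-first-token and a 50-token decision.
    for tps in (100, 1000):
        print(tps, response_latency_ms(ttft_ms=100, output_tokens=50, tps=tps))
```

Once decode is this fast, time to first token (queueing, network, prefill) tends to dominate short responses, which is why the sketch keeps it as a separate term.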
4. Extreme Model-System Codesign
Achieving 1000+ tokens/s generation speed with a 1T flagship model is not the breakthrough of a single technique. It is the product of deep collaboration and extreme codesign between the MiMo model team and the TileRT system team. The industry's current approaches to similar extreme speeds typically rely on specialized hardware, such as Cerebras's wafer-scale integration or Groq's custom architecture built purely on on-chip SRAM. We chose a different path: achieving even faster inference on commodity GPUs through model-system codesign alone.
On the model side, we applied FP4 quantization to target the memory-bandwidth bottleneck of commodity hardware, dramatically shrinking the model and reducing memory-access overhead; at the same time, we introduced DFlash, an efficient speculative decoding method based on block-level masked parallel...
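FP4 reduces the bytes each weight occupies: at 4 bits per weight, 10^12 parameters take about 500 GB, versus about 1 TB at 8 bits and 2 TB at 16 bits, before counting per-block scale factors. Because decoding is typically limited by how fast weights can be streamed from memory, fewer bytes per weight means less memory traffic for every generated token. The post does not say which FP4 variant is used; the sketch below assumes the common E2M1 element format (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one absmax-derived scale per hypothetical block of 16 weights, written in pure Python for readability rather than speed.

```python
from typing import List, Tuple

# Non-negative magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4_blocks(weights: List[float], block: int = 16) -> Tuple[List[float], List[float]]:
    """Map each weight to a signed E2M1 value, using one scale per block chosen so
    the block's largest magnitude lands on the top grid point (6.0)."""
    if len(weights) % block:
        raise ValueError("len(weights) must be a multiple of block")
    codes: List[float] = []
    scales: List[float] = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        absmax = max(abs(x) for x in chunk)
        scale = absmax / E2M1_GRID[-1] if absmax > 0 else 1.0
        scales.append(scale)
        for x in chunk:
            # Round to the nearest representable magnitude (ties go to the smaller one).
            mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
            codes.append(mag if x >= 0 else -mag)
    return codes, scales


def dequantize_fp4_blocks(codes: List[float], scales: List[float], block: int = 16) -> List[float]:
    """Recover approximate weights: each code times its block's scale."""
    return [c * scales[i // block] for i, c in enumerate(codes)]


def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring scale-factor overhead."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    w = [6.0, -3.0, 0.4, 1.2] * 4
    codes, scales = quantize_fp4_blocks(w)
    print(dequantize_fp4_blocks(codes, scales)[:4])
    print(weight_gigabytes(1e12, 4), weight_gigabytes(1e12, 16))
```

Small blocks keep one outlier from crushing the precision of the whole tensor, at the cost of storing more scales; production formats also constrain the scale itself (for example to a power of two or an 8-bit float), which this sketch does not do.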
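The description of DFlash is cut off in this excerpt, so its block-level masked parallel drafting is not reproduced here. What can be sketched is the generic draft-then-verify loop that speculative decoding methods share: a cheap drafter proposes a block of k tokens, the large target model checks them, and the longest agreeing prefix is kept plus one token from the target. The sketch uses greedy acceptance and plain Python callables as hypothetical stand-ins for the two models; with greedy acceptance, the output is identical to greedy decoding with the target alone.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical model interface: given the context so far, return the greedy next token.
NextTokenFn = Callable[[Sequence[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextTokenFn, target: NextTokenFn, k: int) -> List[Token]:
    """One draft-then-verify round with greedy acceptance; returns the new tokens.

    A real system scores all k drafted positions with one batched target forward
    pass; this sketch calls `target` once per position only for readability.
    """
    # 1) Draft k tokens autoregressively with the cheap model.
    ctx = list(prefix)
    proposal: List[Token] = []
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) Verify: keep drafted tokens while they match the target's greedy choice.
    ctx = list(prefix)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) All drafts matched: keep the target's next prediction as a bonus token
    #    (in a batched verify pass it comes out of the same forward call).
    accepted.append(target(ctx))
    return accepted


if __name__ == "__main__":
    def target(ctx: Sequence[Token]) -> Token:  # toy "large model"
        return (len(ctx) * 3) % 7

    def draft(ctx: Sequence[Token]) -> Token:  # toy drafter, wrong when the context has 5 tokens
        return 99 if len(ctx) == 5 else target(ctx)

    print(speculative_step([0, 0], draft, target, k=4))
```

The speedup comes from step 2: verifying k tokens in one target pass costs roughly the same memory traffic as generating one, so every accepted draft token is nearly free.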