M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

Today's models no longer fit the mold of autoregressive token generation, but the systems supporting LLM inference have not kept up. These models have composite architectures best captured by dataflow graphs. Requests are just walks on these graphs. M* is designed to fit this paradigm and maximize flexibility and performance for current and future composite models. In our tests, M* achieves nearly 2.7× higher throughput vs. vLLM-Omni and 4× higher throughput vs. SGLang-Omni while maintaining a lower real-time factor (RTF) than both on the Qwen3-Omni TTS workload.

Atindra Jha, Naomi Sagan, Keisuke Kamahori, Xikai (Noah) Meng, Rohan Sanda, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

Stanford University · University of Washington · Carnegie Mellon University

atindra@cs.stanford.edu

June 2026

[Figure: a graph with nodes text encoder, ViT encoder, LLM backbone, image_gen (flow loop ×50), and VAE decoder, ending in decode → image. Legend: Encoder, Backbone, Decoder.]

A request walks the graph.

Inference is no longer a single loop

LLM serving systems like vLLM and SGLang are built on one assumption: that inference is a single autoregressive loop. Prefill the prompt, then decode one token at a time until the model stops. The newest multimodal models break that assumption. Five families make it concrete:
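For reference, the single-loop assumption can be sketched in a few lines. This is a toy illustration, not code from vLLM or SGLang: `toy_prefill`, `toy_next_token`, and the `EOS` id are hypothetical stand-ins, and the "model" simply counts down so the loop terminates.

```python
from typing import List, Tuple

EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token(cache: List[int]) -> int:
    """Toy stand-in for one decode step: count down from the last token."""
    return max(cache[-1] - 1, EOS)


def toy_prefill(prompt: List[int]) -> Tuple[List[int], int]:
    """Process the whole prompt once; return a cache and the first new token."""
    cache = list(prompt)  # a real system would hold KV tensors here
    return cache, toy_next_token(cache)


def generate(prompt: List[int], max_new_tokens: int = 32) -> List[int]:
    """Prefill once, then decode one token at a time until EOS or the limit."""
    cache, token = toy_prefill(prompt)
    output: List[int] = []
    while token != EOS and len(output) < max_new_tokens:
        output.append(token)
        cache.append(token)
        token = toy_next_token(cache)
    return output
```

Everything the request does happens inside that one `while` loop, and its output is always a token sequence. The composite models that follow do not fit this shape.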

UMMs — BAGEL

SpeechLMs — Orpheus

Omni — Qwen3-Omni

VLAs — π0.5

World models — V-JEPA 2

They are composite: built from structurally distinct components (vision encoders, transformer backbones, diffusion and flow heads, audio codecs, action and world-model predictors) wired together in patterns that change with the input. They add non-autoregressive (non-AR) loops (diffusion image generation, variable-horizon world-model rollouts), internal parallelism (the branches of classifier-free guidance; the pipelined Thinker–Talker of an omni model), and input-dependent paths (in BAGEL, generating an image and understanding one traverse different components of the same model).
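To make the non-AR loop and the parallel branches concrete, here is a minimal sketch of classifier-free guidance inside a fixed-step generation loop. It uses the standard guidance formula (unconditional prediction plus a scaled difference) and a plain Euler update; the `predict` callable, the vector type, and the step count are assumptions for illustration, not BAGEL's or M*'s actual code.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical predictor signature: (current sample, conditioned?) -> prediction
Predictor = Callable[[Vector, bool], Vector]


def cfg_combine(cond: Vector, uncond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def guided_loop(x: Vector, predict: Predictor, steps: int, scale: float) -> Vector:
    """Fixed-length, non-autoregressive loop with a two-branch fan-out per step."""
    dt = 1.0 / steps
    for _ in range(steps):
        # The two branches are independent of each other, so a runtime could
        # execute them in parallel; this sketch runs them one after the other.
        cond = predict(x, True)
        uncond = predict(x, False)
        velocity = cfg_combine(cond, uncond, scale)
        x = [xi + dt * v for xi, v in zip(x, velocity)]  # Euler update
    return x


def toy_predict(x: Vector, conditioned: bool) -> Vector:
    """Toy predictor: constant 1.0 when conditioned, 0.0 otherwise."""
    return [1.0 if conditioned else 0.0 for _ in x]


sample = guided_loop([0.0, 0.0], toy_predict, steps=4, scale=2.0)
```

Note that the loop body is itself a small graph (two branches and a combine step), which is the structure a single decode loop has no natural place for.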

M* serves all of them from a single runtime. On the models we have benchmarked, M* matches or beats the specialized system built for each: by up to 2.7× on speech and image serving, and 12.5× on world-model rollouts. The rest of this post shows how M* works, starting with code.

Figure 1. Two composite architectures as graphs of components: BAGEL (a UMM: vit_encoder, vae_encoder, an LLM backbone, vae_decoder) and Qwen3-Omni (an omni model: Thinker, Talker, Code2Wav). Structurally diverse; each is naturally a graph.

Why today's serving stacks fall short

Composite models pose three challenges at once: architectural diversity (many paths, non-AR loops), performant modularity (HuggingFace Transformers is flexible but slow; vLLM and VoxServe are fast but domain-locked), and physical topology (heterogeneous components want different placement, batching, and transport).

vLLM and SGLang are superb at autoregressive text, but they are modality-locked: built for text generation, with image (and even text) inputs supported only as prefill-time encoder add-ons, and a single decode loop whose output is always text. There is no first-class way to compose heterogeneous components into loops and parallel branches, so there is no classifier-free guidance (CFG) fan-out, and no cross-component streaming. vLLM-Omni and SGLang-Omni go further, modeling a request as a flat pipeline of stages wired by explicit data-transfer functions, which is enough for a Thinker–Talker–codec chain. But iteration stays inside a single stage and stages cannot be composed in parallel, so patterns such as diffusion loops or CFG fan-out must be added per-model as glue code. In vLLM-Omni, for instance, BAGEL's CFG runs through a bespoke plugin built on torch.distributed.

We built M* because we wanted to make it easier for current and future composite models to achieve state-of-the-art efficiency. We found that current systems could be generalized into the M* Walk Graph.

| | vLLM-Omni | SGLang-Omni | M* (ours) |
|---|---|---|---|
| Graph node | Engine-instance stage | Worker-pool stage | Model component |
| Composition | Flat DAG | Flat DAG | Seq. / Par. / Loop / Stream |
| Paths per model | Prefill, decode | Prefill, decode | Flexible |
| Loops | Within a stage | Within a stage | Across any subgraph |
| Placement | Stage | Stage | Component, w/ optional Walk |

Table 1. Each prior abstraction is a restricted subset of the Walk Graph.
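The four composition modes in the last column of Table 1 can be illustrated with plain function combinators. This is only a sketch of the idea: `seq`, `par`, `loop`, and `stream` are hypothetical helper names, not M*'s API, and the components are toy integer functions.

```python
from typing import Any, Callable, Iterator, List

Fn = Callable[[Any], Any]


def seq(*fns: Fn) -> Fn:
    """Sequential composition: feed each output into the next component."""
    def run(x: Any) -> Any:
        for fn in fns:
            x = fn(x)
        return x
    return run


def par(*fns: Fn) -> Fn:
    """Parallel composition: fan the same input out to every branch."""
    def run(x: Any) -> List[Any]:
        return [fn(x) for fn in fns]  # run serially here; branches are independent
    return run


def loop(body: Fn, times: int) -> Fn:
    """Loop composition: repeat an arbitrary sub-pipeline a fixed number of times."""
    def run(x: Any) -> Any:
        for _ in range(times):
            x = body(x)
        return x
    return run


def stream(producer: Callable[[Any], Iterator[Any]], consumer: Fn) -> Callable[[Any], Iterator[Any]]:
    """Streaming composition: hand each chunk downstream as soon as it exists."""
    def run(x: Any) -> Iterator[Any]:
        for chunk in producer(x):
            yield consumer(chunk)
    return run


# Toy pipeline: encode, then loop over a two-branch fan-out plus combine, then decode.
encode = lambda x: x * 2
cond = lambda x: x + 1
uncond = lambda x: x - 1
combine = lambda pair: pair[1] + 2 * (pair[0] - pair[1])  # guidance-style mix
decode = str

pipeline = seq(encode, loop(seq(par(cond, uncond), combine), times=3), decode)
result = pipeline(1)
chunks = list(stream(lambda n: iter(range(n)), lambda c: c * 10)(3))
```

The key line is the `loop(...)` wrapping a `par(...)`: the iteration spans a subgraph with parallel branches, which a flat pipeline whose loops stay inside one stage cannot express directly.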

The Walk Graph, by example

In M*, a model is declared as a graph of model-component nodes connected by tensor edges, plus a set of named Walks. Each Walk is a labeled subgraph for one phase of behavior. A request is a series of Walks, chosen by a small state machine the model author writes. The author provides only the graph and the Walks. Everything physical (placement, scheduling, batching, tensor transport, streaming) is the runtime's job.
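The description above can be mocked up as follows. This is a hedged sketch under stated assumptions, not M*'s real interface: the `WalkGraph` class, the `next_walk` state machine, the `serve` driver, and the node and Walk names are all hypothetical, each Walk is simplified to an ordered path of nodes, and the components are toy functions over a dict instead of tensors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = dict
Node = Callable[[State], State]


@dataclass
class WalkGraph:
    nodes: Dict[str, Node]          # model components
    edges: List[Tuple[str, str]]    # stand-ins for tensor edges
    walks: Dict[str, List[str]]     # named Walks, simplified to node paths

    def validate(self) -> None:
        """Every Walk must use known nodes and follow declared edges."""
        for name, path in self.walks.items():
            for node in path:
                if node not in self.nodes:
                    raise ValueError(f"walk {name!r} uses unknown node {node!r}")
            for src, dst in zip(path, path[1:]):
                if (src, dst) not in self.edges:
                    raise ValueError(f"walk {name!r} needs missing edge {src}->{dst}")

    def run_walk(self, name: str, state: State) -> State:
        for node in self.walks[name]:
            state = self.nodes[node](state)
        return state


# Toy components (hypothetical); each returns an updated copy of the state.
def text_encoder(s: State) -> State: return {**s, "encoded": True}
def backbone(s: State) -> State: return {**s, "backbone_calls": s.get("backbone_calls", 0) + 1}
def flow_head(s: State) -> State: return {**s, "steps_left": s["steps_left"] - 1}
def vae_decoder(s: State) -> State: return {**s, "image": "decoded"}


graph = WalkGraph(
    nodes={"text_encoder": text_encoder, "backbone": backbone,
           "flow_head": flow_head, "vae_decoder": vae_decoder},
    edges=[("text_encoder", "backbone"), ("backbone", "flow_head"),
           ("flow_head", "vae_decoder")],
    walks={"prefill": ["text_encoder", "backbone"],
           "image_gen": ["backbone", "flow_head"],
           "decode_image": ["vae_decoder"]},
)


def next_walk(state: State) -> Optional[str]:
    """Author-written state machine: pick the next Walk, or None when done."""
    if not state.get("encoded"):
        return "prefill"
    if state["steps_left"] > 0:
        return "image_gen"
    if "image" not in state:
        return "decode_image"
    return None


def serve(g: WalkGraph, state: State) -> Tuple[State, List[str]]:
    """Toy driver: a request is the series of Walks the state machine selects."""
    g.validate()
    trace: List[str] = []
    walk = next_walk(state)
    while walk is not None:
        state = g.run_walk(walk, state)
        trace.append(walk)
        walk = next_walk(state)
    return state, trace


final_state, trace = serve(graph, {"steps_left": 3})
```

In this toy, the model author's part is only `graph` and `next_walk`. The `serve` function stands in for everything the post assigns to the runtime, where a real system would also decide placement, batching, and transport.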

Figure 2. BAGEL in M*: its components as graph nodes (four core, plus combine_cfg...
