M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

Today's models no longer fit the mold of autoregressive token generation, but the systems supporting LLM inference have not kept up. These models have composite architectures best captured by dataflow graphs. Requests are just walks on these graphs. M* is designed to fit this paradigm and maximize flexibility and performance for current and future composite models. In our tests, M* achieves nearly 2.7× higher throughput vs. vLLM-Omni and 4× higher throughput vs. SGLang-Omni while maintaining a lower real-time factor (RTF) than both on the Qwen3-Omni TTS workload.

Atindra Jha, Naomi Sagan, Keisuke Kamahori, Xikai (Noah) Meng, Rohan Sanda, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

Stanford University · University of Washington · Carnegie Mellon University

atindra@cs.stanford.edu

June 2026

[Figure: a request walks the graph. Components shown: text encoder, ViT encoder, LLM backbone, image_gen (flow loop ×50), VAE decoder. Group labels: Encoder, Backbone, Decoder. Highlighted walk: decode → image.]

Inference is no longer a single loop

LLM serving systems like vLLM and SGLang are built on one assumption: that inference is a single autoregressive loop — prefill the prompt, then decode one token at a time until the model stops. The newest multimodal models break that assumption. Five families make it concrete:

UMMs — BAGEL

SpeechLMs — Orpheus

Omni — Qwen3-Omni

VLAs — π0.5

World models — V-JEPA 2

They are composite: built from structurally distinct components — vision encoders, transformer backbones, diffusion and flow heads, audio codecs, action and world-model predictors — wired together in patterns that change with the input. They add non-AR loops (diffusion image generation, variable-horizon world-model rollouts), internal parallelism (the branches of classifier-free guidance; the pipelined Thinker–Talker of an omni model), and input-dependent paths (in BAGEL, generating an image and understanding one traverse different components of the same model).
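To make the contrast between the classic loop and a non-AR loop concrete, here is a toy sketch. It is not M* code: the function names and the stand-in "models" are hypothetical, and the three-step refinement merely stands in for a diffusion or flow loop.

```python
from typing import Callable, List


def autoregressive_decode(step: Callable[[List[int]], int],
                          prompt: List[int], eos: int, max_new: int) -> List[int]:
    """Classic LLM loop: one token per iteration, stops on a data-dependent EOS."""
    tokens = list(prompt)
    for _ in range(max_new):
        nxt = step(tokens)
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens


def fixed_step_refine(update: Callable[[List[float], int], List[float]],
                      latent: List[float], num_steps: int) -> List[float]:
    """Non-AR loop: refine the whole latent for a fixed number of steps."""
    for t in range(num_steps):
        latent = update(latent, t)
    return latent


# Toy stand-ins (hypothetical, not real models).
def toy_step(tokens: List[int]) -> int:
    return tokens[-1] + 1  # "predict" the previous token plus one


def toy_update(latent: List[float], t: int) -> List[float]:
    return [0.5 * v for v in latent]  # halve every value at each step


print(autoregressive_decode(toy_step, [1], eos=4, max_new=10))  # prints [1, 2, 3, 4]
print(fixed_step_refine(toy_update, [8.0, 4.0], num_steps=3))   # prints [1.0, 0.5]
```

The first loop emits one token at a time and ends when the data says so; the second touches the entire latent on every iteration and runs for a count fixed up front. A runtime built around only the first shape has no natural place for the second.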

M* serves all of them from a single runtime. On the models we have benchmarked, M* matches or beats the specialized system built for each — by up to 2.7× on speech and image serving, and 12.5× on world-model rollouts. The rest of this post shows how M* works, starting with code.

Figure 1. Two composite architectures as graphs of components — BAGEL (a UMM: vit_encoder, vae_encoder, an LLM backbone, vae_decoder) and Qwen3-Omni (an omni model: Thinker, Talker, Code2Wav). Structurally diverse; each is naturally a graph.

Why today's serving stacks fall short

Composite models pose three challenges at once: architectural diversity (many paths, non-AR loops), performant modularity (HuggingFace Transformers is flexible but slow; vLLM and VoxServe are fast but domain-locked), and physical topology (heterogeneous components want different placement, batching, and transport).

vLLM and SGLang are superb at autoregressive text, but they are modality-locked: built for text generation, with image (and even text) inputs supported only as prefill-time encoder add-ons, and a single decode loop whose output is always text. There is no first-class way to compose heterogeneous components into loops and parallel branches (so no classifier-free guidance (CFG) fan-out), and no cross-component streaming. vLLM-Omni and SGLang-Omni go further, modeling a request as a flat pipeline of stages wired by explicit data-transfer functions, which is enough for a Thinker–Talker–codec chain. But iteration stays inside a single stage and stages cannot be composed in parallel, so patterns such as diffusion loops or CFG fan-out must be added per-model as glue code. In vLLM-Omni, for instance, BAGEL's CFG runs through a bespoke plugin built on torch.distributed.
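For readers unfamiliar with CFG fan-out, the pattern can be sketched as two branches that run side by side and then join. This is a generic illustration, not how M* or vLLM-Omni implement it: guided_combine, cfg_step, and toy_denoiser are hypothetical names, threads stand in for real parallel execution, and the combine rule is the standard CFG formula (unconditional output plus a scaled difference).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

Vec = List[float]


def guided_combine(cond: Vec, uncond: Vec, scale: float) -> Vec:
    """Standard CFG combine: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def cfg_step(denoiser: Callable[[Vec, Optional[str]], Vec],
             latent: Vec, prompt: str, scale: float) -> Vec:
    """Fan out to a conditional and an unconditional branch, then join."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_future = pool.submit(denoiser, latent, prompt)  # branch 1: with prompt
        uncond_future = pool.submit(denoiser, latent, None)  # branch 2: without prompt
        return guided_combine(cond_future.result(), uncond_future.result(), scale)


def toy_denoiser(latent: Vec, prompt: Optional[str]) -> Vec:
    """Hypothetical stand-in: shifts the latent by 1.0 only when a prompt is given."""
    shift = 1.0 if prompt is not None else 0.0
    return [v + shift for v in latent]


print(cfg_step(toy_denoiser, [0.0, 2.0], "a cat", scale=3.0))  # prints [3.0, 5.0]
```

The two branches share an input and have no dependency on each other, which is why a flat pipeline of stages has to bolt the pattern on, while a graph abstraction can express it as a parallel composition.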

We built M* because we wanted to make it easier for current and future composite models to achieve state-of-the-art efficiency. We found that current systems could be generalized into the M* Walk Graph.

                | vLLM-Omni             | SGLang-Omni       | M* (ours)
Graph node      | Engine-instance stage | Worker-pool stage | Model component
Composition     | Flat DAG              | Flat DAG          | Seq. / Par. / Loop / Stream
Paths per model | Prefill, decode       | Prefill, decode   | Flexible
Loops           | Within a stage        | Within a stage    | Across any subgraph
Placement       | Stage                 | Stage             | Component, w/ optional Walk

Table 1. Each prior abstraction is a restricted subset of the Walk Graph.

The Walk Graph, by example

In M*, a model is declared as a graph of model-component nodes connected by tensor edges, plus a set of named Walks. Each Walk is a labeled subgraph for one phase of behavior. A request is a series of Walks, chosen by a small state machine the model author writes. The author provides only the graph and the Walks. Everything physical (placement, scheduling, batching, tensor transport, streaming) is the runtime's job.
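The shape of that declaration can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the M* API: WalkGraph, add_walk, run_walk, serve, and next_walk are hypothetical names, a Walk is simplified to a linear path, the node functions are toy stand-ins, and the loop runs 3 steps for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = dict


@dataclass
class WalkGraph:
    nodes: Dict[str, Callable[[State], State]] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)
    walks: Dict[str, List[str]] = field(default_factory=dict)

    def add_walk(self, name: str, path: List[str]) -> None:
        # A Walk may only visit declared nodes along declared edges.
        for node in path:
            if node not in self.nodes:
                raise ValueError(f"unknown node {node!r}")
        for src, dst in zip(path, path[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src} -> {dst}")
        self.walks[name] = path

    def run_walk(self, name: str, state: State) -> State:
        for node in self.walks[name]:
            state = self.nodes[node](state)
        return state


def serve(graph: WalkGraph,
          next_walk: Callable[[Optional[str], State], Optional[str]],
          state: State) -> Tuple[List[str], State]:
    """A request is a series of Walks picked by the author's state machine."""
    trace: List[str] = []
    walk = next_walk(None, state)
    while walk is not None:
        state = graph.run_walk(walk, state)
        trace.append(walk)
        walk = next_walk(walk, state)
    return trace, state


def visit(name: str, bump_steps: bool = False) -> Callable[[State], State]:
    """Toy node: records its name; the flow node also counts loop steps."""
    def run(state: State) -> State:
        new = dict(state, visited=state["visited"] + [name])
        if bump_steps:
            new["steps"] = state["steps"] + 1
        return new
    return run


graph = WalkGraph(
    nodes={"encoder": visit("encoder"), "backbone": visit("backbone"),
           "flow": visit("flow", bump_steps=True), "decoder": visit("decoder")},
    edges=[("encoder", "backbone"), ("backbone", "flow"), ("flow", "decoder")],
)
graph.add_walk("prefill", ["encoder", "backbone"])
graph.add_walk("denoise", ["flow"])
graph.add_walk("decode_image", ["decoder"])


def next_walk(prev: Optional[str], state: State) -> Optional[str]:
    if prev is None:
        return "prefill"
    if state["steps"] < 3:
        return "denoise"  # loop the flow Walk until 3 steps are done
    return "decode_image" if prev == "denoise" else None


trace, final = serve(graph, next_walk, {"visited": [], "steps": 0})
print(trace)  # prints ['prefill', 'denoise', 'denoise', 'denoise', 'decode_image']
```

Nothing in the sketch says where a node runs or how tensors move between nodes; in the design described above, those decisions belong to the runtime rather than the model author.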

Figure 2. BAGEL in M*: its components as graph nodes (four core, plus combine_cfg...
