Wan Streamer v0.1: End-to-End Real-Time Interactive Foundation Models

Overview

Wan Streamer is a native-streaming, end-to-end interactive foundation model, designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. It models language, audio, and video as both input and output within a single Transformer: the sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming.

Overview of Wan Streamer. Language, audio, and video are modeled as both input and output within a single Transformer, coordinated by block-causal attention for incremental streaming generation.
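As a rough illustration of the block-causal attention mentioned above, the sketch below builds the attention mask for a short sequence split into streaming blocks. It assumes the common definition (full attention inside a block, causal attention across blocks); the block sizes and the function name `block_causal_mask` are hypothetical and are not taken from Wan Streamer.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption: tokens attend freely within their own block and to every
    earlier block, but never to a later block.
    """
    # block_id[t] is the index of the streaming block that token t belongs to.
    block_id = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_id)
    return [[block_id[j] <= block_id[i] for j in range(n)] for i in range(n)]


# Hypothetical layout: three streaming blocks holding 2, 3, and 2 tokens.
mask = block_causal_mask([2, 3, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx.....
# xx.....
# xxxxx..
# xxxxx..
# xxxxx..
# xxxxxxx
# xxxxxxx
```

Under this assumption no token depends on a later block, so a newly arrived block can be appended to the sequence without changing what earlier tokens were allowed to see.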

To support natural audio-visual responsiveness, the entire stack is redesigned around streamability (causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling), enabling streaming units as short as 160 ms at 25 fps. Wan Streamer reaches roughly 200 ms model-side response latency, and about 550 ms total interaction latency once combined with 350 ms of bidirectional network latency, supporting sub-second duplex audio-visual communication.
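The figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers stated in the text; the variable names are illustrative, and the step of treating total latency as a plain sum of model-side and network latency follows the text's own description.

```python
FPS = 25                  # video frame rate stated in the text
UNIT_MS = 160             # shortest streaming unit, in milliseconds
MODEL_LATENCY_MS = 200    # approximate model-side response latency
NETWORK_LATENCY_MS = 350  # bidirectional network latency

frame_ms = 1000 / FPS                  # duration of one frame: 40.0 ms
frames_per_unit = UNIT_MS / frame_ms   # frames in one streaming unit: 4.0
total_ms = MODEL_LATENCY_MS + NETWORK_LATENCY_MS  # 550 ms

print(f"{frames_per_unit:.0f} frames per unit, {total_ms} ms total")
# 4 frames per unit, 550 ms total
```

In other words, a 160 ms unit at 25 fps corresponds to four video frames, and the 550 ms total stays below one second.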

How it compares

Real-time interaction systems split into two camps. Speech-only systems answer fast but produce no visible agent: there is no synchronized face, gaze, or motion. Audio-visual systems do render an avatar, but they are assembled from external ASR, language, TTS, and animation modules, which adds latency at every boundary, and most never report an end-to-end response number. Wan Streamer is the only model that delivers a synchronized audio + video response from one end-to-end Transformer, and does so well under a second.

Response latency (seconds; lower is better)

End-to-end interaction loop (perceive → respond). Where two values are given, the first is model-side / first response and the second is the total, incl. network & pipeline.

Wan Streamer (speech + video): 0.2 s · 0.55 s

GPT-4o Realtime (speech): 0.23 s · ~0.8 s

Doubao Voice (speech): 0.7 s · ~1.0 s

Gemini Live (speech): 1.2–3.6 s

Rendering stage only (external LLM / ASR / TTS excluded). Values cover rendering only and exclude the external language model.

LPM 1.0 (render-only): ~0.35 s

OmniForcing (render-only): ~0.7 s

Hallo-Live (render-only): 0.94 s

StreamAvatar (render-only): ~1.2 s

The two groups measure different things. The top group consists of full end-to-end interaction loops, which perceive the user and produce a response; only Wan Streamer also outputs video. The bottom group consists of avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so in under 0.6 s. Numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.

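| System          | Perceives video | Outputs video | Full-duplex | End-to-end | Sub-1s response |
|-----------------|-----------------|---------------|-------------|------------|-----------------|
| Wan Streamer    | ✓               | ✓             | ✓           | ✓          | ✓               |
| Doubao Voice    | ✓               | ✗             | ✓           | ✗          | ~               |
| GPT-4o Realtime | ✓               | ✗             | ~           | ✗          | ✓               |
| StreamAvatar    | ~               | ✓             | ~           | ✗          | ✗               |
| LPM 1.0         | ~               | ✓             | ✓           | ✗          | ~               |

✓ yes · ~ partial / not disclosed · ✗ no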

Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates, understanding and responding at the same time. Wan Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.

Agent demos

Each demo below is generated by the same model, with a different person, voice, and scene in each. Open a clip to watch a prerecorded face-to-face interaction, with the user-side video shown in the corner. Clips are unedited model outputs rather than a live online...
