Wan Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Overview
Wan Streamer is a native-streaming, end-to-end interactive foundation model, designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. It models language, audio, and video as both input and output within a single Transformer: the sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming.
Figure: Overview of Wan Streamer. Language, audio, and video are modeled as both input and output within a single Transformer, coordinated by block-causal attention for incremental streaming generation.
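To make the block-causal attention mentioned above concrete, here is a minimal sketch of one common formulation: tokens attend freely within their own block and to all earlier blocks, but never to later ones. The page does not specify Wan Streamer's actual block layout or whether attention inside a block is bidirectional, so the function name `block_causal_mask` and the block sizes below are assumptions for illustration only.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are grouped into consecutive blocks (for example, one streaming
    unit each). A token sees every token in its own block and in all earlier
    blocks, and nothing from later blocks.
    """
    # Map each token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical layout: three streaming units of 2, 3, and 2 tokens.
mask = block_causal_mask([2, 3, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row shows which positions a token can see. Because a block never depends on future blocks, earlier blocks can be computed and emitted before later input arrives, which is what makes incremental streaming possible under this kind of mask.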
To support natural audio-visual responsiveness, the entire stack is redesigned around streamability (causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling), enabling streaming units as short as 160 ms at 25 fps. Wan Streamer reaches roughly 200 ms model-side response latency, and about 550 ms total interaction latency once combined with 350 ms of bidirectional network latency, supporting sub-second duplex audio-visual communication.
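The latency figures above follow from simple arithmetic, sketched here as a sanity check. The values are the ones quoted in the text; the variable names are illustrative, and treating total latency as a plain sum of model-side and network latency mirrors how the text states it rather than a detailed measurement model.

```python
FPS = 25                   # frame rate quoted in the text
UNIT_MS = 160              # shortest streaming unit quoted in the text
MODEL_LATENCY_MS = 200     # approximate model-side response latency
NETWORK_LATENCY_MS = 350   # bidirectional network latency quoted in the text

frame_ms = 1000 // FPS                   # duration of one video frame
frames_per_unit = UNIT_MS // frame_ms    # frames covered by one streaming unit
total_ms = MODEL_LATENCY_MS + NETWORK_LATENCY_MS

print(f"{frame_ms} ms per frame, {frames_per_unit} frames per unit")
print(f"total interaction latency: about {total_ms} ms")
```

At 25 fps a frame lasts 40 ms, so a 160 ms streaming unit spans 4 frames, and 200 ms plus 350 ms gives the roughly 550 ms total cited above.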
How it compares
Real-time interaction systems split into two camps. Speech-only systems answer fast but produce no visible agent: there is no synchronized face, gaze, or motion. Audio-visual systems do render an avatar, but they are assembled from external ASR, language, TTS, and animation modules, which adds latency at every boundary, and most never report an end-to-end response number. Wan Streamer is the only model that delivers a synchronized audio + video response from one end-to-end Transformer, and it does so in well under a second.
Chart: Response latency in seconds (lower is better). Where two values are given, the first is model-side / first-response latency and the second is the total including network and pipeline.

End-to-end interaction loop (perceive → respond):
Wan Streamer (speech + video): 0.2 s model-side, 0.55 s total
GPT-4o Realtime (speech): 0.23 s model-side, ~0.8 s total
Doubao Voice (speech): 0.7 s model-side, ~1.0 s total
Gemini Live (speech): 1.2–3.6 s

Rendering stage only (external LLM / ASR / TTS excluded):
LPM 1.0 (render-only): ~0.35 s
OmniForcing (render-only): ~0.7 s
Hallo-Live (render-only): 0.94 s
StreamAvatar (render-only): ~1.2 s
The two groups measure different things. The top group consists of full end-to-end interaction loops: they perceive the user and produce a response, and only Wan Streamer also outputs video. The bottom group consists of avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so in under 0.6 s. Numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.
System | Perceives video | Outputs video | Full-duplex | End-to-end | Sub-1s response
Wan Streamer | ✓ | ✓ | ✓ | ✓ | ✓
Doubao Voice | ✓ | ✗ | ✓ | ✗ | ~
GPT-4o Realtime | ✓ | ✗ | ~ | ✗ | ✓
StreamAvatar | ~ | ✓ | ~ | ✗ | ✗
LPM 1.0 | ~ | ✓ | ✓ | ✗ | ~

Legend: ✓ yes · ~ partial / not disclosed · ✗ no
Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates, understanding and responding at the same time. Wan Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.
Agent demos
Each demo below is generated by the same model — a different person, voice, and scene. Open a clip to watch a prerecorded face-to-face interaction, with the user-side video shown in the corner. Clips are unedited model outputs rather than a live online...