VideoFDB — Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Abstract
Natural human conversation is full-duplex and audio-visual: people simultaneously speak, listen, and signal through gaze, gesture, and affect. VideoFDB is the first benchmark to evaluate full-duplex audio-visual-to-audio-visual conversational agents — and we find that today's vision-speech models systematically miss the nonverbal turn.
Existing full-duplex benchmarks evaluate speech alone, while audio-visual benchmarks evaluate split-role or turn-based interaction. Neither captures the continuous, overlapping co-construction of meaning that defines natural dyadic conversation. VideoFDB closes that gap with 237 dyadic clips from real video calls spanning 11 nonverbal conversational dynamics, a taxonomy separating perception from generation, and a rubric-based LM-as-judge framework that scores agents along interpretable axes.
Evaluating leading open- and closed-source vision-speech agents, we find two systematic failure modes: captioning collapse (the model describes the user's appearance rather than conversing with them) and visual-stream ignorance (the audio-only and audio-visual outputs are paraphrases of each other). Cascaded speech-to-avatar pipelines preserve turn-yielding discipline but cannot insert nonverbal cues during the user's turn, with latencies 2.8–3.5 seconds behind human ground truth.
To protect participant identity, we do not visualize clip samples on this page. The full evaluation dataset is available on Hugging Face.
Leaderboard
Best and second-best non-human entries are in bold and underlined, respectively. The Timing column reports the TOR-Alignment percentage followed by the median latency. Full per-dynamic breakdowns are in the paper appendix.
| Model | Fluency ↑ | Conv. Flow ↑ | Vis. Ground. ↑ | Overall ↑ | Timing ↑ |
|---|---|---|---|---|---|
| Human reference | 4.16 | 4.20 | 4.24 | 4.20 | 90% / 1400 ms |
| Closed-source full-duplex speech-vision (AV2A) | | | | | |
| Gemini 2.5 Flash Native | 3.33 | 2.81 | 3.37 | 3.17 | 72% / 3160 ms |
| Gemini 3.1 Flash Live | 3.15 | 2.20 | 3.16 | 2.84 | 66% / 1720 ms |
| OpenAI gpt-realtime-mini | 2.91 | 2.37 | 2.90 | 2.73 | 66% / 5320 ms |
| OpenAI gpt-realtime | 2.72 | 2.50 | 3.02 | 2.75 | 72% / 5400 ms |
| Open-source full-duplex speech-vision (AV2A) | | | | | |
| MiniCPM-o 4.5 | 3.03 | 3.54 | 3.63 | 3.40 | 73% / 720 ms |
| MiniOmni2 | 0.65 | 1.37 | 1.54 | 1.19 | 64% / 3080 ms |
| VITA-1.5 | 1.19 | 1.57 | 2.53 | 1.76 | 58% / 400 ms |
| Audio-only baselines (A2A; same agents, video withheld) | | | | | |
| Gemini 2.5 Flash Native | 3.35 | 2.98 | 3.17 | 3.17 | 73% / 2760 ms |
| Gemini 3.1 Flash Live | 3.40 | 2.64 | 3.03 | 3.03 | 69% / 1240 ms |
| OpenAI gpt-realtime-mini | 3.05 | 2.48 | 3.12 | 2.88 | 69% / 5000 ms |
| OpenAI gpt-realtime | 2.93 | 2.37 | 3.59 | 2.97 | 67% / 4440 ms |
| MiniCPM-o 4.5 | 3.45 | 3.76 | 3.10 | 3.44 | 72% / 920 ms |
| MiniOmni2 | 1.48 | 1.70 | 2.15 | 1.72 | 69% / 2760 ms |
| VITA-1.5 | 1.62 | 1.37 | 3.02 | 2.00 | 61% / 800 ms |
Table 1. Performance breakdown across Perception rubrics. AV2A and A2A runs are paired on the same clips to isolate the visual contribution.
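Because the AV2A and A2A runs are paired on the same clips, the visual contribution can be read off as a per-model difference between the two rows. The snippet below is a minimal sketch of that arithmetic, using the scores transcribed from Table 1. The names `AV2A`, `A2A`, `AXES`, and `paired_deltas` are illustrative and not part of any released VideoFDB tooling.

```python
# Scores transcribed from Table 1, in the order:
# (Fluency, Conv. Flow, Vis. Ground., Overall).
AXES = ("Fluency", "Conv. Flow", "Vis. Ground.", "Overall")

AV2A = {
    "Gemini 2.5 Flash Native": (3.33, 2.81, 3.37, 3.17),
    "Gemini 3.1 Flash Live": (3.15, 2.20, 3.16, 2.84),
    "OpenAI gpt-realtime-mini": (2.91, 2.37, 2.90, 2.73),
    "OpenAI gpt-realtime": (2.72, 2.50, 3.02, 2.75),
    "MiniCPM-o 4.5": (3.03, 3.54, 3.63, 3.40),
    "MiniOmni2": (0.65, 1.37, 1.54, 1.19),
    "VITA-1.5": (1.19, 1.57, 2.53, 1.76),
}

A2A = {
    "Gemini 2.5 Flash Native": (3.35, 2.98, 3.17, 3.17),
    "Gemini 3.1 Flash Live": (3.40, 2.64, 3.03, 3.03),
    "OpenAI gpt-realtime-mini": (3.05, 2.48, 3.12, 2.88),
    "OpenAI gpt-realtime": (2.93, 2.37, 3.59, 2.97),
    "MiniCPM-o 4.5": (3.45, 3.76, 3.10, 3.44),
    "MiniOmni2": (1.48, 1.70, 2.15, 1.72),
    "VITA-1.5": (1.62, 1.37, 3.02, 2.00),
}


def paired_deltas(av2a, a2a):
    """Return {model: {axis: AV2A score - A2A score}}, rounded to 2 decimals.

    A positive value means the model scored higher when it could see the video.
    """
    return {
        model: {
            axis: round(with_video - audio_only, 2)
            for axis, with_video, audio_only in zip(AXES, av2a[model], a2a[model])
        }
        for model in av2a
    }


if __name__ == "__main__":
    for model, deltas in paired_deltas(AV2A, A2A).items():
        cells = "  ".join(f"{axis}: {deltas[axis]:+.2f}" for axis in AXES)
        print(f"{model:26s} {cells}")
```

With these numbers, for example, MiniCPM-o 4.5 gains 0.53 on Vis. Ground. when video is available, while OpenAI gpt-realtime loses 0.57 on the same axis.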
| Model | Fluency ↑ | Dyadic Affect ↑ | NV Cue Approp. ↑ | Overall ↑ | Timing ↑ |
|---|---|---|---|---|---|
| Human ground truth | 4.42 | 4.14 | 3.18 | 3.92 | 78% / 900 ms |
| Cascaded speech-to-avatar (A2AV) | | | | | |
| Gemini 2.5 + Anam | 3.48 | 3.21 | 1.71 | 2.80 | 44% / 2840 ms |
| Gemini 2.5 + Keyframe | 3.43 | 2.60 | 1.13 | 2.39 | 31% / 3520 ms |
Table 2. Performance breakdown across Generation rubrics. Cascade architecture caps nonverbal-cue production well below human ground truth.
Submit your model's results on VideoFDB.
Contact us at amritam@nvidia.com with your model's per-sample outputs and we'll score them and produce a leaderboard row. We'll soon release an automated evaluation pipeline to make submission easier and more accessible.
Contributions
A benchmark grounded in human communication science. 237 expert-annotated dyadic clips from natural video calls, spanning 11 conversational dynamics drawn from established interpersonal-communication research: turn-taking, backchannels, gaze aversion, adaptors, affect displays, and more.
A perception/generation taxonomy with interpretable rubrics. Perception axes score whether the agent reads the situation; generation axes score whether the agent's own audio-visual output coheres. Three rubric axes per category, each scored 0–5 by an LM judge with stable cross-judge agreement (77–89% within 1 point).
Next steps for research. We identify captioning collapse, visual-stream ignorance, and the lack of full-duplex avatar pipelines as key opportunities for future work in Interactive Agentic AI.
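The cross-judge agreement figure quoted for the rubrics ("within 1 point") can be illustrated with a short sketch. This page does not specify how agreement is computed, so the code assumes it is the fraction of samples on which two judges' 0–5 scores differ by at most one point. The function name `within_k_agreement` and the example scores are hypothetical and are not VideoFDB data.

```python
def within_k_agreement(scores_a, scores_b, k=1):
    """Fraction of paired scores whose absolute difference is at most k.

    Assumes both lists hold rubric scores for the same samples in the
    same order, one score per sample from each judge.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both judges must score the same samples")
    if not scores_a:
        raise ValueError("at least one scored sample is required")
    hits = sum(abs(a - b) <= k for a, b in zip(scores_a, scores_b))
    return hits / len(scores_a)


# Made-up scores on the 0-5 rubric scale, for illustration only.
judge_a = [4, 3, 5, 2, 0, 3, 4, 1]
judge_b = [4, 4, 3, 2, 1, 3, 2, 1]

if __name__ == "__main__":
    print(f"within 1 point: {within_k_agreement(judge_a, judge_b):.0%}")
    print(f"exact match:    {within_k_agreement(judge_a, judge_b, k=0):.0%}")
```

A tolerance of one point is a common choice for ordinal rubrics because adjacent scores often reflect judge calibration rather than real disagreement. Setting `k=0` gives the stricter exact-match rate.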
Motivation
Consider a brief pause in the middle of a sentence (Figure 1). An audio-only agent may treat it as a turn handoff and start speaking. But with both audio and video together, the same moment has more context: a shifted gaze and raised head can signal the user is still thinking, so the right response is to wait.
Figure 1. VideoFDB curates evaluation samples from natural two-person video calls and evaluates perception and generation across 11 dynamic categories.
What an agent does while the user is still speaking matters as much as what it says next and when. Most evaluations split dialogue into turns and focus on the response latency or...