VideoFDB — Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Abstract

Natural human conversation is full-duplex and audio-visual: people simultaneously speak, listen, and signal through gaze, gesture, and affect. VideoFDB is the first benchmark to evaluate full-duplex audio-visual-to-audio-visual conversational agents — and we find that today's vision-speech models systematically miss the nonverbal turn.

Existing full-duplex benchmarks evaluate speech alone, while audio-visual benchmarks evaluate split-role or turn-based interaction. Neither captures the continuous, overlapping co-construction of meaning that defines natural dyadic conversation. VideoFDB closes that gap with 237 dyadic clips from real video calls spanning 11 nonverbal conversational dynamics, a taxonomy separating perception from generation, and a rubric-based LM-as-judge framework that scores agents along interpretable axes.

Evaluating leading open- and closed-source vision-speech agents, we find two systematic failure modes: captioning collapse (the model describes the user's appearance rather than conversing with them) and visual-stream ignorance (the audio-only and audio-visual outputs are paraphrases of each other). Cascaded speech-to-avatar pipelines preserve turn-yielding discipline but cannot insert nonverbal cues during the user's turn, and their median latencies of roughly 2.8–3.5 seconds (Table 2) lag well behind the 0.9 seconds of human ground truth.

To protect participant identity, we do not visualize clip samples on this page. The full evaluation dataset is available on Hugging Face.

Leaderboard

Best and second-best non-human entries are in bold and underlined, respectively. The Timing column reports the TOR-Alignment percentage followed by the median latency. Full per-dynamic breakdowns are in the paper appendix.

Perception

Model                       Fluency ↑   Conv. Flow ↑   Vis. Ground. ↑   Overall ↑   Timing ↑

Human reference             4.16        4.20           4.24             4.20        90% / 1400 ms

Closed-source full-duplex speech-vision (AV2A)
Gemini 2.5 Flash Native     3.33        2.81           3.37             3.17        72% / 3160 ms
Gemini 3.1 Flash Live       3.15        2.20           3.16             2.84        66% / 1720 ms
OpenAI gpt-realtime-mini    2.91        2.37           2.90             2.73        66% / 5320 ms
OpenAI gpt-realtime         2.72        2.50           3.02             2.75        72% / 5400 ms

Open-source full-duplex speech-vision (AV2A)
MiniCPM-o 4.5               3.03        3.54           3.63             3.40        73% / 720 ms
MiniOmni2                   0.65        1.37           1.54             1.19        64% / 3080 ms
VITA-1.5                    1.19        1.57           2.53             1.76        58% / 400 ms

Audio-only baselines (A2A; same agents, video withheld)
Gemini 2.5 Flash Native     3.35        2.98           3.17             3.17        73% / 2760 ms
Gemini 3.1 Flash Live       3.40        2.64           3.03             3.03        69% / 1240 ms
OpenAI gpt-realtime-mini    3.05        2.48           3.12             2.88        69% / 5000 ms
OpenAI gpt-realtime         2.93        2.37           3.59             2.97        67% / 4440 ms
MiniCPM-o 4.5               3.45        3.76           3.10             3.44        72% / 920 ms
MiniOmni2                   1.48        1.70           2.15             1.72        69% / 2760 ms
VITA-1.5                    1.62        1.37           3.02             2.00        61% / 800 ms

Table 1. Performance breakdown across Perception rubrics. AV2A and A2A runs are paired on the same clips to isolate the visual contribution.
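The paired AV2A/A2A design invites a simple per-axis comparison. The sketch below is illustrative only: it takes the three Perception rubric scores from Table 1 and subtracts each agent's audio-only score from its audio-visual score. The function names `visual_deltas` and `improved_with_video` are hypothetical and not part of any VideoFDB tooling, and the deltas are plain differences of the published table means, not a significance test.

```python
# Rubric scores transcribed from Table 1, in the order
# (Fluency, Conv. Flow, Vis. Ground.).
AXES = ("Fluency", "Conv. Flow", "Vis. Ground.")

AV2A = {
    "Gemini 2.5 Flash Native": (3.33, 2.81, 3.37),
    "Gemini 3.1 Flash Live": (3.15, 2.20, 3.16),
    "OpenAI gpt-realtime-mini": (2.91, 2.37, 2.90),
    "OpenAI gpt-realtime": (2.72, 2.50, 3.02),
    "MiniCPM-o 4.5": (3.03, 3.54, 3.63),
    "MiniOmni2": (0.65, 1.37, 1.54),
    "VITA-1.5": (1.19, 1.57, 2.53),
}

A2A = {
    "Gemini 2.5 Flash Native": (3.35, 2.98, 3.17),
    "Gemini 3.1 Flash Live": (3.40, 2.64, 3.03),
    "OpenAI gpt-realtime-mini": (3.05, 2.48, 3.12),
    "OpenAI gpt-realtime": (2.93, 2.37, 3.59),
    "MiniCPM-o 4.5": (3.45, 3.76, 3.10),
    "MiniOmni2": (1.48, 1.70, 2.15),
    "VITA-1.5": (1.62, 1.37, 3.02),
}


def visual_deltas(av2a, a2a):
    """Per-axis (AV2A minus A2A) score for every model present in both runs."""
    return {
        model: tuple(round(v - a, 2) for v, a in zip(av2a[model], a2a[model]))
        for model in av2a
        if model in a2a
    }


def improved_with_video(deltas, axis):
    """Models whose score on the named axis is higher with video than without."""
    idx = AXES.index(axis)
    return [model for model, d in deltas.items() if d[idx] > 0]


if __name__ == "__main__":
    deltas = visual_deltas(AV2A, A2A)
    for model, d in deltas.items():
        cells = "  ".join(f"{name}: {value:+.2f}" for name, value in zip(AXES, d))
        print(f"{model:26s} {cells}")
    print("Higher Vis. Ground. with video:", improved_with_video(deltas, "Vis. Ground."))
```

A delta near zero on every axis is consistent with the visual-stream ignorance described in the abstract, although table-level means alone cannot confirm it for individual clips.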

Generation

Model                       Fluency ↑   Dyadic Affect ↑   NV Cue Approp. ↑   Overall ↑   Timing ↑

Human ground truth          4.42        4.14              3.18               3.92        78% / 900 ms

Cascaded speech-to-avatar (A2AV)
Gemini 2.5 + Anam           3.48        3.21              1.71               2.80        44% / 2840 ms
Gemini 2.5 + Keyframe       3.43        2.60              1.13               2.39        31% / 3520 ms

Table 2. Performance breakdown across Generation rubrics. Cascade architecture caps nonverbal-cue production well below human ground truth.

Submit your model's results on VideoFDB.

Send your model's per-sample outputs to amritam@nvidia.com; we will score them and add a row to the leaderboard. An automated evaluation pipeline is coming soon to make submission easier and more accessible.

Contributions

A benchmark grounded in human communication science. 237 expert-annotated dyadic clips from natural video calls, spanning 11 conversational dynamics drawn from established interpersonal-communication research — turn-taking, backchannels, gaze aversion, adaptors, affect displays, and more.

A perception/generation taxonomy with interpretable rubrics. Perception axes score whether the agent reads the situation; generation axes score whether the agent's own audio-visual output coheres. Three rubric axes per category, each scored 0–5 by an LM judge with stable cross-judge agreement (77–89% within 1 point).

Next steps for research. We identify captioning collapse, visual-stream ignorance, and the lack of full-duplex avatar pipelines as key opportunities for future work in Interactive Agentic AI.
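The cross-judge agreement figure quoted above ("within 1 point") can be read as the share of samples on which two judges' 0–5 scores differ by at most one. The page does not spell out the exact computation, so the following is a minimal sketch under that assumption. The function name `within_one_agreement` and the example scores are hypothetical and are not VideoFDB data.

```python
def within_one_agreement(judge_a, judge_b, tolerance=1):
    """Fraction of items on which two judges' scores differ by at most `tolerance`.

    Assumes both lists hold scores for the same items in the same order.
    """
    if len(judge_a) != len(judge_b):
        raise ValueError("score lists must have the same length")
    if not judge_a:
        raise ValueError("need at least one scored item")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(judge_a, judge_b))
    return hits / len(judge_a)


if __name__ == "__main__":
    # Hypothetical 0-5 rubric scores from two LM judges on eight clips.
    judge_a = [4, 3, 5, 2, 0, 4, 3, 1]
    judge_b = [4, 4, 3, 2, 1, 5, 3, 3]
    print(f"within-1 agreement: {within_one_agreement(judge_a, judge_b):.0%}")
```

A tolerance-based measure like this is more forgiving than exact-match agreement, which suits an ordinal rubric where adjacent scores often reflect the same judgment.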

Motivation

Consider a brief pause in the middle of a sentence (Figure 1). An audio-only agent may treat it as a turn handoff and start speaking. But with both audio and video together, the same moment has more context: a shifted gaze and raised head can signal the user is still thinking, so the right response is to wait.

Figure 1. VideoFDB curates evaluation samples from natural two-person video calls and evaluates perception and generation across 11 dynamic categories.

What an agent does while the user is still speaking matters as much as what it says next and when. Most evaluations split dialogue into turns and focus on the response latency or...
