Building for Voice In, Visuals Out



Allen Pike



Flashes of brilliance, and the tyranny of latency.

May 31, 2026 • 7 min read

Recently, Andrej Karpathy argued that the ideal interaction pattern for AI models is voice in, visuals out:

Audio is the human-preferred input to AIs, but vision is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into brain.

The claim is that while “text in, markdown out” is how most people use LLMs today, what we should be building toward is a Jarvis-like mode where we primarily speak to AI – and it primarily responds with UI, video, or other visuals.

Let’s check in on where we’re at for both halves of this claim: visuals as output, and voice as input.

Visuals Out

Humans love looking at things!

While it can be convenient to be able to listen to our computers speak, waiting through a voice response feels kinda… ugh. You can increase the speaking rate, but fundamentally, the fastest way for a computer to give humans information is to display it.

We’re faster at reading text than we are at listening, but that’s just the start. There’s a good reason computers long ago evolved past text-only terminals: richer interfaces are often faster, clearer, nicer, and more useful. The power of human vision has facilitated a rich history of computers showing people stuff.

At first, LLMs weren’t great at producing visuals, often spending many tokens to produce half-baked results. However, Anthropic’s Thariq Shihipar recently wrote about how HTML is increasingly viable as an output format, supplanting Markdown for certain model responses. This is great, since HTML is a powerful way to show visuals.

Going beyond text can give us dynamic:

Hierarchy (sidebars, columns, navigation)

Exploration (drill-ins, filters, expansion)

Direct manipulation (scrolling, dragging)

Data visualizations (graphs, charts, dashboards)

Mockups and prototypes (show, not tell)

Illustrative images and video (pelicans, bicycles)

Thus the DOS era of AI begins to end.

While it will be a while before general-purpose agents consistently return compelling HTML in response to arbitrary requests, visual responses are already practical for vertical agents – it helps to do one thing well. Recent months have seen a noticeable uptick in AI features producing useful diagrams, charts, sliders, and so on.
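To make the “HTML instead of Markdown” idea concrete, here’s a minimal sketch of the kind of thing a vertical agent might emit: a small bar chart as plain HTML, rather than a Markdown table. The function name `bar_chart_html`, the inline styling, and the sample numbers are all my own illustrative assumptions, not anything from Thariq’s post or a real product.

```python
from html import escape


def bar_chart_html(title: str, data: dict[str, float], max_width_px: int = 300) -> str:
    """Render non-negative values as a horizontal bar chart in plain HTML.

    Hypothetical helper for illustration: the longest bar is scaled to
    max_width_px and every other bar is sized relative to it.
    """
    # Avoid dividing by zero when the data is empty or all zeros.
    peak = max(data.values(), default=0) or 1
    rows = []
    for label, value in data.items():
        width = round(max_width_px * value / peak)
        rows.append(
            '<div style="display:flex;align-items:center;gap:8px">'
            f'<span style="width:6em">{escape(label)}</span>'
            f'<div style="background:#4a7;height:1em;width:{width}px"></div>'
            f"<span>{value:g}</span></div>"
        )
    # Labels and title are escaped so model-provided text can't inject markup.
    return f"<section><h3>{escape(title)}</h3>{''.join(rows)}</section>"


# Made-up sample data, purely to show the call.
print(bar_chart_html("Signups by week", {"Week 1": 12, "Week 2": 30, "Week 3": 21}))
```

Even something this small shows relative sizes at a glance, which is exactly what a Markdown table makes you work for.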

So, yep. Visual output is a natural fit for AI, and we’re already going beyond plain text.

Voice In

On the other hand, most people are ambivalent about the idea of talking to AI. We were promised the Star Trek computer, or Jarvis, but so far we’ve gotten Siri and automated spam calls.

There’s merit to the skepticism. Fundamentally, voice is never going to be the only input mode for computers. Just as we sometimes need voice because our hands are occupied, other times it’s impractical to speak aloud for social or confidentiality reasons. And even when we can speak, voice alone isn’t enough – effective computer use will always require more precise inputs, such as mouse clicks and drags.

However, voice is a deeply human and useful input mode. For example, it’s excellent for getting out our not-yet-organized thoughts and observations. While ChatGPT voice mode is substantially dumber than its text mode, it can still be useful for organizing your thoughts – advanced rubber-ducking.

Compared to text, speech also contains additional nuance and detail.

Voice is not just words – it’s intonation, timing, tone, pitch, energy, and emphasis. Where a transcript would only see “okay”, how you voice the “okay” might convey “Sounds good!”, “Tell me more”, “I kind of doubt that”, or “Get the hell out of my office.” This is why we call somebody if we need to have an emotional conversation, rather than sending misinterpretable text messages.

We also speak more words per minute than we type. Combined with the additional detail our voices carry, we simply put out more information per second by speaking than by typing.

The Tyranny of Latency

So, great. Talking to AI and having it respond with visuals are both natural and highly useful. Why aren’t we doing this all the time?

If you’ve actually used AI voice systems, you’ve probably noticed that they’re usually slow, dumb, or both.

We’ve known since the 60s that to feel fast, computers should respond within about 100ms, and that to preserve users’ sense of flow, they need to respond within about 1000ms (1 second). Even before networks and giant neural nets, it could be a challenge to hit these bars.
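If you want to hold your own product to these bars, they’re easy to encode. Here’s a minimal sketch that buckets a measured response time using the two thresholds above. The bucket names and the `classify_latency` helper are my own labels, and the cutoffs are rules of thumb (“about” 100ms and 1000ms), not hard limits.

```python
from enum import Enum

# Approximate thresholds from the paragraph above, in milliseconds.
INSTANT_MS = 100   # at or below this, the response feels fast
FLOW_MS = 1000     # at or below this, the user keeps their sense of flow


class Feel(Enum):
    INSTANT = "instant"
    FLOWING = "flowing"
    INTERRUPTED = "interrupted"


def classify_latency(latency_ms: float) -> Feel:
    """Bucket a response time by how it is likely to feel to the user."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms <= INSTANT_MS:
        return Feel.INSTANT
    if latency_ms <= FLOW_MS:
        return Feel.FLOWING
    return Feel.INTERRUPTED


# Hypothetical measurements, just to show usage.
for ms in (80, 450, 2300):
    print(ms, classify_latency(ms).value)
```

A check like this is most useful in a test suite or dashboard, where a regression from “flowing” to “interrupted” is the kind of thing you want to catch before your users do.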

But voice AI adds a substantial new hurdle. Humans are more sensitive to lagged voice than we are to lagged visuals. For a fully fluid voice conversation with interruptions going both ways, the latency bar is about 200ms. More than that, and interruptions feel janky and annoying. You’ve...
