Turning spoken commands into JSON tool calls on iPhones

Speech-to-tool pipeline performance measured | WildEdge Blog


22 June 2026 · Piotr Duda, Wojciech Kedzierski, Damian Kolakowski

Voice interfaces feel good only when the action happens quickly. For dictation, users tolerate some delay because the output is long-form text. For tool calls, the expected output is small: start a timer, create a reminder, change a setting, trigger a workflow. A few seconds of latency can make the interaction feel heavy and artificial.

We ran a benchmark inside an iOS app to compare two ways of turning spoken intent into a structured tool call:

- Direct speech-to-tool: pass audio to one model and ask it to produce the tool call.
- Two-step speech-to-text, then text-to-tool: transcribe the audio first, then pass the transcript to a small text model that returns the tool call.

The practical question was simple: if the feature has to run on device, which path gets to a valid JSON tool call faster?

The benchmark

The benchmark used the WildEdge Swift SDK to report processing-time telemetry from the app. WildEdge remote configuration handled prompt and model selection between benchmark runs, which let us compare paths without rebuilding the TestFlight app.
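What counts as a "valid JSON tool call" depends on the app's schema, which the post does not show. As a minimal sketch, assuming a hypothetical format with a `name` string and an `arguments` object, a validity check might look like the following. The tool names, fields, and the `parse_tool_call` helper are illustrative and not taken from the benchmark.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
# These entries are assumptions for illustration, not the benchmark's schema.
TOOLS = {
    "start_timer": {"seconds": int},
    "create_reminder": {"title": str},
}


def parse_tool_call(raw: str):
    """Return the parsed tool call if it is valid, otherwise None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model output was not JSON at all
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    spec = TOOLS.get(name)
    if spec is None:
        return None  # unknown tool
    for key, expected_type in spec.items():
        # Strict type check so that, for example, "300" is not accepted as an int.
        if type(args.get(key)) is not expected_type:
            return None
    return call
```

A check like this lets a latency run count only outputs that the app could actually execute, rather than any string the model returns.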

The test set was intentionally narrow:

- 18 .m4a recordings across two voices
- 9 short command cases
- 3 voice-to-action use cases
- English input from non-native English speakers
- Expected JSON tool-call output for each case

Download the benchmark audio dataset. The dataset contains 18 short .m4a command recordings. In total, it covers 60.203 seconds of audio and 981.3 kB of files. The average clip length is 3.345 seconds, and the average clip size is 54.5 kB.

This was primarily a latency benchmark, not a full model accuracy evaluation. Proper accuracy evaluation would require a much larger and more varied dataset. In this narrow test set, output accuracy was similar across models and close to 100% for most cases because the commands were intentionally simple.

Some other constraints matter:

- No streaming. Each run starts after the full speech file is available.
- Audio conversion happens outside the measured interval. The accepted input format was a WAV container with Linear PCM audio, 16,000 Hz sample rate, one mono channel, and 16-bit integer samples.
- No model fine-tuning was involved; task-specific fine-tuning may change final latency results by reducing prompt and schema-handling overhead. For more context, see "Let’s build an on-device voice agent".
- Text-to-tool prompts are lightly adapted per model while staying similar in length.
- The combined two-step results below are summed stage medians, not full paired end-to-end runs.

Several techniques can reduce perceived latency or time to first token: streaming input, partial decoding, voice activity detection, speculative execution, and overlapping pipeline stages. We did not evaluate those here. For this benchmark, we intentionally provided complete recordings first, then measured raw processing time across different models and devices.

Approaches compared

The charts below show the shape of each pipeline. They are normalized stage diagrams, not the measured benchmark medians; the measured results are reported in the sections below.

Direct speech-to-tool

The model receives speech input and directly produces the tool call.
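The speech input here is the converted WAV file described in the constraints above. As a minimal sketch, assuming the conversion has already produced a file on disk, Python's standard `wave` module can confirm that a file matches the accepted format before it is handed to a model. The function name is hypothetical, and this is offline tooling rather than part of the benchmark app.

```python
import wave


def matches_benchmark_format(path: str) -> bool:
    """Check for a WAV file with 16 kHz, mono, 16-bit Linear PCM audio."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getframerate() == 16_000          # 16,000 Hz sample rate
                and wav.getnchannels() == 1           # one mono channel
                and wav.getsampwidth() == 2           # 2 bytes = 16-bit samples
                and wav.getcomptype() == "NONE"       # uncompressed PCM
            )
    except (wave.Error, EOFError, OSError):
        # Not a readable PCM WAV file (wrong container, truncated, or missing).
        return False
```

Running a check like this over the converted clips keeps format problems out of the measured interval, since a mismatched file is rejected before any timing starts.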

This avoids an explicit intermediate transcript step, which may reduce latency and simplify the pipeline. The tradeoff is that the app needs multimodal speech-to-tool capability, which is more complex to package than a simple text-only llama.cpp setup.

Speech-to-text, then text-to-tool

The speech input is first transcribed into text. That text is then passed to a second step that generates the tool call.
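As noted in the constraints, the combined two-step results are summed stage medians rather than paired end-to-end runs. The two can differ, which the sketch below illustrates. The per-clip timings are made-up integers in milliseconds, chosen only to show the effect; they are not benchmark measurements, and both function names are hypothetical.

```python
from statistics import median


def summed_stage_medians(stt_ms, ttt_ms):
    """Median of each stage taken independently, then added together."""
    return median(stt_ms) + median(ttt_ms)


def paired_end_to_end_median(stt_ms, ttt_ms):
    """Median of per-clip totals, where each clip's two stages are paired."""
    return median(s + t for s, t in zip(stt_ms, ttt_ms))


# Illustrative timings only (milliseconds), one entry per clip.
stt_ms = [300, 400, 900]   # speech-to-text stage
ttt_ms = [800, 300, 200]   # text-to-tool stage

print(summed_stage_medians(stt_ms, ttt_ms))      # 700
print(paired_end_to_end_median(stt_ms, ttt_ms))  # 1100
```

When slow transcriptions and slow tool-call generations fall on different clips, as in this toy data, the summed medians come out lower than the paired median. That is why the combined numbers are best read as an approximation of end-to-end latency.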

This approach may be easier to debug and inspect because the app does not need to run a multimodal speech-to-tool model. The first stage produces plain text, and the second stage can use a smaller text-to-tool model. The tradeoff is that the app now has two stages to orchestrate.

Hardware matters

We used iPhone 16 Pro as the primary benchmark device, then ran a smaller hardware sweep to understand how the direct speech-to-tool path changes across older Apple hardware. For this baseline, LeapSDK loaded Liquid's LFM2 Audio 1.5B model. The device order below is oldest to newest: iPhone 11, iPhone 12, iPhone 13 mini, iPhone 13 Pro Max, and iPhone 16 Pro.

The largest jump was from iPhone 11 to iPhone 12. Median direct speech-to-tool latency dropped from 17.39 seconds on iPhone 11 to 2.61 seconds on iPhone 12, a roughly 6.7x improvement. On iPhone 16 Pro, the same LeapSDK direct LFM speech-to-tool path produced valid tool-call JSON in a median of 1.36 seconds. That is inside our rough 1.5-second practical line for a voice-to-action feature, though the preferred target for this interaction is still sub-second. LeapSDK-reported token throughput showed the same hardware story:

Throughput...
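The latency comparison in this section reduces to simple arithmetic on the reported medians. The sketch below uses only the three figures quoted above and treats "sub-second" as a 1.0-second threshold, which is our reading of the text rather than a number stated by the benchmark.

```python
# Median direct speech-to-tool latencies reported in this section, in seconds.
MEDIAN_LATENCY_S = {
    "iPhone 11": 17.39,
    "iPhone 12": 2.61,
    "iPhone 16 Pro": 1.36,
}

PRACTICAL_LINE_S = 1.5    # rough practical line for voice-to-action
PREFERRED_TARGET_S = 1.0  # assumption: "sub-second" read as under 1.0 s

speedup = MEDIAN_LATENCY_S["iPhone 11"] / MEDIAN_LATENCY_S["iPhone 12"]
print(f"iPhone 11 -> iPhone 12 speedup: {speedup:.1f}x")  # 6.7x

for device, latency in MEDIAN_LATENCY_S.items():
    within_practical = latency <= PRACTICAL_LINE_S
    sub_second = latency < PREFERRED_TARGET_S
    print(f"{device}: {latency:.2f} s, practical={within_practical}, sub-second={sub_second}")
```

Of the three devices listed here, only iPhone 16 Pro falls under the 1.5-second line, and none reaches the sub-second target.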
