Teaching Codex to Test a Voice-First Calendar App

24 May, 2026

AI-generated entry. See What & Why for context.

I have been working on version 2.0 of KIN, a voice-first shared family calendar.

The core interaction is deliberately simple: hold to talk, say something like “add soccer practice tomorrow at 5”, and KIN turns that into a calendar event for the family.

That is a nice interaction for humans.

It is an annoying interaction for tests.

Normal UI automation is good at tapping buttons and typing text. It is much worse at pretending to be a person who presses and holds a microphone button, speaks into the iOS Simulator, waits for transcription, waits for an AI-backed calendar operation, and then verifies that the right event was actually created.

The version that finally worked combined a few pieces:

Codex writes and operates the harness

FlowDeck builds, runs, tests, and reads simulator logs

Loopback turns generated audio into simulator microphone input

XCUITest controls the exact hold-to-talk timing

the local backend handles the real transcription and calendar inference path

Once those pieces were in place, I could run a local end-to-end test that creates a real event in KIN by injecting audio into the iOS Simulator.

The problem

KIN is not a form with a microphone icon attached to it.

The voice path is the product path:

The user long-presses the voice bar.

The app starts recording.

The audio goes through transcription.

The transcript goes to the calendar AI backend.

The backend mutates calendar state.

The app shows the result.

If I mock the transcript, I am not testing the microphone path.

If I mock the backend, I am not testing the calendar agent.

If I only test the backend, I am not testing whether the iOS app actually records and sends audio correctly.

What I wanted was a local test that exercised the same path a user takes, without requiring me to sit there and repeat the same sentence into my laptop ten times.

Why this moved out of Maestro

I still use Maestro for ordinary UI smoke flows. It is good for that.

Voice input has one awkward constraint, though: the press duration matters.

KIN starts recording while the user holds the voice bar and stops when the press ends. If the utterance is three seconds long, the test needs to hold for a little more than three seconds. If the utterance is six seconds long, the test needs a different hold duration.

For this particular job, Maestro’s long press was too rigid: its duration could not follow the length of each utterance.

XCUITest gives me the one API I needed:

press(forDuration:)

That one call is why the voice tests moved into XCUITest. The test can hold the real voice_assistant_bar for exactly as long as the audio fixture needs.

The audio trick

The key was to stop treating the simulator as a magical testing object and start treating it as another Mac app with an audio input menu.

Loopback creates a virtual audio device on macOS. I created a device named Loopback Audio with a pass-through source. Then the runner does two things:

sets macOS output to Loopback Audio

sets macOS input to Loopback Audio

Now anything I play from the Mac can also appear as microphone input.

The simulator still has to use that microphone. FlowDeck opens Simulator, and an AppleScript helper selects the same device from:

Simulator -> I/O -> Audio Input -> Loopback Audio

At that point, the path is:

/usr/bin/afplay -> macOS output -> Loopback Audio -> Simulator mic -> KIN

For generated tests, the audio file starts as text:

/usr/bin/say -o utterance.aiff -- "Add loopback single amber river tomorrow at 3 PM."

That part is surprisingly easy. macOS ships with the built-in say utility, and it can write spoken audio directly from a string at test time. The runner does not need a library of pre-recorded fixtures for every happy-path command. It can generate the phrase, measure the resulting AIFF file, route it through Loopback, and use a unique marker phrase for the database assertion.
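As a rough sketch of that generation step, the Python below builds a marker phrase, turns it into the spoken sentence, and assembles the say command shown above. The word list and the function names are hypothetical and not taken from the KIN runner. Only the /usr/bin/say invocation and the example sentence come from the text.

```python
import random
import subprocess
from pathlib import Path

SAY = "/usr/bin/say"

# Hypothetical word list. Plain dictionary words tend to survive
# text-to-speech and transcription better than random IDs would.
MARKER_WORDS = ["amber", "river", "copper", "meadow", "violet", "harbor"]


def make_marker(rng: random.Random, count: int = 2) -> str:
    """Pick `count` distinct words to act as a searchable marker phrase."""
    return " ".join(rng.sample(MARKER_WORDS, count))


def build_utterance(title: str, when: str) -> str:
    """Build the sentence that will be spoken into the simulator."""
    return f"Add {title} {when}."


def say_command(text: str, out_path: Path) -> list:
    """Return the argv for writing `text` as spoken audio to `out_path`."""
    # "--" ends option parsing, so text starting with "-" is not read as a flag.
    return [SAY, "-o", str(out_path), "--", text]


def generate_audio(text: str, out_path: Path) -> None:
    """Run say (macOS only) and fail loudly if it exits non-zero."""
    subprocess.run(say_command(text, out_path), check=True)
```

On a Mac, calling generate_audio(build_utterance("loopback single amber river", "tomorrow at 3 PM"), Path("utterance.aiff")) would reproduce the command above. The marker can then be searched for in the database once the event has been created.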

For more realistic tests later, the same harness can use a recorded fixture:

KIN_VOICE_AUDIO_FIXTURE=/absolute/path/to/sample.wav

The app does not know the difference. From KIN’s point of view, someone spoke into the microphone.
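One way a runner could choose between a recorded fixture and generated speech is sketched below. Only the KIN_VOICE_AUDIO_FIXTURE variable name comes from the text. The function name and the rule that a relative path is rejected are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

FIXTURE_ENV = "KIN_VOICE_AUDIO_FIXTURE"


def resolve_fixture(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the recorded fixture path, or None to fall back to generated audio."""
    raw = env.get(FIXTURE_ENV, "").strip()
    if not raw:
        return None
    path = Path(raw)
    # Assumption: insist on an absolute path so the result does not depend
    # on whichever directory the runner happens to be started from.
    if not path.is_absolute():
        raise ValueError(f"{FIXTURE_ENV} must be an absolute path, got {raw!r}")
    return path
```

When the function returns None, the runner would generate the audio from text instead. Either way the rest of the pipeline receives a single audio file to measure and play.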

The local path

Timing the long press

This is the part that made the test feel real instead of lucky.

The runner measures the audio duration:

/usr/bin/afinfo utterance.aiff

Then it computes:

holdDuration = audioDuration + 1.25 seconds

The extra time gives the app room to start recording and finish ingesting the last bit of audio.
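The measurement and the padding can be sketched in a few lines of Python. This assumes afinfo prints a line of the form "estimated duration: 3.250000 sec", which is how its output usually looks but is worth confirming on your own machine. The function names are hypothetical. The 1.25 second padding is the value given above.

```python
import re
import subprocess

AFINFO = "/usr/bin/afinfo"
HOLD_PADDING_SECONDS = 1.25  # padding value quoted in the post

# Assumed afinfo line format: "estimated duration: 3.250000 sec"
_DURATION_RE = re.compile(r"estimated duration:\s*([0-9]+(?:\.[0-9]+)?)\s*sec")


def parse_duration(afinfo_output: str) -> float:
    """Extract the clip length in seconds from afinfo's text output."""
    match = _DURATION_RE.search(afinfo_output)
    if match is None:
        raise ValueError("no 'estimated duration' line found in afinfo output")
    return float(match.group(1))


def hold_duration(audio_seconds: float, padding: float = HOLD_PADDING_SECONDS) -> float:
    """holdDuration = audioDuration + padding."""
    return audio_seconds + padding


def measure(path: str) -> float:
    """Run afinfo (macOS only) on `path` and return its duration in seconds."""
    result = subprocess.run([AFINFO, path], check=True, capture_output=True, text=True)
    return parse_duration(result.stdout)
```

For a 3.25 second clip, hold_duration gives 4.5 seconds, which is the value the UI test would pass to press(forDuration:).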

The XCUITest does not play audio itself. Instead, the runner starts a tiny local helper server with three endpoints:

/health

/config

/play

XCUITest loads /config, presses the voice bar, calls /play, and keeps holding until holdDuration has elapsed.
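A helper like that fits in a short standard-library script. The sketch below is not the actual KIN helper. It assumes all three endpoints answer GET requests with JSON, and it takes the playback action as a callable so the server itself stays independent of afplay.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_server(config: dict, play: Callable[[], None], port: int = 0) -> ThreadingHTTPServer:
    """Build a localhost-only server exposing /health, /config, and /play."""

    class Handler(BaseHTTPRequestHandler):
        def _send_json(self, status: int, payload: dict) -> None:
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_GET(self) -> None:
            if self.path == "/health":
                self._send_json(200, {"ok": True})
            elif self.path == "/config":
                self._send_json(200, config)
            elif self.path == "/play":
                # Play in the background so the response returns at once
                # and the UI test can keep holding the press.
                threading.Thread(target=play, daemon=True).start()
                self._send_json(202, {"started": True})
            else:
                self._send_json(404, {"error": "not found"})

        def log_message(self, format: str, *args) -> None:
            pass  # keep test output quiet

    # Port 0 asks the OS for any free port; read it from server_address.
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

In a real run, play could be something like lambda: subprocess.run(["/usr/bin/afplay", "utterance.aiff"], check=True), and config would carry holdDuration so the UI test knows how long to keep pressing. Starting playback on a background thread is a design choice: the request returns immediately instead of blocking for the length of the clip.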

The helper waits briefly before playing audio:

playbackDelay = 0.45 seconds

That delay...
