Finding a Feedback Loop - by Pair Team and Rob Durst
Building the Technology Behind Whole-Person Care
Finding a Feedback Loop<br>Notes from shipping my first production agentic feature at Pair Team.
Pair Team and Rob Durst<br>Jun 22, 2026
Sitting down at my desk, hands on keyboard, an old engineering mantra from my days deep in the SF Bitcoin community reverberated through my mind:<br>“Don’t trust. Verify”<br>On the surface this felt like the antithesis of today’s vibe-based engineering culture. Yet, beyond the boisterous rhetoric is a set of practices quickly becoming industry standard for operating non-deterministic systems reliably at scale. Tasked with shipping my first agentic feature, I had a chance to learn what one of those practices actually looks like.<br>This is my experience.<br>A Simple Enrollment Call
As an entryway into various programs we offer here at Pair Team, our AI healthcare advocate Flora will conduct an outbound enrollment call. This call is simple. Within the funnel of our service, it sits near the top, operating as a mildly stateful conduit responsible for introducing a patient into our system.<br>Unsurprisingly, there are roughly four ways this call can go:<br>No answer
Voicemail
Patient answers and requests a callback
Patient answers and engages
When patients request a callback, Flora’s job is to verbally confirm that we will reach back out to them at an agreed-upon time. So, in my second week here at Pair Team, I took on the task of helping Flora do this well.<br>Prompts All the Way Down?
Given some callback time as an output from the call, it was pretty straightforward to then kick off another scheduled call at some future time via our background job processor, Resque. This bit was fairly trivial.
The more interesting part was the question of when to schedule the callback. Our partner ElevenLabs offers a call analysis data collection feature:<br>Data collection automatically extracts structured information from conversation transcripts using LLM-powered analysis. This enables you to capture valuable data points without manual processing, improving operational efficiency and data accuracy.
And so we can expand upon our design above to extract the callback time on the ElevenLabs side.
In practice, you navigate to the Analysis view in ElevenLabs and create a new Data Collection data point, which comprises an identifier, a data type, and a prompt.<br>Now on that last point, yes, to extract the requested callback time we do introduce another prompt! Coming from a traditional software background, this made me a bit uneasy. The conversation upstream was already non-deterministic, but until now nothing downstream depended on a structured reading of it. The extraction prompt here is the bridge: it takes a non-deterministic conversation and hands a deterministic scheduler something to act upon. That bridge is now wholly dependent on a prompt’s interpretation of the transcript.
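To make the "bridge" concrete, here is a minimal Python sketch of the deterministic side refusing anything it cannot schedule safely. Everything in it is an assumption for illustration: the `DataPoint` class, the `requested_callback_time` identifier, the prompt wording, and the choice of an ISO 8601 string as the output format are hypothetical, not the actual ElevenLabs configuration or Pair Team's code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DataPoint:
    """Hypothetical mirror of a data collection item: identifier, data type, prompt."""
    identifier: str
    data_type: str
    prompt: str


# Illustrative values only; the real identifier and prompt are not shown in this post.
CALLBACK_TIME = DataPoint(
    identifier="requested_callback_time",
    data_type="string",
    prompt=(
        "If the patient asked to be called back, return the agreed time as an "
        "ISO 8601 timestamp with a UTC offset. Otherwise return an empty string."
    ),
)


def parse_callback_time(raw: Optional[str], now: datetime) -> Optional[datetime]:
    """Return a timezone-aware datetime in the future, or None if the value is unusable."""
    if not raw or not raw.strip():
        return None  # nothing was extracted
    try:
        parsed = datetime.fromisoformat(raw.strip())
    except ValueError:
        return None  # free text such as "tomorrow afternoon"
    if parsed.tzinfo is None:
        return None  # refuse to guess the patient's timezone
    if parsed <= now:
        return None  # a callback in the past cannot be scheduled
    return parsed
```

A guard like this only catches malformed values. A well-formed timestamp that misreads the conversation passes straight through, which is exactly the gap the rest of this post is about.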
Many years ago I made the mistake, or rather the choice, of taking a course in college called Real Analysis. This infamously challenging theoretical math course more or less defeated me, yet the core ideas, at least in principle, stay with me to this day. On the first day of class the professor made a bold claim:<br>Here are the foundations on which this domain of math is based. If you disagree with its foundations, nothing I will teach you here in the next four months holds.
In any chain of reasoning, if the first link is wrong, what follows can’t be trusted either.<br>I believe we can leverage that core principle here:<br>If the extraction isn’t correct, our processing and subsequent rescheduling may also be incorrect.
So, how do we keep this part of the system honest? While this may seem like a low-blast-radius failure, it is not. For Pair Team, a missed callback is potentially somebody who needed care and didn’t get it.<br>Developing a Feedback Signal
At first I considered a regex check on the transcript. It seemed possible to deterministically search for a datetime alongside the phrase “call back”. However, the core question, “did the patient actually request a callback?”, is semantic, not syntactic. The set of all possible ways to make this request is likely too large to codify.<br>So, with regex ruled out, what else could we do? It turns out we can just introduce another prompt to keep our load-bearing prompt in check. That is a bit tongue in cheek, but I am serious. For this we will employ an LLM-as-Judge eval (see a nice post by Anthropic on evals).<br>As defined by Langfuse’s docs:<br>LLM-as-a-Judge is an evaluation methodology where an LLM is used to assess the quality of outputs produced by another LLM application. Instead of relying solely on human reviewers or simple heuristic metrics, you prompt a capable model (the “judge”) to score and reason about application outputs against defined criteria.
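To illustrate why the regex idea falls short, here is a small sketch of a naive syntactic check. The patterns and the sample utterances are made up for illustration and are not taken from real transcripts.

```python
import re

# Naive check: the phrase "call back" / "call me back" / "callback",
# plus something that looks like a clock time such as "3pm" or "9:30 am".
CALLBACK_PHRASE = re.compile(r"\bcall\s*(?:me\s+)?back\b", re.IGNORECASE)
CLOCK_TIME = re.compile(r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b", re.IGNORECASE)


def regex_says_callback(transcript: str) -> bool:
    """Return True when both the phrase and a clock time appear anywhere in the text."""
    return bool(CALLBACK_PHRASE.search(transcript) and CLOCK_TIME.search(transcript))


# The happy path works.
happy_path = "Can you call me back tomorrow at 3pm?"

# A genuine request with neither the phrase nor a clock time is missed.
missed_request = "I'm driving right now, try me again after lunch."

# A refusal that happens to contain both signals is flagged.
false_alarm = "Please don't call back. I told the last person at 9 am I'm not interested."

for text in (happy_path, missed_request, false_alarm):
    print(regex_says_callback(text), "-", text)
```

The second and third cases are the semantic gap in miniature: the pattern sees tokens, not intent.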
Given what context we have, our evaluation process will then look a bit like this with an implicit model...
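As a rough sketch of the LLM-as-Judge idea, and not a description of Pair Team's actual setup, a judge can be a function that takes the transcript and the extracted value, asks a model for a verdict, and parses the reply. The prompt wording, the `Verdict` type, and the `call_model` parameter are all hypothetical; `call_model` stands in for whatever model client is in use, so no particular vendor API is assumed.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical judge prompt; the real criteria would be tuned against reviewed calls.
JUDGE_PROMPT = """You are reviewing an automated extraction from a phone call.
Transcript:
{transcript}

Extracted callback time: {extracted}

Did the patient request a callback, and does the extracted value match what was agreed?
Reply with JSON only: {{"verdict": "pass" or "fail", "reason": "<one sentence>"}}"""


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


def judge_extraction(
    transcript: str,
    extracted: Optional[str],
    call_model: Callable[[str], str],
) -> Verdict:
    """Ask the judge model to grade one extraction and parse its JSON reply."""
    prompt = JUDGE_PROMPT.format(transcript=transcript, extracted=extracted or "(none)")
    raw = call_model(prompt)
    try:
        data = json.loads(raw)
        verdict = data["verdict"]
        reason = str(data.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # The judge is itself a prompt, so its output is not trusted blindly either.
        return Verdict(False, "judge returned unparseable output")
    return Verdict(verdict == "pass", reason)
```

Treating an unparseable reply as a failure is a deliberate choice in this sketch: it sends the call to a human rather than letting a silent judge error read as a pass. Because the model is injected, the function can be exercised with a stub such as `lambda prompt: '{"verdict": "pass", "reason": "times match"}'`.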