Finding a Feedback Loop: shipping my first prod agentic feature at Pair Team




Notes from shipping my first production agentic feature at Pair Team.

Pair Team and Rob Durst Jun 22, 2026

As I sat down at my desk, hands on keyboard, an old engineering mantra from my days deep in the SF Bitcoin community reverberated through my mind: “Don’t trust. Verify.” On the surface, this felt like the antithesis of today’s vibe-based engineering culture. Yet beyond the boisterous rhetoric is a set of practices quickly becoming industry standard for operating non-deterministic systems reliably at scale. Tasked with shipping my first agentic feature, I had a chance to learn what one of those practices actually looks like. This is my experience.

A Simple Enrollment Call

As an entryway into various programs we offer here at Pair Team, our AI healthcare advocate Flora will conduct an outbound enrollment call. This call is simple. Within the funnel of our service, it sits near the top, operating as a mildly stateful conduit responsible for introducing a patient into our system. Unsurprisingly, there are roughly four ways this call can go:

No answer

Voicemail

Patient answers and requests a callback

Patient answers and engages

When patients request a callback, Flora’s job is to verbally confirm that we will reach back out to them at an agreed-upon time. So, in my second week here at Pair Team, I took on the task of helping Flora do this well.

Prompts All the Way Down?

Given some callback time as an output from the call, it was pretty straightforward to then kick off another scheduled call at some future time via our background job processor, Resque. This bit was fairly trivial.

The more interesting part was this: how do we know when to schedule the callback? With our partner ElevenLabs, there is a call analysis data collection feature:

“Data collection automatically extracts structured information from conversation transcripts using LLM-powered analysis. This enables you to capture valuable data points without manual processing, improving operational efficiency and data accuracy.”

And so we can expand upon our design above to extract the callback time on the ElevenLabs side.

In practice, you navigate to the Analysis view in ElevenLabs and create a new Data Collection data point, which comprises an identifier, a data type, and a prompt. Now on that last point, yes, to extract the requested callback time we do introduce another prompt! Coming from a traditional software background, this made me a bit uneasy. The conversation upstream was already non-deterministic, but until now nothing downstream depended on a structured reading of it. The extraction prompt here is the bridge: it takes a non-deterministic conversation and hands a deterministic scheduler something to act upon. That bridge is now wholly dependent on a prompt’s interpretation of the transcript.
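One way to keep the deterministic side of that bridge strict is to validate whatever the extraction returns before the scheduler ever sees it. The following is a minimal sketch under stated assumptions, not the setup described in this post: `DataPoint` is a hypothetical local mirror of the three fields above (it is not an ElevenLabs API object), the `callback_time` identifier and the prompt wording are invented for illustration, and the sketch assumes the prompt asks for an ISO 8601 datetime with a UTC offset. The 30-day window is likewise an arbitrary illustrative limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class DataPoint:
    """Hypothetical local record of a data point: identifier, data type, prompt."""
    identifier: str
    data_type: str
    prompt: str


# Assumed identifier and prompt wording, for illustration only.
CALLBACK_TIME = DataPoint(
    identifier="callback_time",
    data_type="string",
    prompt=(
        "If the patient asked to be called back, return the agreed time as an "
        "ISO 8601 datetime with a UTC offset. Otherwise return an empty string."
    ),
)


def parse_callback_time(
    raw: Optional[str],
    now: datetime,
    max_ahead: timedelta = timedelta(days=30),
) -> Optional[datetime]:
    """Return a schedulable datetime, or None if the extraction is unusable.

    `now` must be timezone-aware. Anything that is empty, unparseable,
    timezone-naive, in the past, or too far in the future is rejected
    rather than guessed at.
    """
    if raw is None or not raw.strip():
        return None
    try:
        parsed = datetime.fromisoformat(raw.strip())
    except ValueError:
        return None
    if parsed.tzinfo is None:
        return None  # ambiguous without an offset, so refuse to schedule
    if parsed <= now or parsed - now > max_ahead:
        return None
    return parsed
```

A check like this only confirms that the output is well formed. It says nothing about whether the extracted time is what the patient actually asked for, which is the harder question.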

Many years ago I made the mistake (er, choice) of taking a course in college called Real Analysis. This infamously challenging theoretical math course more or less defeated me, yet the core ideas, at least in principle, stay with me to this day. On the first day of class the professor made a bold claim:

“Here are the foundations on which this domain of math is based. If you disagree with its foundations, nothing I will teach you here in the next four months holds.”

In any chain of reasoning, if the first link is wrong, what follows can’t be trusted either. I believe we can leverage that core principle here: If the extraction isn’t correct, our processing and subsequent rescheduling may also be incorrect.

So, how do we keep this part of the system honest? While this may seem like a low-blast-radius failure, it is not. For Pair Team, a missed callback is potentially somebody who needed care and didn’t get it.

Developing a Feedback Signal

At first I considered a regex check on the transcript. It seemed that it might be possible to just deterministically search for some datetime and the phrase “call back”. However, the core question, “did the patient actually request a callback?”, is semantic, not syntactic. The set of all possible ways to make this request is likely too large to codify.

And so, falling back, what else could we do? It turns out we can just introduce another prompt to keep our load-bearing prompt in check. While a bit tongue in cheek, I am serious. For this we will be employing an LLM-as-Judge eval (see a nice post by Anthropic on evals). As defined by Langfuse’s docs:

“LLM-as-a-Judge is an evaluation methodology where an LLM is used to assess the quality of outputs produced by another LLM application. Instead of relying solely on human reviewers or simple heuristic metrics, you prompt a capable model (the “judge”) to score and reason about application outputs against defined criteria.”
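Before turning to the judge, it helps to see concretely why the regex idea falls short. The pattern below is a hypothetical attempt written for this illustration, not something taken from the production system: it looks for “call back” or “call me back” followed by a clock time.

```python
import re

# Hypothetical pattern: "call back" / "call me back", then a time like "3pm" or "10:30 am".
CALLBACK_PATTERN = re.compile(
    r"\bcall\s+(?:me\s+)?back\b.*?\b(\d{1,2}(?::\d{2})?\s*(?:am|pm))",
    re.IGNORECASE,
)


def regex_detects_callback(utterance: str) -> bool:
    """Return True if the utterance matches the syntactic callback pattern."""
    return CALLBACK_PATTERN.search(utterance) is not None


# A textbook request: the pattern finds it.
clear_request = "Can you call me back at 3pm tomorrow?"

# A genuine request phrased differently: the pattern misses it.
missed_request = "I'm driving right now, try me again after lunch."

# Not a request at all, yet it matches the pattern.
false_alarm = "Please don't call back, and stop ringing me at 7am."
```

Here `regex_detects_callback` returns `True` for the first utterance, `False` for the second, and `True` for the third. The second case is a missed callback and the third is an unwanted one, and widening the pattern to fix either tends to make the other worse.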

Given what context we have, our evaluation process will then look a bit like this with an implicit model...
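A minimal sketch of what such an eval could look like in code follows. It is an illustration under assumptions rather than the eval used in this post: the prompt wording, the `Verdict` type, the pass/fail JSON contract, and the `call_model` parameter are all hypothetical. `call_model` stands in for whichever LLM client is used, so the sketch makes no claims about any vendor's API and can be exercised with a stub.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical judge prompt. Doubled braces survive str.format as literal braces.
JUDGE_PROMPT = """You are reviewing a data extraction from a phone call transcript.

Transcript:
{transcript}

Extracted callback time: {extracted}

Did the patient actually request a callback, and does the extracted time match
what was agreed? Reply with JSON only:
{{"verdict": "pass" or "fail", "reasoning": "<one sentence>"}}
"""


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reasoning: str


def build_judge_prompt(transcript: str, extracted: Optional[str]) -> str:
    """Fill the judge prompt with the transcript and the extraction under review."""
    return JUDGE_PROMPT.format(transcript=transcript, extracted=extracted or "(none)")


def judge_extraction(
    transcript: str,
    extracted: Optional[str],
    call_model: Callable[[str], str],
) -> Verdict:
    """Ask the judge model to grade one extraction.

    Fails closed: a response that cannot be parsed counts as a failure, so it
    gets flagged for review instead of silently passing.
    """
    raw = call_model(build_judge_prompt(transcript, extracted))
    try:
        data = json.loads(raw)
        verdict = data["verdict"]
        reasoning = str(data.get("reasoning", ""))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return Verdict(passed=False, reasoning="unparseable judge response")
    if verdict not in ("pass", "fail"):
        return Verdict(passed=False, reasoning="unparseable judge response")
    return Verdict(passed=(verdict == "pass"), reasoning=reasoning)
```

Taking the model call as a plain function keeps the grading logic testable without network access, and failing closed reflects the earlier point that a missed callback is the costly error here.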
