The Four Signals of AI Observability
A few months ago we shipped a chat experience to production. Users ask a<br>question, our app routes it through an LLM, the model calls a few internal<br>tools, and an answer comes back.
It worked. Sort of.
When the model answered well, we had no idea why. When it answered badly, we had<br>no idea either. The model was a black box attached to our app, and our best<br>debugging tool was reading logs and guessing.
We realized our app could not answer a very normal operational question:
Show us every chat where the user said the answer was bad, group them by which<br>version of the system prompt was loaded, and let us read the whole<br>conversation, including which tools the model called.
It’s the AI equivalent of “show me every 500 error on this endpoint after deploy X.”<br>But our app couldn’t answer it.
That was the trigger to stop looking for a smarter model and start looking to<br>add an observability layer. We ended up using Langfuse, but the specific vendor<br>matters less than the capabilities. Helicone, Arize Phoenix, LangSmith, and<br>Braintrust all solve versions of the same problem.
After a couple of months of iteration, we noticed that the things we needed came<br>in four flavors. I call them the four signals that every AI feature needs to<br>emit about itself.
A version on every prompt. Which exact words did the model see today?
A trace shaped like the actual work. What did it call, in what order,<br>with what arguments?
A score from the user. Did the human like the result?
A score from another model. When the human is quiet, who is grading?
Of course we can build an AI feature without all four. We just can’t improve it on purpose.
A version on every prompt
The first thing we did was move every prompt out of the code and into a<br>versioned store the app fetches at runtime.
# The code never references a version. It asks for a label.<br>template = PromptRepo.compile(name: "classify_question", label: "production")
# A human moves "production" between versions in the Langfuse UI.<br># Promotion is a click. Rollback is a click. No deploy.
The first time we rolled back a bad prompt by clicking a button instead of reverting a PR and waiting for CI, we knew this was the right shape.
Once prompts became content, the people closest to the problem became the people writing the prompts.<br>The feedback loop got much shorter, and the quality went up.
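To make the label mechanism concrete, here is a minimal in-memory sketch of the idea. It is not how Langfuse or our `PromptRepo` is implemented; `InMemoryPromptStore`, `publish`, and `move_label` are hypothetical names invented for illustration. The point it shows is that versions are append-only, and that promotion and rollback are the same operation: moving a label.

```ruby
# Hypothetical in-memory stand-in for a versioned prompt store.
# A real store (Langfuse or similar) lives outside the app and is fetched at runtime.
class InMemoryPromptStore
  def initialize
    @versions = Hash.new { |hash, name| hash[name] = [] } # name => [template, ...]
    @labels = {}                                          # [name, label] => version number
  end

  # Append a new immutable version and return its 1-based version number.
  def publish(name:, template:)
    @versions[name] << template
    @versions[name].size
  end

  # Point a label at an existing version. Promotion and rollback both call this.
  def move_label(name:, label:, version:)
    unless version.between?(1, @versions[name].size)
      raise ArgumentError, "unknown version #{version} for #{name}"
    end
    @labels[[name, label]] = version
  end

  # Resolve the label to a version, then fill in the %{placeholders}.
  def compile(name:, label:, variables: {})
    version = @labels.fetch([name, label])
    { text: format(@versions[name][version - 1], variables), version: version }
  end
end

store = InMemoryPromptStore.new
store.publish(name: "classify_question", template: "Classify: %{question}")
store.publish(name: "classify_question", template: "Pick one category for: %{question}")

# Promote version 2, then roll back to version 1. The calling code never changes.
store.move_label(name: "classify_question", label: "production", version: 2)
promoted = store.compile(name: "classify_question", label: "production",
                         variables: { question: "Where is my invoice?" })

store.move_label(name: "classify_question", label: "production", version: 1)
rolled_back = store.compile(name: "classify_question", label: "production",
                            variables: { question: "Where is my invoice?" })
```

Returning the resolved version alongside the text is a deliberate choice in this sketch: the caller can attach that number to whatever it logs, so a bad answer can later be tied to the exact words the model saw.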
A trace shaped like the actual work
A chat is not a single call. It is a small program. Classify the question, load the<br>right prompt, call a tool or two, then compose an answer.
If your trace is one row, you only know that something happened. If your trace is a<br>tree of calls, you know what actually happened, and you have a database of decisions<br>the model made.
# Before: one log line, no shape<br>[INFO] chat_completed user_id=123 duration_ms=4200 tokens=1840
# After: a tree of decisions<br>trace: "chat"<br>span: load-prompt (version=production:v12)<br>generation: classify-question (model=haiku, category="billing")<br>generation: compose-answer<br>span: tool-call.lookup_invoice (200ms)<br>span: tool-call.lookup_customer (180ms)<br>generation: final-response (model=sonnet, 1.2k tokens)
Each node carries the prompt name and version, the model id, token usage, and a<br>set of metadata fields we control. The customer it ran for, the category the<br>question was classified as, which tools ran, whether the conversation was new.
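As a rough sketch of that shape, a trace can be modeled as a tree of nodes that each carry their own metadata. `TraceNode` below is a hypothetical type made up for this example; real tracing SDKs ship their own span and generation objects. Nesting the tool calls under `compose-answer` is also an assumption about how the steps relate.

```ruby
# Hypothetical node type for illustration only.
TraceNode = Struct.new(:kind, :name, :metadata, :children, keyword_init: true) do
  # Attach a child node and return it, so calls can be nested.
  def add(kind, name, metadata = {})
    child = TraceNode.new(kind: kind, name: name, metadata: metadata, children: [])
    children << child
    child
  end

  # Depth-first walk that yields every node together with its depth.
  def walk(depth = 0, &block)
    block.call(self, depth)
    children.each { |child| child.walk(depth + 1, &block) }
  end

  # Indented text rendering of the tree, one node per line.
  def render
    lines = []
    walk { |node, depth| lines << "#{'  ' * depth}#{node.kind}: #{node.name}" }
    lines.join("\n")
  end
end

# Trace-level metadata holds the dimensions we expect to filter on later.
trace = TraceNode.new(kind: :trace, name: "chat", children: [],
                      metadata: { customer_id: 123, new_conversation: true })

trace.add(:span, "load-prompt", { prompt: "classify_question", version: 12 })
trace.add(:generation, "classify-question", { model: "haiku", category: "billing" })

compose = trace.add(:generation, "compose-answer")
compose.add(:span, "tool-call.lookup_invoice", { duration_ms: 200 })
compose.add(:span, "tool-call.lookup_customer", { duration_ms: 180 })
compose.add(:generation, "final-response", { model: "sonnet" })

# Because the tree is data, questions like "which tools ran?" are a walk, not a grep.
tools_used = []
trace.walk { |node, _depth| tools_used << node.name if node.name.start_with?("tool-call.") }

puts trace.render
```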
That metadata is the part that turned out to matter most.
The first time we filtered traces to “every chat in scope X where a<br>particular tool ran and the user said the answer was bad”, we had a small<br>realization. The trace list was not a log anymore. It was a queryable database<br>of decisions the model made.
The rule we would write on a sticky note: tag your traces with the dimensions<br>you will want to filter on later. It is cheap up front and impossible to add<br>later, once you wish you had it.
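Here is a small sketch of why the tags pay off, using plain Ruby collections in place of an observability backend. `TraceRecord`, its fields, the `-1`/`1` score encoding, and the sample rows are all made up for illustration. The query is the one we could not answer at the start: bad answers where a given tool ran, grouped by prompt version.

```ruby
# Hypothetical flat record: one per trace, carrying only the tags we attached.
TraceRecord = Struct.new(:id, :prompt_version, :category, :tools, :user_score,
                         keyword_init: true)

# Made-up sample data. user_score: -1 = thumbs down, 1 = thumbs up, nil = no feedback.
traces = [
  TraceRecord.new(id: "t1", prompt_version: 11, category: "billing",
                  tools: ["lookup_invoice"], user_score: -1),
  TraceRecord.new(id: "t2", prompt_version: 12, category: "billing",
                  tools: ["lookup_invoice", "lookup_customer"], user_score: 1),
  TraceRecord.new(id: "t3", prompt_version: 12, category: "billing",
                  tools: ["lookup_invoice"], user_score: -1),
  TraceRecord.new(id: "t4", prompt_version: 12, category: "shipping",
                  tools: [], user_score: -1),
  TraceRecord.new(id: "t5", prompt_version: 12, category: "billing",
                  tools: ["lookup_customer"], user_score: nil)
]

# Thumbs-down chats where a particular tool ran, grouped by prompt version.
def bad_chats_by_prompt_version(traces, tool:)
  traces
    .select { |t| t.user_score == -1 && t.tools.include?(tool) }
    .group_by(&:prompt_version)
    .transform_values { |group| group.map(&:id) }
end

result = bad_chats_by_prompt_version(traces, tool: "lookup_invoice")
```

Every field the query touches had to be attached when the trace was written. If `tools` or `prompt_version` had not been recorded at the time, no later query could recover them.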
A score from the user
Every assistant message in the UI has a thumbs up and a thumbs down. When a user<br>clicks one, we save a row and post it...