The Four Signals of AI Observability

A few months ago we shipped a chat experience to production. Users ask a question, our app routes it through an LLM, the model calls a few internal tools, and an answer comes back.

It worked. Sort of.

When the model answered well, we had no idea why. When it answered badly, we had no idea either. The model was a black box attached to our app, and our best debugging tool was reading logs and guessing.

We realized our app could not answer a very normal operational question:

Show us every chat where the user said the answer was bad, group them by which version of the system prompt was loaded, and let us read the whole conversation, including which tools the model called.

It’s the AI equivalent of “show me every 500 error on this endpoint after deploy X.” But our app couldn’t answer it.

That was the trigger to stop looking for a smarter model and start adding an observability layer. We ended up using Langfuse, but the specific vendor matters less than the capabilities. Helicone, Arize Phoenix, LangSmith, and Braintrust all solve versions of the same problem.

After a couple of months of iteration, we noticed that the things we needed came in four flavors. I call them the four signals that every AI feature needs to emit about itself.

A version on every prompt. Which exact words did the model see today?

A trace shaped like the actual work. What did it call, in what order, with what arguments?

A score from the user. Did the human like the result?

A score from another model. When the human is quiet, who is grading?

Of course we can build an AI feature without all four. We just can’t improve it on purpose.

A version on every prompt

The first thing we did was move every prompt out of the code and into a versioned store the app fetches at runtime.

# The code never references a version. It asks for a label.
template = PromptRepo.compile(name: "classify_question", label: "production")

# A human moves "production" between versions in the Langfuse UI.
# Promotion is a click. Rollback is a click. No deploy.

The first time we rolled back a bad prompt by clicking a button instead of reverting a PR and waiting for CI, we knew this was the right shape.
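To make the label idea concrete, here is a minimal in-memory sketch of a versioned prompt store. It is not the Langfuse client and not our PromptRepo. The class name PromptStore, its methods, and the {{variable}} placeholder syntax are assumptions made for illustration. The point it shows is that saving always creates a new version, and that promotion and rollback are the same operation: repointing a label.

```ruby
# Hypothetical in-memory stand-in for a versioned prompt store.
class PromptStore
  Prompt = Struct.new(:name, :version, :template)

  def initialize
    @versions = Hash.new { |hash, name| hash[name] = [] } # name => [Prompt, ...]
    @labels = {}                                          # [name, label] => version number
  end

  # Saving never overwrites: every publish appends a new version.
  def publish(name, template)
    prompt = Prompt.new(name, @versions[name].size + 1, template.dup.freeze)
    @versions[name] << prompt
    prompt.version
  end

  # Promotion and rollback are the same operation: repoint a label.
  def move_label(name, label, version)
    raise ArgumentError, "unknown version #{version} of #{name}" unless find(name, version)

    @labels[[name, label]] = version
  end

  # Resolve the label, fill in variables, and report which version was used.
  def compile(name:, label:, variables: {})
    version = @labels.fetch([name, label]) { raise KeyError, "no #{label} label for #{name}" }
    prompt = find(name, version)
    text = prompt.template.gsub(/\{\{(\w+)\}\}/) do
      variables.fetch(Regexp.last_match(1).to_sym).to_s
    end
    { text: text, version: prompt.version }
  end

  private

  def find(name, version)
    @versions[name].find { |prompt| prompt.version == version }
  end
end

store = PromptStore.new
v1 = store.publish("classify_question", "Classify: {{question}}")
v2 = store.publish("classify_question", "Pick one category for: {{question}}")

store.move_label("classify_question", "production", v2) # promote the new wording
store.move_label("classify_question", "production", v1) # roll back, no deploy

compiled = store.compile(
  name: "classify_question",
  label: "production",
  variables: { question: "Where is my invoice?" }
)
puts "v#{compiled[:version]}: #{compiled[:text]}"
```

Returning the resolved version alongside the text is a deliberate choice in this sketch: the caller can stamp it on the trace, which is what makes "group by prompt version" possible later.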

Once prompts became content, the people closest to the problem became the people writing the prompts. The feedback loop got much shorter, and the quality went up.

A trace shaped like the actual work

A chat is not a single call. It is a small program: classify the question, load the right prompt, call a tool or two, then compose an answer.

If your trace is one row, you only know that something happened. If your trace is a tree of calls, you know what actually happened, and you have a database of decisions the model made.

# Before: one log line, no shape
[INFO] chat_completed user_id=123 duration_ms=4200 tokens=1840

# After: a tree of decisions
trace: "chat"
  span: load-prompt (version=production:v12)
  generation: classify-question (model=haiku, category="billing")
  generation: compose-answer
    span: tool-call.lookup_invoice (200ms)
    span: tool-call.lookup_customer (180ms)
    generation: final-response (model=sonnet, 1.2k tokens)

Each node carries the prompt name and version, the model id, token usage, and a set of metadata fields we control: the customer it ran for, the category the question was classified as, which tools ran, and whether the conversation was new.
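As a rough illustration of what "a tree of calls" means in code, here is a small recorder in which every node keeps its children and its own metadata. It is a sketch under assumptions, not the Langfuse SDK: TraceNode, child, and to_lines are hypothetical names, the model calls are replaced by blocks that return canned values, and nesting the tool calls under compose-answer is one plausible shape rather than our exact trace.

```ruby
require "securerandom"

# Hypothetical recorder: each node keeps its children, so a chat is
# stored as a tree instead of one log line.
class TraceNode
  attr_reader :id, :kind, :name, :metadata, :children

  def initialize(kind, name, metadata = {})
    @id = SecureRandom.uuid
    @kind = kind
    @name = name
    @metadata = metadata.dup
    @children = []
  end

  # Open a child node, run the block inside it, and record how long it took.
  def child(kind, name, metadata = {})
    node = TraceNode.new(kind, name, metadata)
    @children << node
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = block_given? ? yield(node) : nil
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    node.metadata[:duration_ms] = (elapsed * 1000).round
    result
  end

  # Render the tree with two spaces of indentation per level.
  def to_lines(depth = 0)
    ["#{'  ' * depth}#{kind}: #{name}"] + children.flat_map { |node| node.to_lines(depth + 1) }
  end
end

trace = TraceNode.new(:trace, "chat", { customer_id: 123, new_conversation: true })
trace.child(:span, "load-prompt", { prompt: "classify_question", version: 12 })

# The blocks stand in for real model and tool calls.
category = trace.child(:generation, "classify-question", { model: "haiku" }) { "billing" }
trace.metadata[:category] = category # a dimension we will want to filter on later

trace.child(:generation, "compose-answer", { model: "sonnet" }) do |compose|
  compose.child(:span, "tool-call.lookup_invoice") { :invoice }
  compose.child(:span, "tool-call.lookup_customer") { :customer }
end

puts trace.to_lines
```

Because child returns the block's result, wrapping an existing call in a span does not change the calling code's control flow, which keeps instrumentation cheap to add.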

That metadata is the part that turned out to matter most.

The first time we filtered traces to “every chat in scope X where a particular tool ran and the user said the answer was bad”, we had a small realization. The trace list was not a log anymore. It was a queryable database of decisions the model made.

The rule we would write on a sticky note: tag your traces with the dimensions you will want to filter on later. It is cheap up front and impossible to add after the fact, once you wish you had it.
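Once traces carry those dimensions, the question from the start of this post becomes a short query. The sketch below runs it over plain hashes. The field names (prompt_version, tools, user_score) and the sample rows are made up for illustration, and a real observability tool would run the filter server-side instead of in Ruby.

```ruby
# Hypothetical exported traces; field names are illustrative, not a vendor schema.
traces = [
  { id: "t1", prompt_version: 11, tools: ["lookup_invoice"], user_score: -1 },
  { id: "t2", prompt_version: 12, tools: ["lookup_invoice", "lookup_customer"], user_score: 1 },
  { id: "t3", prompt_version: 12, tools: ["lookup_customer"], user_score: -1 },
  { id: "t4", prompt_version: 12, tools: ["lookup_invoice"], user_score: -1 },
  { id: "t5", prompt_version: 12, tools: [], user_score: nil } # user never voted
]

# Every chat where a given tool ran and the user gave a thumbs down,
# grouped by the prompt version that was loaded.
def bad_chats_by_version(traces, tool:)
  traces
    .select { |trace| trace[:user_score] == -1 && trace[:tools].include?(tool) }
    .group_by { |trace| trace[:prompt_version] }
    .transform_values { |group| group.map { |trace| trace[:id] } }
end

bad_chats_by_version(traces, tool: "lookup_invoice").each do |version, ids|
  puts "prompt v#{version}: #{ids.join(', ')}"
end
```

With the sample rows, the query returns t1 under version 11 and t4 under version 12. A trace missing any one of the three fields would silently drop out of the result, which is the practical reason to tag early.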

A score from the user

Every assistant message in the UI has a thumbs up and a thumbs down. When a user clicks one, we save a row and post it...
