The LLM Narrates. The Code Decides. | Irin Observability
The LLM narrates. The code decides.
June 25, 2026
Most of the “AI for observability” work I see right now hands the language model the judgment: feed it the alert, feed it some metrics, ask it what’s wrong and what should be done, and let it make the call. I think that’s backwards. Based on my experience working with language models, inverting the process gives better results.
The short version: in my alerting pipeline, the set of allowable classifications is fixed in deterministic Python, and the model has to pick from it. The LLM’s only job is to turn a structured verdict into an easily digestible sentence. It never decides whether something is bad, how bad it is, or what category of problem it is. It narrates within a decision space the code has already locked down.
// THE PROBLEM
I run a small managed monitoring service. Alertmanager fires, a webhook lands, and historically that webhook produced a line like HighMemoryUsage on host web-vm, severity warning, which is accurate, but not terribly helpful. The person reading it still has to know what HighMemoryUsage implies, whether this host always runs hot, and whether to care. I wanted plain-English context attached to the alerts without altering the alert delivery process.
The obvious move was to throw the whole alert at an LLM and ask it to explain. I tried that in the first iteration of this experiment, expecting it to be somewhat accurate but not entirely reliable, and it did not disappoint: the model was confidently inconsistent. The same alert, fired three times, produced three different “root cause” categories. One run called a test alert a “Configuration or setup issue,” the next called it “Configuration/Testing,” the next something else again. If you’re storing that output to do any kind of aggregation later (I am; I want to know when three different clients hit the same class of problem in the same week), free-form model output fragments into noise. Grouping on a field the model changes at random won’t work.
I kind of knew from the outset that I wouldn’t get amazing results, and that it would be harder than it looked. So I did some research and decided to flip the design. The little voice in the back of my head was right all along: don’t let the model make the decisions.
// THE SPLIT
I created a pipeline that has a hard wall down the middle.
On the deterministic side, Python does the classifying. The output is constrained to an eight-value enum: memory_pressure, cpu_saturation, disk_pressure, service_unavailability, network_issue, configuration_error, external_dependency, unknown. I aggregate on that field because it can only ever be one of eight strings. If nothing fits, the answer is unknown, which is itself a useful signal rather than a hallucinated/variant guess.
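As a rough illustration of why a closed set aggregates cleanly, here is a minimal sketch of the eight values as a Python enum. The names `CauseClass` and `count_by_class`, and the `(client, cause_class)` row shape, are assumptions made for the example, not the service's actual code.

```python
from collections import Counter
from enum import Enum


class CauseClass(str, Enum):
    """The eight allowed classifications (values taken from the list above)."""
    MEMORY_PRESSURE = "memory_pressure"
    CPU_SATURATION = "cpu_saturation"
    DISK_PRESSURE = "disk_pressure"
    SERVICE_UNAVAILABILITY = "service_unavailability"
    NETWORK_ISSUE = "network_issue"
    CONFIGURATION_ERROR = "configuration_error"
    EXTERNAL_DEPENDENCY = "external_dependency"
    UNKNOWN = "unknown"


def count_by_class(rows):
    """Count stored rows per class.

    `rows` is a hypothetical iterable of (client, cause_class) pairs.
    CauseClass(...) raises ValueError for any string outside the enum,
    so a variant spelling can never become its own bucket.
    """
    return Counter(CauseClass(cause) for _client, cause in rows)
```

With this shape, a free-form label such as `CauseClass("Configuration/Testing")` raises `ValueError` instead of quietly creating a ninth category.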
On the narrative side, the LLM (llama3:8b, running locally on a box on my own LAN, so no alert data leaves the network) must choose its classification from that fixed eight-value set, and it writes two short fields alongside it: what the alert is, in plain English, and what it means operationally. The code defines the shape of the answer; the model only fills in a slot that already exists. It is explicitly instructed not to suggest fixes and not to invent a cause, so it performs translation instead of analysis.
The model returns strict, grammar-constrained JSON, so I get {what, means, likely_cause_class} every time, and the enum value is validated against the allowed set on the way out. If the model returns something off-list, I capture it as a bug instead of storing a row.
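The validation step might look something like the sketch below. `parse_narration` and `NarrationError` are hypothetical names, and how the rejected reply gets recorded is left to the caller; only the three field names and the eight allowed values come from the text.

```python
import json

ALLOWED_CLASSES = frozenset({
    "memory_pressure", "cpu_saturation", "disk_pressure",
    "service_unavailability", "network_issue", "configuration_error",
    "external_dependency", "unknown",
})
REQUIRED_KEYS = ("what", "means", "likely_cause_class")


class NarrationError(ValueError):
    """Raised when the model's reply must not be stored as a row."""


def parse_narration(raw: str) -> dict:
    """Parse the model's JSON reply and reject anything off-contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise NarrationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise NarrationError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise NarrationError(f"missing fields: {missing}")
    cause = data["likely_cause_class"]
    # The isinstance check also keeps unhashable values away from the set lookup.
    if not isinstance(cause, str) or cause not in ALLOWED_CLASSES:
        raise NarrationError(f"off-list cause class: {cause!r}")
    # Keep only the agreed fields; anything extra the model added is dropped.
    return {key: data[key] for key in REQUIRED_KEYS}
```

The caller catches `NarrationError`, records the raw reply for debugging, and skips the insert, so the stored field stays limited to the eight strings.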
// CONTEXT HYDRATION
A naive version of even the narration step gets you alarmist prose. When tuning the system I received a DiskFillPredicted alert, which on its face looked worth investigating. Then I looked at the host in question, which has had a flat disk-utilization baseline for months. The prediction was a rounding artifact, so “your disk is about to fill” was actively misleading. The model had no way to know that from the alert alone, so it just wrote something.
I fixed it by giving the model the same context a human would look at before reacting. Prior to the LLM call, Python does a fast lookup against the metrics backend for that host’s recent baseline, and the prompt carries an explicit rule: if historical context is provided, weigh it over the alert’s literal text. A predicted-disk-fill on a host with a stable months-long baseline is informational, not urgent.
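A minimal sketch of that hydration step, under stated assumptions: `lookup` stands in for whatever function queries the metrics backend (it is not a real client API), the time budget is a parameter, and the prompt layout and the names `lookup_with_budget` and `build_prompt` are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout

CONTEXT_RULE = (
    "If historical context is provided, weigh it over the alert's literal text."
)


def lookup_with_budget(lookup, host, budget_s):
    """Run lookup(host) in a worker thread; return None if it overruns the budget."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(lookup, host).result(timeout=budget_s)
    except LookupTimeout:
        return None  # too slow: narrate without history rather than hold the alert
    finally:
        pool.shutdown(wait=False)  # do not block on a lookup that is still running


def build_prompt(alert_text, baseline=None):
    """Assemble the narration prompt, adding history only when it is available."""
    parts = [CONTEXT_RULE, f"Alert: {alert_text}"]
    if baseline is not None:
        parts.append(f"Historical context: {baseline}")
    return "\n".join(parts)
```

If the lookup times out, the prompt is built without a history line, so the model narrates from the alert text alone rather than from a made-up baseline.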
The latency budget for that lookup is five seconds. The LLM call itself takes about eighty seconds, because this runs deliberately on modest CPU-only hardware, a 4th generation Intel i7 with 16GB RAM and no GPU. That is a choice, not a constraint I am apologizing for: the whole posture of the service is that nothing leaves my LAN, so a slow local...