Stop Monitoring AI Systems Like Web Services

SwirlAI Newsletter


Five questions every AI system has to answer, and the metrics that answer them.

Aurimas Griciūnas Jun 14, 2026

👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in AI Engineering, Data Engineering, Machine Learning and overall Data space. SwirlAI Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a subscriber.

Many AI systems are still monitored like the web services they sit next to. The API gateway emits uptime, error rates, and latency percentiles, and the dashboards come free with the infrastructure. Unfortunately, none of those numbers can tell you that users stare at a blank screen for four seconds before the first token is displayed, that token spend per task has doubled since the last prompt update, or that the model has started inventing answers around the retrieved context instead of grounding them in it.

The gap exists because an LLM system breaks the assumptions web monitoring was built on. Responses are generated token by token, so “latency” is at least three different numbers depending on where on the timeline you stand. Cost scales with tokens rather than requests. And the most damaging failures are silent: a quality regression does not throw a 500, it returns confident text with a 200 status code.

I find it helps to group metrics by the question they answer. Five questions cover most of what goes wrong in production: is it fast, can it scale, is it correct, does it hold up, and, when there is an agent in the loop, how does it behave? This article walks through each group, what the metrics mean mechanically, and which ones you have to build yourself because nothing emits them by default.

The metrics map

Before moving to the questions: next week on Thursday I will be running a workshop, From AI Demo to Deployed App. AI engineers can build the backend, but most stop when it’s time to put a usable frontend in front of real users. Join me and learn how to:

Port a Streamlit prototype or a vibecoded frontend app to v0

Ship the app to production on Vercel

Ship a new UI feature on top of an existing backend

Is It Fast? (Latency)

An LLM request has two phases, and they produce different latency numbers. During prefill the model ingests the entire prompt and builds its internal state, with nothing visible to the user while it happens. During decode the model generates output one token at a time. Every latency metric worth tracking is a position on that timeline.

Time to first token (TTFT) is the time from sending the request to the first token arriving, which is queueing time plus prefill. In a streaming UI this is the number users feel (perceived latency), because it is exactly how long they look at a blank screen. TTFT grows with prompt length, which is why RAG systems that pack large contexts into the prompt pay for it in perceived speed.

Inter-token latency (ITL, also reported as time per output token, TPOT) is the gap between consecutive tokens once streaming starts, and it determines whether the output reads as smoothly flowing text. Users tolerate a slow but steady stream far better than a fast one that freezes.

End-to-end latency at p50, p95, and p99 is the full span, and output length dominates it. That makes a single global percentile close to meaningless: it mixes 50-token classification calls with 2,000-token report generations in one distribution. Track end-to-end latency per use case, so each number has one workload behind it.

Agents add a compounding effect. A task that chains several sequential LLM calls adds up the per-call latencies, so a tolerable per-call p95 can become an intolerable task-level latency. For agentic workloads, set the latency budget at the task level and let it constrain the steps.
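The TTFT, ITL, and end-to-end definitions above can be sketched as arithmetic over client-side timestamps. This is a minimal illustration, not how any particular serving stack reports these numbers: `StreamTiming`, `summarize_stream`, and the timestamp inputs are hypothetical names, and the sketch assumes you record a monotonic-clock timestamp when the request is sent and another as each token (or chunk) arrives.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StreamTiming:
    ttft_s: float      # request sent -> first token (queueing + prefill)
    mean_itl_s: float  # average gap between consecutive tokens (decode)
    max_itl_s: float   # longest single gap, i.e. the worst visible freeze
    e2e_s: float       # request sent -> last token


def summarize_stream(request_sent_s: float, token_arrivals_s: list[float]) -> StreamTiming:
    """Derive latency metrics for one streamed response.

    Both arguments are assumed inputs: timestamps in seconds from a
    monotonic clock, captured on the client side.
    """
    if not token_arrivals_s:
        raise ValueError("no tokens arrived, so TTFT is undefined")
    # Gaps between consecutive token arrivals; empty for a one-token response.
    gaps = [later - earlier
            for earlier, later in zip(token_arrivals_s, token_arrivals_s[1:])]
    return StreamTiming(
        ttft_s=token_arrivals_s[0] - request_sent_s,
        mean_itl_s=mean(gaps) if gaps else 0.0,
        max_itl_s=max(gaps) if gaps else 0.0,
        e2e_s=token_arrivals_s[-1] - request_sent_s,
    )


# Illustrative timestamps: the stream starts after 0.5 s, then stalls for 1 s.
timing = summarize_stream(10.0, [10.5, 10.75, 11.0, 12.0])
print(timing)
```

Tracking the maximum gap next to the mean is a design choice that follows from the point about freezes: a stream with a healthy average ITL can still contain one long stall, and the mean alone hides it.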

Inference latency metrics
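To make the per-use-case and task-level points concrete, here is a small sketch. The sample latencies are invented for illustration, the function names are hypothetical, and the nearest-rank percentile is just one simple definition; monitoring backends often interpolate or use histogram buckets instead. Splitting a task budget evenly across steps is likewise only the simplest possible policy.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p95_by_use_case(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Group (use_case, end_to_end_seconds) samples and compute p95 per group."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for use_case, latency_s in samples:
        grouped[use_case].append(latency_s)
    return {use_case: percentile(vals, 95) for use_case, vals in grouped.items()}


def per_step_budget_s(task_budget_s: float, sequential_calls: int) -> float:
    """Split a task-level latency budget evenly across sequential LLM calls."""
    if sequential_calls < 1:
        raise ValueError("a task needs at least one call")
    return task_budget_s / sequential_calls


# Invented samples: short classification calls mixed with long report generations.
samples = [
    ("classify", 0.2), ("classify", 0.3), ("classify", 0.25), ("classify", 0.4),
    ("report", 18.0), ("report", 22.0), ("report", 25.0), ("report", 30.0),
]
per_use_case = p95_by_use_case(samples)
global_p50 = percentile([latency for _, latency in samples], 50)
step_budget = per_step_budget_s(10.0, 5)

print(per_use_case)  # one number per workload
print(global_p50)    # describes neither workload well
print(step_budget)   # seconds each of 5 sequential calls may take
```

With these numbers the global median lands on a classification call and says nothing about report generation, while the per-use-case view keeps the two workloads apart. The budget helper runs the agent argument in reverse: start from what the task may take, then derive what each step may take.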

Can It Scale? (Throughput and Cost)

Tokens per second per user versus total system throughput. Serving systems batch concurrent requests on the same hardware, and the batch size is the lever you can adjust: larger batches improve total system tokens per second while each individual stream slows down. The two metrics trade off against each other on the same GPUs, so decide per workload which one you are protecting. Interactive chat should favour per-user speed; offline batch processing should favour total throughput.

Input and output tokens per request. There is not much to add here: this is the core of unit economics, with output tokens typically priced at a multiple of input tokens. Log both per request, break them down by use case, and monitor them carefully.

Cache hit rate. Prompt caching makes repeated prompt prefixes dramatically cheaper to process, it is one of the largest...
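The token accounting described above can be sketched as a per-request cost calculation. Everything specific here is an assumption for illustration: the prices are made up, the `Usage` record and function names are hypothetical, and the cache hit rate is defined at the token level (cached input tokens over all input tokens), which is only one reasonable definition. Substitute your provider's actual rates and usage fields.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; not any provider's real rates.
INPUT_PRICE_PER_MTOK = 2.0
CACHED_INPUT_PRICE_PER_MTOK = 0.5
OUTPUT_PRICE_PER_MTOK = 8.0


@dataclass
class Usage:
    use_case: str
    input_tokens: int         # all prompt tokens, cached ones included
    cached_input_tokens: int  # prompt tokens served from the prompt cache
    output_tokens: int


def request_cost_usd(usage: Usage) -> float:
    """Cost of one request, pricing cached and uncached input separately."""
    uncached = usage.input_tokens - usage.cached_input_tokens
    total = (
        uncached * INPUT_PRICE_PER_MTOK
        + usage.cached_input_tokens * CACHED_INPUT_PRICE_PER_MTOK
        + usage.output_tokens * OUTPUT_PRICE_PER_MTOK
    )
    return total / 1_000_000


def cache_hit_rate(usages: list[Usage]) -> float:
    """Token-level hit rate: share of input tokens that came from the cache."""
    total_input = sum(u.input_tokens for u in usages)
    if total_input == 0:
        return 0.0
    return sum(u.cached_input_tokens for u in usages) / total_input


def cost_by_use_case(usages: list[Usage]) -> dict[str, float]:
    """Sum request costs per use case so each workload has its own number."""
    totals: dict[str, float] = {}
    for u in usages:
        totals[u.use_case] = totals.get(u.use_case, 0.0) + request_cost_usd(u)
    return totals


log = [
    Usage("chat", input_tokens=10_000, cached_input_tokens=8_000, output_tokens=500),
    Usage("report", input_tokens=4_000, cached_input_tokens=0, output_tokens=2_000),
]
print(cost_by_use_case(log))
print(cache_hit_rate(log))
```

In this made-up log, the chat request spends as much on 500 output tokens as on 2,000 uncached input tokens, which is the input/output price multiple showing up in a single request.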
