Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp - Rost Glukhov | Personal site and technical blog
LLM inference looks like “just another API” — until latency spikes, queues back up, and your GPUs sit at 95% memory with no obvious explanation.
Monitoring becomes mission-critical the moment you move beyond a single-node setup or start optimizing for throughput. At that point, traditional API metrics aren’t enough. You need visibility into tokens, batching behavior, queue time, and KV cache pressure - the real bottlenecks of modern LLM systems.
This article is part of my broader observability and monitoring guide, where I cover monitoring vs observability fundamentals, Prometheus architecture, and production best practices. Here, we’ll focus specifically on monitoring LLM inference workloads.
(If you’re deciding on infrastructure, see my guide to LLM hosting in 2026. If you want a deep dive into batching mechanics, VRAM limits, and throughput vs latency trade-offs, see the LLM performance engineering guide.)
Unlike typical REST services, LLM serving is shaped by tokens, continuous batching, KV cache utilization, GPU/CPU saturation, and queue dynamics. Two requests with identical payload sizes can have radically different latency depending on max_new_tokens, concurrency, and cache reuse.
This guide is a practical, production-focused walkthrough for building LLM inference monitoring with Prometheus and Grafana:
What to measure (p95/p99 latency, tokens/sec, queue duration, cache utilization, error rate)
How to scrape /metrics from common servers (vLLM, Hugging Face TGI, llama.cpp)
PromQL examples for percentiles, saturation, and throughput
Deployment patterns with Docker Compose and Kubernetes
Troubleshooting the issues that only appear under real load
The examples are intentionally vendor-neutral. Whether you later add OpenTelemetry tracing, autoscaling, or a service mesh, the same metric model applies.
Why you should monitor LLM inference differently
Traditional API monitoring (RPS, p95 latency, error rate) is necessary but not sufficient. LLM serving adds several more axes:
1) Latency has two meanings
E2E latency : time from request received → final token returned.
Inter-token latency : time per token during decode (critical for streaming UX).
Some servers expose both. For example, TGI exposes request duration and mean time-per-token as histograms.
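To make the two meanings concrete, here is a minimal sketch that derives both numbers from the timestamps of a single streamed response. The function name and the sample timestamps are hypothetical and are not part of any server's API. A real server computes these values internally and exports them as histograms.

```python
from typing import Dict, Sequence


def latency_summary(request_received: float, token_times: Sequence[float]) -> Dict[str, float]:
    """Return E2E latency and mean inter-token latency, both in seconds.

    request_received: time the request arrived (hypothetical clock, seconds).
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # E2E latency: request received -> final token returned.
    e2e = token_times[-1] - request_received
    # Inter-token latency: gap between consecutive tokens during decode.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {"e2e_s": e2e, "mean_inter_token_s": mean_gap}


# Illustrative numbers: first token after 0.5 s, then one token every 0.25 s.
summary = latency_summary(0.0, [0.5, 0.75, 1.0, 1.25])
print(summary)
```

The same request can look fine on one number and poor on the other, which is why it helps to chart both.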
2) Throughput is in tokens, not requests
A “fast” service that returns 5 tokens is not comparable to one that returns 500. The throughput number to watch is often tokens/sec rather than requests/sec.
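As a rough illustration, the sketch below turns two samples of a monotonically increasing counter into a per-second rate, which is close in spirit to what PromQL's rate() does over a window. The counter values are made up, and the reset handling is simplified to "treat a drop as a restart from zero".

```python
def rate_per_second(prev_value: float, prev_ts: float,
                    curr_value: float, curr_ts: float) -> float:
    """Per-second increase of a counter between two samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, so assume the process restarted
        # and the counter began again from zero.
        delta = curr_value
    return delta / elapsed


# Two hypothetical services, both sampled 60 s apart, both at 1 request/sec.
requests_rate = rate_per_second(100, 0.0, 160, 60.0)
short_answers = rate_per_second(0, 0.0, 300, 60.0)     # about 5 tokens per request
long_answers = rate_per_second(0, 0.0, 30_000, 60.0)   # about 500 tokens per request

print(requests_rate, short_answers, long_answers)
```

Both services report the same RPS, yet one does a hundred times more decode work per second.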
3) The queue is the product
If you run continuous batching, queue depth is what you sell. Watching queue duration and queue size tells you whether you’re meeting user expectations.
4) Cache pressure is an outage precursor
KV cache exhaustion (or fragmentation) often shows up as sudden latency spikes and timeouts. vLLM exposes KV cache usage as a gauge.
Metrics checklist for LLM inference monitoring
Use this as your north star. You don’t need everything on day one—but you’ll want most of it eventually.
Golden signals (LLM-flavored)
Traffic: requests/sec, tokens/sec
Errors: error rate, timeouts, OOMs, 429s (rate limiting)
Latency: p50/p95/p99 request duration; prefill vs decode latency; inter-token latency
Saturation: GPU utilization, memory usage, KV cache usage, queue size
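The percentiles in this list are usually estimates derived from histogram buckets, not exact values. The sketch below shows the idea with linear interpolation inside the bucket that contains the target rank, which is the approach Prometheus's histogram_quantile() takes for classic histograms. The bucket boundaries and counts are invented for illustration, and the first bucket is assumed to start at 0.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, with the last upper bound equal to +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The rank falls in the overflow bucket, so the best
                # available answer is the highest finite bound.
                return prev_bound
            # Interpolate linearly within this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical request-duration buckets: 100 requests in total.
duration_buckets = [(0.5, 50), (1.0, 80), (2.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, duration_buckets)
p95 = estimate_quantile(0.95, duration_buckets)
print(p50, p95)
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a p99 that lands in a very wide bucket can be off by a large margin.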
If you need low-level visibility into GPU memory usage, temperature, and utilization outside of Prometheus (for debugging or single-node setups), see my guide to GPU monitoring applications in Linux / Ubuntu.
For a broader view of LLM observability beyond metrics — including tracing, structured logs, synthetic testing, GPU profiling, and SLO design — see my in-depth guide on observability for LLM systems.
Useful dimensions (labels)
Keep label cardinality low. Good labels:
model, endpoint, method (prefill/decode), status (success/error), instance
Avoid labels like:
raw prompt, raw user_id, request ids — these explode series count.
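A quick back-of-the-envelope check shows why. In the worst case, the number of time series for one metric is the product of the distinct values of each label. The label counts below are invented for illustration.

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series count: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical deployment: 3 models, 2 endpoints, 2 phases, 2 statuses, 4 instances.
good_labels = {"model": 3, "endpoint": 2, "method": 2, "status": 2, "instance": 4}
# Same labels plus a per-user label with 10,000 distinct users.
bad_labels = {**good_labels, "user_id": 10_000}

good_series = worst_case_series(good_labels)
bad_series = worst_case_series(bad_labels)
print(good_series, bad_series)
```

This is an upper bound for a single metric. Histograms multiply it again by the number of buckets, so a single unbounded label can dominate the whole Prometheus instance.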
Exposing metrics: built-in /metrics endpoints (vLLM, TGI, llama.cpp)
The easiest path is: use the metrics the server already exposes.
vLLM: Prometheus-compatible /metrics
vLLM exposes a Prometheus-compatible /metrics endpoint (via its Prometheus metrics logger) and publishes server/request metrics with the vllm: prefix, including gauges like running requests and KV cache usage.
For container setup, OpenAI-compatible serving, and throughput-oriented runtime tuning, see vLLM Quickstart: High-Performance LLM Serving.
Example metrics you’ll typically see:
vllm:num_requests_running
vllm:num_requests_waiting
vllm:kv_cache_usage_perc
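Before wiring up Prometheus, it can help to read these gauges by hand. The sketch below parses a small piece of Prometheus text-format output and flags queueing combined with a nearly full KV cache. The sample payload, the model_name label value, and the 0.9 threshold are assumptions for illustration. The parser is deliberately simplified: it assumes one series per metric, no trailing timestamps, and that the cache gauge is a fraction between 0 and 1.

```python
from typing import Dict

# Hypothetical excerpt of what a /metrics response might contain.
SAMPLE_METRICS = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 4.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 12.0
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{model_name="example-model"} 0.93
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map metric name -> value, ignoring labels and comment lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        samples[name] = float(value)
    return samples


samples = parse_samples(SAMPLE_METRICS)
# Assumed heuristic: requests are waiting while the cache is over 90% full.
under_pressure = (
    samples["vllm:num_requests_waiting"] > 0
    and samples["vllm:kv_cache_usage_perc"] > 0.9
)
print(samples, under_pressure)
```

For anything beyond a quick check, a real parser such as the one in the prometheus_client package is a safer choice than this hand-rolled version.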
Hugging Face TGI: /metrics with queue + request histograms
TGI exposes many production-grade metrics on /metrics, including queue size, request duration, queue duration, and mean time per token.
Notable ones:
tgi_queue_size (gauge)
tgi_request_duration (histogram, e2e...