Monitoring LLM Inference with Prometheus and Grafana (vLLM, TGI, Llama.cpp)

Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp - Rost Glukhov | Personal site and technical blog

LLM inference looks like “just another API” — until latency spikes, queues back up, and your GPUs sit at 95% memory with no obvious explanation.

Monitoring becomes mission-critical the moment you move beyond a single-node setup or start optimizing for throughput. At that point, traditional API metrics aren’t enough. You need visibility into tokens, batching behavior, queue time, and KV cache pressure: the real bottlenecks of modern LLM systems.

This article is part of my broader observability and monitoring guide, where I cover monitoring vs observability fundamentals, Prometheus architecture, and production best practices. Here, we’ll focus specifically on monitoring LLM inference workloads.

(If you’re deciding on infrastructure, see my guide to LLM hosting in 2026. If you want a deep dive into batching mechanics, VRAM limits, and throughput vs latency trade-offs, see the LLM performance engineering guide.)

Unlike typical REST services, LLM serving is shaped by tokens, continuous batching, KV cache utilization, GPU/CPU saturation, and queue dynamics. Two requests with identical payload sizes can have radically different latency depending on max_new_tokens, concurrency, and cache reuse.

This guide is a practical, production-focused walkthrough for building LLM inference monitoring with Prometheus and Grafana:

What to measure (p95/p99 latency, tokens/sec, queue duration, cache utilization, error rate)

How to scrape /metrics from common servers (vLLM, Hugging Face TGI, llama.cpp)

PromQL examples for percentiles, saturation, and throughput

Deployment patterns with Docker Compose and Kubernetes

Troubleshooting the issues that only appear under real load

The examples are intentionally vendor-neutral. Whether you later add OpenTelemetry tracing, autoscaling, or a service mesh, the same metric model applies.

Why you should monitor LLM inference differently

Traditional API monitoring (RPS, p95 latency, error rate) is necessary but not sufficient. LLM serving adds four more axes:

1) Latency has two meanings

E2E latency: time from request received → final token returned.

Inter-token latency: time per token during decode (critical for streaming UX).

Some servers expose both. For example, TGI exposes request duration and mean time-per-token as histograms.
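To make the two definitions concrete, here is a minimal sketch that derives both numbers from per-token timestamps for a single request. The `RequestTiming` type and its field names are hypothetical and are not part of any server's API; real servers compute these values internally and export them as histograms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    # Hypothetical record of one streamed request (all times in seconds).
    received_at: float        # when the server accepted the request
    token_times: List[float]  # when each output token was emitted


def e2e_latency(t: RequestTiming) -> float:
    """Time from request received to final token returned."""
    return t.token_times[-1] - t.received_at


def mean_inter_token_latency(t: RequestTiming) -> float:
    """Average gap between consecutive output tokens during decode."""
    gaps = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Illustrative numbers: first token after 0.5 s, then one token every 0.25 s.
timing = RequestTiming(received_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(e2e_latency(timing))               # 1.25
print(mean_inter_token_latency(timing))  # 0.25
```

The same request can look fine on one axis and poor on the other: a long wait before the first token inflates E2E latency even when tokens then stream quickly.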

2) Throughput is in tokens, not requests

A “fast” service that returns 5 tokens is not comparable to one returning 500 tokens. Your “RPS” should often be “tokens/sec”.
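A small arithmetic sketch of why the two rates diverge. The counter values are made up, and the helper assumes a monotonically increasing counter with no resets inside the window (a real rate function also has to handle resets).

```python
def per_second_rate(count_start: float, count_end: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over a window (no reset handling)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (count_end - count_start) / window_seconds


WINDOW = 60.0  # seconds

# Service A: 600 requests, 5 output tokens each (hypothetical numbers).
a_rps = per_second_rate(0, 600, WINDOW)
a_tps = per_second_rate(0, 600 * 5, WINDOW)

# Service B: 60 requests, 500 output tokens each (hypothetical numbers).
b_rps = per_second_rate(0, 60, WINDOW)
b_tps = per_second_rate(0, 60 * 500, WINDOW)

print(a_rps, a_tps)  # 10.0 50.0
print(b_rps, b_tps)  # 1.0 500.0
```

Service A wins on requests/sec by 10x, while service B is doing 10x more generation work per second.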

3) The queue is the product

If you run continuous batching, queue depth is what you sell. Watching queue duration and queue size tells you whether you’re meeting user expectations.

4) Cache pressure is an outage precursor

KV cache exhaustion (or fragmentation) often shows up as sudden latency spikes and timeouts. vLLM exposes KV cache usage as a gauge.

Metrics checklist for LLM inference monitoring

Use this as your north star. You don’t need everything on day one—but you’ll want most of it eventually.

Golden signals (LLM-flavored)

Traffic: requests/sec, tokens/sec

Errors: error rate, timeouts, OOMs, 429s (rate limiting)

Latency: p50/p95/p99 request duration; prefill vs decode latency; inter-token latency

Saturation: GPU utilization, memory usage, KV cache usage, queue size
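The p50/p95/p99 figures in the list above are usually estimated from histogram buckets rather than computed from raw samples. The sketch below mirrors the idea behind Prometheus's `histogram_quantile`: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration, not the Prometheus implementation, and the bucket bounds and counts are invented.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
             ending with an upper bound of +inf (like the le="+Inf" bucket).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets, in seconds.
duration_buckets = [(0.5, 50), (1.0, 90), (2.0, 98), (math.inf, 100)]
p50 = estimate_quantile(0.50, duration_buckets)
p95 = estimate_quantile(0.95, duration_buckets)
print(p50, p95)
```

Because the result is interpolated, its accuracy depends on bucket boundaries: if the SLO threshold sits in the middle of a wide bucket, the reported p95 can be noticeably off.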

If you need low-level visibility into GPU memory usage, temperature, and utilization outside of Prometheus (for debugging or single-node setups), see my guide to GPU monitoring applications in Linux / Ubuntu.

For a broader view of LLM observability beyond metrics — including tracing, structured logs, synthetic testing, GPU profiling, and SLO design — see my in-depth guide on observability for LLM systems.

Useful dimensions (labels)

Keep label cardinality low. Good labels:

model, endpoint, method (prefill/decode), status (success/error), instance

Avoid labels like:

raw prompt, raw user_id, request ids — these explode series count.
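A rough way to see the explosion: the worst-case number of time series for one metric is the product of the distinct values of each label. The cardinalities below are illustrative assumptions, not measurements.

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for the "good" labels listed above.
good_labels = {"model": 3, "endpoint": 2, "method": 2, "status": 2, "instance": 4}

# Same metric with a raw user_id label added (assume 10,000 distinct users).
bad_labels = dict(good_labels, user_id=10_000)

print(max_series(good_labels))  # 96
print(max_series(bad_labels))   # 960000
```

For histograms the real count is higher still, since every bucket is its own series.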

Exposing metrics: built-in /metrics endpoints (vLLM, TGI, llama.cpp)

The easiest path is: use the metrics the server already exposes.

vLLM: Prometheus-compatible /metrics

vLLM exposes a Prometheus-compatible /metrics endpoint (via its Prometheus metrics logger) and publishes server/request metrics with the vllm: prefix, including gauges like running requests and KV cache usage.

For container setup, OpenAI-compatible serving, and throughput-oriented runtime tuning, see vLLM Quickstart: High-Performance LLM Serving.

Example metrics you’ll typically see:

vllm:num_requests_running

vllm:num_requests_waiting

vllm:kv_cache_usage_perc
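For a quick sanity check without a full Prometheus setup, the three gauges above can be read straight from the text exposition output. The sketch below parses an embedded sample rather than calling a live server. The `model_name` label, the model name, and all values in the sample are assumptions for illustration, and the parser is deliberately simplified (it ignores timestamps and keeps only one value per metric name).

```python
from typing import Dict

# Hypothetical excerpt of a /metrics response; values are invented.
SAMPLE_METRICS = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo-model"} 4.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo-model"} 12.0
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{model_name="demo-model"} 0.87
"""


def parse_gauges(text: str) -> Dict[str, float]:
    """Map metric name -> value, skipping comment (# HELP / # TYPE) lines."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; labels come before it.
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(raw_value)
    return values


gauges = parse_gauges(SAMPLE_METRICS)
for name in ("vllm:num_requests_running",
             "vllm:num_requests_waiting",
             "vllm:kv_cache_usage_perc"):
    print(name, gauges[name])
```

In production Prometheus does this scraping for you; a script like this is mainly useful for confirming that the endpoint exposes the names you plan to query.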

Hugging Face TGI: /metrics with queue + request histograms

TGI exposes many production-grade metrics on /metrics, including queue size, request duration, queue duration, and mean time per token.

Notable ones:

tgi_queue_size (gauge)

tgi_request_duration (histogram, e2e...
