Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing - QAInsights
In this blog post, we will see the difference between throughput and goodput, why throughput alone can give you a dangerously false sense of confidence, and how goodput, the metric championed by NVIDIA’s AIPerf tool, tells you the truth about your LLM deployment.
If you have ever shipped a feature that looked perfectly healthy in your monitoring dashboard but fell apart under real user load, this post is for you.
What is Throughput?
Throughput is one of the oldest and most familiar metrics in performance testing. Simply put, it answers the question: how much work can the system do in a given time window?
Depending on the context, throughput is expressed as:
Requests per second (req/s): most common in API and web performance testing
Transactions per second (TPS): common in database and payment system testing
Megabytes per second (MB/s): common in file transfer and network testing
Tokens per second: specific to LLM inference workloads
In a JMeter test report, the throughput number is front and center. In a k6 summary, it shows up as http_reqs. In a Grafana dashboard, it is usually one of the first panels you look at.
Throughput tells you volume. It does not tell you quality.
The Problem with Throughput Alone
Here is a scenario that should feel familiar.
You run a load test. Throughput looks great: 100 req/s. No errors. You ship. Real users start complaining that the app feels sluggish or unresponsive. You go back to your dashboard. Throughput is still 100 req/s. Green across the board.
What happened?
The system was technically completing requests. But a large portion of those requests were taking 4 to 5 seconds to respond instead of the 500ms your users expect. The requests were counted as successful because they returned HTTP 200. Throughput does not care about latency. It just counts completions.
This is the gap. And in traditional web performance testing, experienced engineers close that gap by adding percentile latency checks (p95, p99) as assertions. But in LLM performance testing, the problem is deeper.
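Here is a minimal sketch of that gap in Python. The numbers are made up to mirror the scenario above (a 10-second window, 1,000 requests, all HTTP 200), and the `throughput` and `percentile` helpers are illustrative names, not part of JMeter, k6, or any other tool.

```python
import math


def throughput(num_completed: int, window_seconds: float) -> float:
    """Completed requests per second. Latency plays no part in this number."""
    return num_completed / window_seconds


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 600 requests answered in 0.3 s, 400 answered in 4.5 s.
latencies_s = [0.3] * 600 + [4.5] * 400

rps = throughput(len(latencies_s), window_seconds=10.0)
p95 = percentile(latencies_s, 95)

slo_s = 0.5  # the 500ms users expect
p95_ok = p95 <= slo_s

print(f"throughput = {rps:.0f} req/s, p95 = {p95:.1f} s, p95 within SLO: {p95_ok}")
```

The throughput figure is a healthy 100 req/s, yet the p95 assertion fails, because 40% of the requests took 4.5 seconds.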
The Dosa Stall Analogy
Imagine a busy dosa stall in Coimbatore during the morning rush.
The stall owner proudly says: "We served 100 dosas this hour." That is throughput. 100 dosas per hour.
But here is the real picture:
28 dosas were served cold because the tawa was overcrowded
15 dosas arrived 20 minutes after the order because the batter queue was too long
5 dosas were undercooked
Only 52 dosas were served hot, crispy, and within the 5-minute promise. That is goodput. 52 dosas per hour.
The stall is technically operating at 100 dosas/hour. But only 52 of them actually met the quality standard the customer was promised.
Now imagine this stall is your LLM API, and each dosa is an inference request. The "hot and crispy within 5 minutes" rule is your SLO.
What is Goodput?
Goodput is the number of requests per second that completed and met all your defined SLO constraints.
This definition comes directly from NVIDIA’s AIPerf tool (the successor to GenAI-Perf), which is the industry standard for LLM inference benchmarking. In AIPerf, you define goodput constraints when you run a benchmark:
aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --goodput-ttft 500 \
  --goodput-itl 100
This tells the tool: only count a request toward goodput if:
Time to First Token (TTFT) was under 500ms, AND
Inter-Token Latency (ITL) was under 100ms
A request that completes but violates either constraint does not count. It is a failed request from the user’s perspective, even if the HTTP status code was 200.
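To make the counting rule concrete, here is a small Python sketch. It is not AIPerf's implementation; `RequestResult` and `goodput` are hypothetical names, the sample values are invented, and the strict "under the threshold" comparison is my reading of the rule above (how a given tool treats a value exactly on the boundary is an assumption).

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ok: bool        # request completed successfully (for example, HTTP 200)
    ttft_ms: float  # time to first token
    itl_ms: float   # inter-token latency for the request


def goodput(results: list[RequestResult], duration_s: float,
            max_ttft_ms: float = 500.0, max_itl_ms: float = 100.0) -> float:
    """Requests per second that completed AND met every SLO constraint."""
    good = sum(
        1 for r in results
        if r.ok and r.ttft_ms < max_ttft_ms and r.itl_ms < max_itl_ms
    )
    return good / duration_s


# Hypothetical results collected over a 1-second window.
results = [
    RequestResult(ok=True, ttft_ms=320, itl_ms=40),    # meets both constraints
    RequestResult(ok=True, ttft_ms=480, itl_ms=95),    # meets both constraints
    RequestResult(ok=True, ttft_ms=900, itl_ms=60),    # TTFT too high
    RequestResult(ok=True, ttft_ms=300, itl_ms=140),   # ITL too high
    RequestResult(ok=False, ttft_ms=200, itl_ms=30),   # did not complete
]

throughput_rps = sum(1 for r in results if r.ok) / 1.0
goodput_rps = goodput(results, duration_s=1.0)

print(f"throughput = {throughput_rps} req/s, goodput = {goodput_rps} req/s")
```

Four requests completed, so throughput is 4 req/s, but only two of them met both constraints, so goodput is 2 req/s.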
How Goodput Works in LLM Performance Testing
LLM inference has two latency metrics that users feel directly:
Time to First Token (TTFT) is how long the user waits before they see the first word of the response. This is what makes an LLM feel fast or laggy. A high TTFT means users are staring at a blank screen or a loading spinner.
Inter-Token Latency (ITL) is the delay between each token in the streamed response. A high ITL makes the text appear to stutter or pause mid-sentence, which breaks the feeling of a natural conversation.
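If you log when each streamed token arrives, both metrics fall out of simple subtraction. The sketch below assumes you have the request send time and the token arrival times in milliseconds; the function names and timestamps are hypothetical, and averaging the gaps is just one way to summarise ITL for a request.

```python
def ttft_ms(request_sent_ms: int, token_times_ms: list[int]) -> int:
    """Time from sending the request to receiving the first token."""
    return token_times_ms[0] - request_sent_ms


def inter_token_gaps_ms(token_times_ms: list[int]) -> list[int]:
    """Delay between each pair of consecutive tokens."""
    return [later - earlier
            for earlier, later in zip(token_times_ms, token_times_ms[1:])]


# Hypothetical stream: request sent at t=0, four tokens arrive afterwards.
request_sent_ms = 0
token_times_ms = [420, 460, 500, 640]

first_token = ttft_ms(request_sent_ms, token_times_ms)
gaps = inter_token_gaps_ms(token_times_ms)
mean_itl = sum(gaps) / len(gaps)
worst_gap = max(gaps)

print(f"TTFT = {first_token} ms, mean ITL = {mean_itl:.1f} ms, worst gap = {worst_gap} ms")
```

Notice that the mean ITL here is about 73 ms even though one gap was 140 ms. An average can hide the stutter a user actually sees, so it is worth looking at the worst gaps too.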
Both of these metrics degrade under load. As concurrency increases, the inference server queue backs up. TTFT climbs first, because requests sit waiting to be processed. ITL can follow if GPU compute is saturated.
Throughput can stay stable through all of this. The server is still completing requests; the user experience is just becoming progressively worse.
Goodput captures that degradation directly. When TTFT crosses your SLO threshold, those requests stop contributing to goodput. The goodput number drops visibly, even while throughput holds steady.
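Here is a toy illustration of that divergence in Python. The TTFT samples are invented for three concurrency levels, each with 10 requests completing in a 1-second window; they are not measurements from a real server, and `summarize` is a hypothetical helper.

```python
TTFT_SLO_MS = 500
WINDOW_S = 1.0

# Hypothetical TTFT samples (ms), keyed by concurrency level.
ttft_by_concurrency = {
    10: [200, 220, 250, 240, 210, 230, 260, 205, 215, 225],
    50: [300, 350, 480, 520, 610, 330, 450, 700, 390, 410],
    100: [450, 600, 800, 950, 1200, 480, 700, 1500, 900, 1100],
}


def summarize(ttfts_ms: list[int]) -> tuple[float, float]:
    """Return (throughput, goodput) in req/s for one window of completed requests."""
    throughput = len(ttfts_ms) / WINDOW_S
    goodput = sum(1 for t in ttfts_ms if t < TTFT_SLO_MS) / WINDOW_S
    return throughput, goodput


rows = [(level, *summarize(ttfts)) for level, ttfts in ttft_by_concurrency.items()]

for level, throughput, goodput in rows:
    print(f"concurrency={level:>3}  throughput={throughput:.0f} req/s  goodput={goodput:.0f} req/s")
```

Throughput reads 10 req/s at every level, while goodput falls from 10 to 7 to 2 req/s as more requests miss the 500ms TTFT threshold.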
As I showed in an earlier post, 99% of Requests Failed and My Dashboard Showed...