LLM agent performance is a distributed systems problem

High-performance AI agents are distributed systems

FixBugs Blog · Kirti Rathore · Jun 1, 2026 · 11 min read

A technical case study on making FixBugs faster by treating LLM agents like distributed systems: chunking, streaming, queues, token math, and concurrency limits.

"Codex took 6 hours to implement this seemingly simple refactor."

"I think Research mode on Perplexity is stuck."

We all know LLM APIs are slow, and most of us are content to stare at a spinner while the model slowly emits tokens.

But what happens when you're building AI agents that need to be low latency?

We hit this while building FixBugs, an AI debugging agent that reads bug reports, logs, code, screenshots, traces, and issue comments, then reproduces the bug and finally generates a validated fix. The product has a simple promise: every code change is verified to do only the necessary work to fix the issue.

The implementation is not simple.

Bug reports and their associated logs/metrics/traces can contain too much context for one model call. A repository can have hundreds of files. Logs can be larger than the model's useful context window. The final answer may need thousands of output tokens. And if the agent takes ten minutes to say anything useful, the user assumes it is broken.

Summarization, also referred to as compaction, is the usual way to work with huge context. However, summarization is slow and often loses essential context.

Modern coding agents like Claude Code and Cursor rely heavily on blindly grepping through log files and reading from specific offsets. The effective context window the coding agent is allowed to process at once is smaller than the total context window. GPT 5.5, for example, has a context window of 400K tokens, but its 'input context' is closer to 258K tokens.

Once you step beyond conversational agent loops, other interesting patterns become usable.

You realize the underlying performance engineering problems are similar to those encountered when optimizing large distributed systems.

Scatter-gather. Pipelining. Queues. Backpressure. Streaming. Serializability. These are the problems we spent the most time thinking about.

start with token math

Most agent performance discussions start in the wrong place.

They ask:

Which model is fastest in terms of tokens/sec?

That is a useful question later. The first question is:

How many input tokens and output tokens does this task need?

LLM latency has two different pieces that matter to the user.

Time to first token is how long the user waits before the model starts responding. Token throughput is the rate at which output tokens arrive after that, and it determines how long the full answer takes.

They are not the same problem.

Prefill affects time to first token. Decode affects the stream of output tokens after that.

In the prefill phase, the model processes the input context and prepares the key/value cache used to generate the first new token. In the decode phase, the model generates output tokens one at a time autoregressively.

For a practical agent, a crude mental model is enough:

latency model

$$\text{latency} \approx \text{TTFT} + (\text{output tokens} \times \text{time per token})$$

Input tokens are not free. They hit prefill and therefore time to first token.

prefill cost example

$$20{,}000\ \text{input tokens} \times 0.05\ \text{ms/token} = 1{,}000\ \text{ms} \approx 1\ \text{s TTFT}$$

The constant is model- and provider-specific; the shape of the cost is the useful part.
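The crude latency model can be written as a few lines of Python. This is only a sketch: the function name and parameters are hypothetical, and the defaults (0.05 ms per input token for prefill, 50 output tokens per second for decode) are illustrative constants taken from the examples in this article, not measurements of any particular provider.

```python
def estimate_latency_s(
    input_tokens: int,
    output_tokens: int,
    prefill_ms_per_token: float = 0.05,  # assumed, provider-specific
    decode_tokens_per_s: float = 50.0,   # assumed, provider-specific
) -> tuple[float, float]:
    """Return (ttft_seconds, total_seconds) under the crude model."""
    # Prefill: every input token adds to time to first token.
    ttft_s = input_tokens * prefill_ms_per_token / 1000.0
    # Decode: output tokens are generated one at a time.
    decode_s = output_tokens / decode_tokens_per_s
    return ttft_s, ttft_s + decode_s


if __name__ == "__main__":
    ttft, total = estimate_latency_s(input_tokens=20_000, output_tokens=30_000)
    print(f"TTFT ~{ttft:.1f}s, total ~{total:.0f}s")
```

Even with these rough constants, the decode term is hundreds of times larger than the prefill term for an output-heavy task, which is why counting tokens comes before picking a model.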

But long answers are expensive in a different way. Every output token has to be generated. If your agent asks the model to explain every file in a repository, your user is paying for that decision in wall-clock time.

This matters because debugging agents are usually output-heavy. They do not just answer "yes" or "no." They produce hypotheses, evidence, file rankings, reproduction plans, code diffs, and validation notes.

Output tokens dominate faster than people expect.

the 10-minute file search

The biggest bottleneck in early FixBugs was not repository parsing.

It was asking the LLM which files were relevant to a bug.

The naive version looked reasonable:

Gather the bug context.

Gather the repository files.

Put all relevant context into one prompt.

Ask the model to rank files and explain why.

For a small repo, this works.

For 50 files, it turns into a bad batch job disguised as a chat request.

If the model emits 30,000 output tokens and the endpoint gives you 50 output tokens per second, you are waiting about 600 seconds. Ten minutes. That is before retries, rate limits, or any downstream fix generation.

To get faster performance, we realized we had to use as much parallelism as possible.

file relevance stage

one giant call: 30,000 tokens ÷ 50 tokens/s = 600 s (about 10 minutes before retries or downstream work)

16 independent calls: 16 × 50 tokens/s = 800 tokens/s aggregate (roughly 40 seconds for the demo workload)
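The arithmetic in this comparison can be checked with a small helper. It is a back-of-the-envelope sketch with a hypothetical function name, and it assumes the output tokens split evenly across calls, every call decodes at the same rate, and no rate limits or retries intervene.

```python
def fan_out_seconds(
    total_output_tokens: int,
    calls: int,
    tokens_per_s_per_call: float = 50.0,  # assumed per-call decode rate
) -> float:
    """Estimate decode wall-clock time when work is spread over parallel calls."""
    # Independent calls decode concurrently, so their rates add up.
    aggregate_tokens_per_s = calls * tokens_per_s_per_call
    return total_output_tokens / aggregate_tokens_per_s


if __name__ == "__main__":
    print(fan_out_seconds(30_000, calls=1))   # one giant call
    print(fan_out_seconds(30_000, calls=16))  # 16 independent calls
```

Under those assumptions the 16-call case comes out to 37.5 seconds, which matches the "roughly 40 seconds" figure.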

Each file was decomposed into chunks. Each chunk got its own relevance call. Those calls ran...
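A minimal scatter-gather sketch of this idea, using Python's asyncio. This is not the FixBugs implementation: `score_chunk` is a hypothetical stand-in for a real model call (here it just counts words shared with the bug report), `split_into_chunks` uses an assumed line-based chunk size, and the concurrency limit of 16 is borrowed from the example above.

```python
import asyncio


def split_into_chunks(text: str, max_lines: int = 40) -> list[str]:
    """Split a file into fixed-size line chunks (assumed chunking strategy)."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


async def score_chunk(chunk: str, bug: str) -> float:
    """Hypothetical relevance call; a real agent would call a model API here."""
    await asyncio.sleep(0.01)  # simulates network and decode time
    bug_words = set(bug.lower().split())
    return float(sum(1 for word in chunk.lower().split() if word in bug_words))


async def rank_files(
    files: dict[str, str], bug: str, max_concurrency: int = 16
) -> list[tuple[str, float]]:
    """Scatter one call per chunk, gather the scores, rank files by best chunk."""
    sem = asyncio.Semaphore(max_concurrency)  # caps in-flight calls

    async def bounded(path: str, chunk: str) -> tuple[str, float]:
        async with sem:
            return path, await score_chunk(chunk, bug)

    tasks = [
        bounded(path, chunk)
        for path, text in files.items()
        for chunk in split_into_chunks(text)
    ]
    results = await asyncio.gather(*tasks)

    best = {path: 0.0 for path in files}  # empty files keep a score of 0
    for path, score in results:
        best[path] = max(best[path], score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    demo_files = {
        "auth.py": "def login(user):\n    check token expiry",
        "ui.py": "render button",
    }
    print(asyncio.run(rank_files(demo_files, "login fails after token expiry")))
```

The semaphore is the concurrency limit: without it, a large repository would launch every chunk call at once and run straight into provider rate limits.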
