
Stop generating what you already have | Aazar


2026-06-26
ai, engineering, llm, performance

A teammate pinged me in the morning. They were using a self-hosted LLM I maintain to convert large text documents into structured JSON. Each extraction was taking 42 to 50 seconds. They needed it faster.

The model is a 26B parameter model, AWQ quantized, running on vLLM on a single GPU. Solid setup. Not exotic hardware. The task was straightforward: feed in a long text document, get back structured fields. Names, dates, addresses, section summaries.

42 seconds per document is not a latency problem. It is a design problem. I dug in.

The bottleneck is always output tokens

Everyone optimizes prompt tokens. They trim context, compress system prompts, switch to smaller models. None of this helps much if your problem is output latency.

Here is the math. On a self-hosted vLLM endpoint, input tokens are batch-processed in parallel on the GPU. The entire prompt is consumed in one forward pass. Output tokens are generated auto-regressively, one at a time, each requiring a full forward pass through the model. Input is parallel. Output is serial.

If your extraction prompt takes 900 input tokens and generates 850 output tokens, the input processing takes maybe 200 milliseconds. The output generation takes 40 seconds. You are not waiting on the prompt. You are waiting on generation.
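The arithmetic above can be sketched in a few lines. This is a rough model, not a benchmark: the two throughput figures are assumptions back-derived from the numbers in this post (900 tokens in about 0.2 s, 850 tokens in about 40 s), and `estimate_latency_s` is a hypothetical helper name.

```python
def estimate_latency_s(input_tokens, output_tokens,
                       prefill_tok_per_s, decode_tok_per_s):
    """Return (prefill_seconds, decode_seconds) for a single request.

    Prefill handles the prompt in parallel, so its rate is high.
    Decode emits one token per forward pass, so its rate is low.
    """
    prefill_s = input_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    return prefill_s, decode_s


# Assumed rates, back-derived from the figures in the text:
#   900 tokens / 0.2 s = 4500 tok/s prefill
#   850 tokens / 40 s  = 21.25 tok/s decode
prefill_s, decode_s = estimate_latency_s(
    900, 850, prefill_tok_per_s=4500.0, decode_tok_per_s=21.25
)
print(f"prefill: {prefill_s:.2f} s, decode: {decode_s:.2f} s")
```

With these assumed rates, halving the prompt saves about a tenth of a second. Halving the output saves about twenty seconds.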

What the model was actually doing

I logged the token breakdown for a typical extraction call. The model was generating 854 completion tokens. Of those, roughly 600 were summary text copied verbatim from the input document.
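One way to check for this kind of copying is to measure how much of the completion already appears verbatim in the prompt. The sketch below is not the logging I actually used; it counts characters as a cheap stand-in for tokens, and `copied_fraction` and the sample strings are made up for illustration.

```python
import json


def _iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from _iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from _iter_strings(value)


def copied_fraction(source: str, completion_json: str) -> float:
    """Fraction of string-value characters that occur verbatim in source.

    Characters are only a proxy for tokens, and an exact substring test
    misses near-copies (changed whitespace, fixed typos).
    """
    total = 0
    copied = 0
    for value in _iter_strings(json.loads(completion_json)):
        total += len(value)
        if value in source:
            copied += len(value)
    return copied / total if total else 0.0


# Illustrative data only, not taken from a real extraction.
source = "Jane Doe. Cross-functional team building scalable frontend architecture."
completion = json.dumps({
    "fullname": "Jane Doe",
    "summary": "Cross-functional team building scalable frontend architecture.",
    "seniority": "senior",  # inferred by the model, not present in the source
})
print(f"copied: {copied_fraction(source, completion):.0%}")
```

A high fraction means the model is spending most of its output budget transcribing the input.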

The LLM was acting as a copy machine.

When you ask a model to extract a "summary" field from a document and put it in JSON, it does not summarize. It copies. Word for word. The same text that is already sitting in your prompt gets generated back to you one token at a time. You sent it once as input (fast, batch-parallel). It sends it back as output (slow, serial). That round trip is the entire latency problem.

The insight: ask for pointers, not content

If the model is going to copy text verbatim anyway, stop asking it to copy. Ask it for the location of the text instead.

Instead of:

"summary": "Cross-functional team building scalable frontend architecture with React and TypeScript, collaborating with designers and backend engineers to deliver accessible web applications."

Ask for:

"summary_start": "Cross-functional team",
"summary_end": "accessible web applications."

First 3 words and last 3 words. 12 tokens instead of 300. Then slice the summary from the source document yourself using str.find(). Zero LLM tokens spent on the actual content. The model tells you where the text starts and ends. You do the copying in under 1 millisecond.

This works because the summaries in extraction pipelines are almost always verbatim copies from the source. The model is not generating new content. It is locating existing content and transcribing it. So ask it to locate, not transcribe.
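A minimal sketch of the slicing step with str.find(), assuming the model returns the two anchors exactly as they appear in the source. `slice_between` is a hypothetical helper name. It takes the first match for each anchor, so a repeated or slightly altered anchor will give a wrong or missing result; in that case it makes sense to fall back to a normal extraction call.

```python
from typing import Optional


def slice_between(source: str, start_anchor: str, end_anchor: str) -> Optional[str]:
    """Return the text from start_anchor through end_anchor, or None.

    Uses the first occurrence of start_anchor, then the first occurrence
    of end_anchor at or after that position.
    """
    start = source.find(start_anchor)
    if start == -1:
        return None  # the model returned an anchor that is not in the source
    end = source.find(end_anchor, start)
    if end == -1:
        return None
    return source[start:end + len(end_anchor)]


document = (
    "About\n"
    "Cross-functional team building scalable frontend architecture with "
    "React and TypeScript, collaborating with designers and backend "
    "engineers to deliver accessible web applications.\n"
    "Skills: React, TypeScript"
)

summary = slice_between(
    document, "Cross-functional team", "accessible web applications."
)
print(summary)
```

Returning None instead of raising keeps the caller in control of the fallback path.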

Splitting the call

Once I realized output tokens were the problem, I split the single extraction call into many small parallel calls.

The original approach: one call, one prompt, one massive JSON response with every field including full summary text. 854 output tokens, 42 seconds.

The split approach: 6 scalar extraction calls in parallel (fullname, headline, location, etc.), each generating 2 to 30 tokens. Plus one call to list the section headers found in the document. All 7 calls fire simultaneously and finish in about 3 seconds because the longest output is 30 tokens.

Then a second phase: for each section found in phase 1, one parallel call extracts the metadata plus summary anchors for that section. 5 sections means 5 parallel calls, each generating about 50 tokens. Another 3 seconds.

Total: 6 seconds. 12 calls across two parallel phases (7 in the first, 5 in the second) instead of 1 sequential call.
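The two phases can be sketched with asyncio.gather. This shows the shape of the pipeline, not my production code: `call_llm` is a hypothetical async function you would point at your own endpoint, `fake_llm` is a stub so the example runs on its own, the prompts are placeholders, and the last three names in `SCALAR_FIELDS` are invented to fill out the six.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# fullname, headline, location come from the post; the rest are placeholders.
SCALAR_FIELDS = ["fullname", "headline", "location", "email", "phone", "website"]

LLMCall = Callable[[str], Awaitable[str]]


async def extract(document: str, call_llm: LLMCall) -> Dict[str, dict]:
    # Phase 1: six scalar calls plus one section-listing call, all concurrent.
    scalar_calls = [
        call_llm(f"Extract the {field} from this document:\n{document}")
        for field in SCALAR_FIELDS
    ]
    sections_call = call_llm(
        f"List all section headers, one per line:\n{document}"
    )
    *scalars, section_lines = await asyncio.gather(*scalar_calls, sections_call)
    sections = [line.strip() for line in section_lines.splitlines() if line.strip()]

    # Phase 2: one call per section named in phase 1, all concurrent.
    # Sections are addressed by the name the model listed, not by index.
    details = await asyncio.gather(*(
        call_llm(
            f"For the section titled {name!r}, return its metadata plus "
            f"summary_start and summary_end:\n{document}"
        )
        for name in sections
    ))
    return {
        "scalars": dict(zip(SCALAR_FIELDS, scalars)),
        "sections": dict(zip(sections, details)),
    }


async def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch is runnable."""
    await asyncio.sleep(0.01)
    if prompt.startswith("List all section headers"):
        return "Experience\nEducation"
    return "stub answer"


result = asyncio.run(extract("document text goes here", fake_llm))
print(sorted(result["sections"]))
```

Wall-clock time for each phase is set by its slowest call, so the total is roughly the sum of the two slowest calls rather than the sum of all of them.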

Why two phases instead of flat parallelism

My first attempt split everything flat: one call per field, all in parallel. It ran in 2.2 seconds. The problem was the section-level fields. When you ask the model "extract details for section 4," it sometimes skips sections, duplicates them, or invents ones that do not exist.

The model is reliable at listing what it sees. It is unreliable at counting. "List all sections you can find" produces a clean, complete list every time. "Extract section 4" produces chaos.

So phase 1 asks for the list. Phase 2 uses that list to make targeted extraction calls. The serial dependency between phases costs 3 seconds. The reliability gain is worth it.

Where this applies

This is not specific to document extraction. Any LLM pipeline where the output contains large blocks of text copied from the input has the same problem. Resume parsing, contract analysis, product page extraction, log summarization, meeting transcription. If the model is copying, you are paying serial output token costs for text you already...
