The Infrastructure Behind Making Local LLM Agents Useful

By Hussen Mohammed Ibrahim | May, 2026


The Infrastructure Behind Making Local LLM Agents Actually Useful

Hussen Mohammed Ibrahim

17 min read· 5 hours ago


Running a language model locally sounds straightforward. Download the weights, start the server, and send requests. That works for a chatbot, but it doesn’t automatically work for an agent.

In my case, I’ve been building an agent for automated single-cell RNA-seq analysis. The idea is that, given raw data, the agent can run the full pipeline on its own, deciding which tools to call, reading the results, and working through the analysis step by step.

You might ask why not just use something like Claude Code with a single-cell analysis Skill. The short answer is that for scientific workflows, that’s not quite enough. Skills are ultimately prompts and can thus be overridden or ignored. More importantly, scientific work requires reproducibility and provenance tracking: knowing exactly which parameters were used, which cells were filtered, which clustering resolution produced which result, and so on. That record needs to be structured and persistent, not reconstructed from a conversation. For long-running sessions, you also need explicit world state management rather than relying on context compaction to preserve what matters. These are things you have to build deliberately.

Building all of these on top of a local model also means you own the infrastructure, and that’s what I’m going to be focusing on here. The agent we built runs on institutional HPC hardware using recent open-weight models. It is easy to assume open-weight models are not strong enough for this kind of work, but that is becoming less true. Recent releases like Qwen3.6–27B and Gemma 4–31B are genuinely useful for structured, tool-driven workloads. (If you’re interested in keeping up with how open source is evolving, Interconnects AI has interesting stuff you can follow.) That’s one of the main reasons why local hosting makes sense here.

Our agent also supports cloud APIs like Claude and GPT, but when you use those, all of the infrastructure I’m about to describe is invisible to you.
Someone else has already solved it. When you host the model yourself, those problems become yours.

When I ran the model the first time, it worked in a narrow sense. The model would call tools, the tools would run, and the analysis would move forward. But it wasn’t really usable yet. A simple single-cell analysis could have 50–80 tool calls in a loop. Every call carried the same fixed baggage: the system prompt, the tool schemas, and the growing conversation history. For this agent, the system prompt and tool schemas alone were about 36k tokens. Before the model could decide anything, it first had to read tens of thousands of tokens of instructions and tool definitions. Then it had to do that again on the next iteration. And again on the one after that. Each iteration took 10 to 15 seconds. And a long session would eventually crash out with context overflow errors, taking all the in-memory analysis state with it.

This article is about fixing both of those problems. The first part covers making inference faster through a set of compounding optimizations to the vLLM inference server (an open-source inference engine built for high-throughput LLM serving). The second part covers keeping long sessions alive through better context management and a structured world state that survives trimming. I ran experiments on A100 and H100 GPUs to measure the impact of each change, and those are described below.

Part 1: Making Inference Fast

Before getting into the individual optimizations, it helps to understand what’s actually happening on each iteration of the agent loop. The diagram below shows a single iteration: the agent sends a request containing the system prompt, tool schemas, and the full conversation history to the model. The model reads all of it and decides which tools to call. The tool runs and returns a result, and that result gets appended to the history before the next iteration begins.

Two things are worth noting here.
The fixed prefix, which is the system prompt plus tool schemas, is roughly 36k tokens and gets sent on every single call. And the conversation history grows with every iteration. By iteration 40, the model is no longer reading a short instruction. It’s reading a long analysis transcript with many tool calls, tool outputs, intermediate results, and so on. Both of these things affect the performance of the agent.
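To make the cost of that loop concrete, here is a rough sketch of how many prompt tokens the model has to read per call when nothing is cached or trimmed. The 36k fixed prefix comes from the numbers above. The 500 tokens added to the history per iteration is purely an illustrative assumption, not a measurement, and the function and class names are hypothetical.

```python
from dataclasses import dataclass

# System prompt + tool schemas, as reported in the article (~36k tokens).
FIXED_PREFIX_TOKENS = 36_000


@dataclass
class IterationCost:
    iteration: int
    prompt_tokens: int  # tokens the model must read on this call


def simulate_agent_loop(num_iterations: int, tokens_added_per_iteration: int) -> list[IterationCost]:
    """Estimate prompt size per call with no caching and no history trimming."""
    costs = []
    history_tokens = 0
    for i in range(1, num_iterations + 1):
        # Every request carries the fixed prefix plus the full history so far.
        costs.append(IterationCost(i, FIXED_PREFIX_TOKENS + history_tokens))
        # The tool call and its result are appended before the next iteration.
        history_tokens += tokens_added_per_iteration
    return costs


# Assumed values: 50 tool calls, 500 new history tokens per iteration.
costs = simulate_agent_loop(num_iterations=50, tokens_added_per_iteration=500)
total_prompt = sum(c.prompt_tokens for c in costs)
prefix_total = FIXED_PREFIX_TOKENS * len(costs)

print("first call:", costs[0].prompt_tokens)
print("iteration 40:", costs[39].prompt_tokens)
print("total prompt tokens read:", total_prompt)
print("of which repeated fixed prefix:", prefix_total)
```

Under these assumptions, a 50-call session reads about 2.4 million prompt tokens in total, and 1.8 million of them are the same fixed prefix sent again and again. The exact figures depend entirely on how large the tool outputs are, but the shape of the problem is the same: the repeated prefix dominates early, and the growing history takes over later.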

Figure 1: One iteration of the agent loop. The fixed prefix repeats on every call, and the conversation history grows with each iteration.

1.1 CUDA Graphs: Reducing Hundreds of Instructions Per Token to One

To understand this one, it helps to know what happens inside a GPU when...
