
Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon LLM Agents | vLLM Blog


Long-horizon LLM agents create a routing problem that single-turn prompt routers were not designed to solve. A router still needs to know which model is best for the current request, but it also needs to know when switching models would break the session.

This post introduces Session-Aware Agentic Routing (SAAR), a session-aware model selection policy in vLLM Semantic Router. SAAR keeps semantic routing, but adds router-owned session memory, hard locks around tool loops and non-portable provider state, safe reset boundaries, prefix-cache-aware switch pricing, and replayable traces.

Across 21,600 deterministic turns, SAAR cuts model switches by 79.29%, eliminates 3,836 unsafe switches, and reduces estimated physical-model cost by 78.71%. Across 2,896 live AMD ROCm requests, it preserves session continuity with 0 observed violations.

Figure 1: Long-horizon agents need routing decisions that understand the session trajectory, not only the latest prompt.

From Prompt Routing To Session Routing

vLLM Semantic Router started from a simple systems observation: not every request should take the same path through an inference stack. A short factual question, a security-sensitive prompt, a multimodal request, a hard reasoning task, and a domain-specific query may all deserve different treatment.

The first generation of that idea was prompt routing. The router extracted signals from the current request, matched a routing decision, and selected an appropriate path. Iris made those signals composable. Athena made the router more strategic by expanding model selection, memory, replay, long-context signals, multimodal primitives, and AMD ROCm deployment paths.

Agents change the unit of routing again.

A coding or research agent is not one prompt. It is a session. It plans, calls tools, receives tool outputs, edits files, runs tests, recovers from errors, pauses, resumes, and often sends very short follow-up messages such as "continue", "fix it", "run that again", or "use the previous result." Those turns are meaningful only because of the trajectory that came before them.

That is why this milestone matters for Semantic Router. The router is no longer answering only:

Which model should handle this request?

For agent traffic, the router also has to answer:

Is it safe to switch models inside this session right now?

That second question is what SAAR is designed to handle.

Why Single-Turn Routing Breaks Down For Agents

Single-turn routing can be locally correct and still be wrong for the session.

Consider a typical tool-using agent loop:

| Turn | What the client sends | What a prompt router sees | What a session router must remember |
|---|---|---|---|
| 1 | "Refactor this module and run the tests." | A coding task | The session has started on a physical model |
| 2 | The model emits a tool call | A model response | The next tool result belongs to the same model |
| 3 | The client sends the tool result | A terse observation | The model that asked for the tool should receive the result |
| 4 | The user says "fix the failing case" | A short follow-up | The instruction depends on prior code, test output, and routing state |
| 5 | The session idles and resumes later | A new short message | The router can reconsider whether the old model is still worth holding |

The latest message alone does not contain enough information. A prompt router may decide that the tool result looks cheap and send it to a smaller model. It may see a generic "continue" and re-run the normal selector. It may miss that provider-managed continuation state belongs to one physical backend. It may discard a warm prefix cache for a frontier model because the current message is short.

Each of those mistakes has a different failure mode:

A tool result can go to a model that did not make the tool call.

A non-portable continuation id can be sent to the wrong physical backend.

A long, warm session can lose prefix locality and become unnecessarily expensive.

A logical model such as auto can become hard to debug because users no longer know which physical model actually served the turn.
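
The first two failure modes can be illustrated with a minimal sketch of a hard-lock check. Everything here is an assumption made for illustration: the `SessionState` fields, the `locked_model` and `route_turn` helpers, and the model names are hypothetical and are not taken from the Semantic Router codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    """Hypothetical router-owned memory for one agent session."""
    current_model: str
    # Model that emitted a tool call whose result has not come back yet.
    pending_tool_call_model: Optional[str] = None
    # Physical backend that owns a non-portable continuation id, if any.
    continuation_backend: Optional[str] = None


def locked_model(state: SessionState) -> Optional[str]:
    """Return the model that must receive the next turn, or None if unlocked."""
    if state.pending_tool_call_model is not None:
        # A tool result should go back to the model that asked for the tool.
        return state.pending_tool_call_model
    if state.continuation_backend is not None:
        # Provider-managed state cannot move to another physical backend.
        return state.continuation_backend
    return None


def route_turn(state: SessionState, proposed_model: str) -> str:
    """Honor the per-turn proposal only when no hard lock is active."""
    owner = locked_model(state)
    return owner if owner is not None else proposed_model
```

With a pending tool call recorded against a large model, `route_turn` returns that model even if the prompt-level selector proposes a smaller one for the terse tool result.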

The important point is not that agents should never switch models. They should. A good router should still move from a cheap model to a stronger model when the task becomes harder, and it should move back when the session reaches a safe boundary. The problem is that the router needs session context to know which moments are safe.
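
One reason a short message is not automatically a cheap moment to switch is the warm prefix. The sketch below makes that arithmetic concrete under stated assumptions: cached prefix tokens are charged at a fraction of the uncached price, a switch starts with a cold cache, and all prices and token counts are made-up illustrative numbers, not measurements from the post. The function names are hypothetical.

```python
def input_cost(prompt_tokens: int, cached_tokens: int,
               price_per_token: float, cached_discount: float) -> float:
    """Input-side cost of one turn; cached tokens are billed at a discount."""
    cached = min(cached_tokens, prompt_tokens)
    uncached = prompt_tokens - cached
    return price_per_token * (uncached + cached * cached_discount)


def switch_saves_money(prompt_tokens: int, warm_prefix_tokens: int,
                       stay_price: float, switch_price: float,
                       cached_discount: float) -> bool:
    """Compare staying on a warm model against moving to a cold one."""
    stay = input_cost(prompt_tokens, warm_prefix_tokens,
                      stay_price, cached_discount)
    # Assumption: the target model has no cached prefix for this session.
    move = input_cost(prompt_tokens, 0, switch_price, cached_discount)
    return move < stay


# Illustrative numbers only: a 40,000-token context where all but the
# latest 100 tokens are already cached on the current, pricier model.
warm = switch_saves_money(40_000, 39_900, stay_price=10.0,
                          switch_price=2.0, cached_discount=0.1)
cold = switch_saves_money(40_000, 0, stay_price=10.0,
                          switch_price=2.0, cached_discount=0.1)
```

Under these assumed numbers, `warm` is `False` and `cold` is `True`: the nominally cheaper model costs more for the turn while the prefix is warm, and becomes the cheaper choice only once the cache is gone.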

The SAAR Design

SAAR keeps the existing Semantic Router decision pipeline. Signals are still extracted from the request, decisions are still matched, and model-selection algorithms still rank candidate models inside a matched decision.

SAAR adds a session-control layer around that result.

Figure 2: SAAR combines router memory, hard locks, reset boundaries, switch economics, and replayable traces before selecting a physical model.

There are five pieces:

| Piece | What it stores or decides | Why it... |
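
As a rough illustration of how a session-control layer could wrap an existing selector, here is a minimal sketch. It is not the SAAR implementation: the `Session` fields, the reason strings, the order of the checks, and the `switch_is_worth_it` callback are all assumptions chosen to mirror the five pieces named above (memory, hard locks, reset boundaries, switch economics, traces).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Session:
    """Hypothetical session memory; field names are illustrative."""
    current_model: str = ""
    hard_locked: bool = False        # e.g. inside a tool loop
    at_reset_boundary: bool = False  # e.g. resumed after idling
    trace: List[Dict[str, str]] = field(default_factory=list)


def session_aware_select(
    base_select: Callable[[str], str],
    prompt: str,
    session: Session,
    switch_is_worth_it: Callable[[str, str], bool],
) -> str:
    """Run the per-request selector, then apply session-level control."""
    proposed = base_select(prompt)  # existing prompt-level decision
    if not session.current_model:
        chosen, reason = proposed, "session_start"
    elif proposed == session.current_model:
        chosen, reason = proposed, "no_change"
    elif session.hard_locked:
        chosen, reason = session.current_model, "hard_lock"
    elif session.at_reset_boundary:
        chosen, reason = proposed, "reset_boundary"
    elif switch_is_worth_it(session.current_model, proposed):
        chosen, reason = proposed, "switch_priced_in"
    else:
        chosen, reason = session.current_model, "switch_too_costly"
    # Record every decision so the session can be replayed later.
    session.trace.append(
        {"proposed": proposed, "chosen": chosen, "reason": reason}
    )
    session.current_model = chosen
    return chosen
```

The base selector stays untouched in this sketch, and the wrapper only decides whether its proposal is allowed to take effect on this turn. Each trace entry keeps both the proposed and the chosen model.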
