Micro-Agent: Beat Frontier Models with Collaboration Inside Model API


Micro-Agent: Beat Frontier Models with Collaboration inside Model API | vLLM Blog


Everyone is watching for the next frontier model.

The more interesting layer may be the one in front of it.

Routers are becoming the control plane for AI inference. Their first role was practical: route the right request to the right model. That already matters because production AI is no longer a one-model world.

A router can cut cost by deciding when a request deserves a frontier model and when an open-source or local model is enough. It can make safety policy executable by sending sensitive domains to stricter models, stricter filters, or stronger review paths. It can coordinate cloud and edge, keeping private or low-latency intent local while escalating harder work to the cloud.

Those are important jobs.

But the next router job is more interesting:

A router can make the model better.

Not by changing weights. Not by asking every application to build a bespoke agent graph. By turning one model API call into a bounded collaboration inside the serving layer.

Figure 1: The router is moving from model selection to capability construction.

This is why Sakana Fugu landed so loudly: it made a commercial product out of a simple but powerful idea, that a "model" can be a surface, and behind that surface can be a team. The research around this idea, including the Fugu technical report and coordination papers such as Conductor and Trinity, gives useful language for thinking about orchestration.

But the vLLM Semantic Router vision is different in where it puts the abstraction. Collaboration should not live only inside one commercial endpoint or one application-specific agent graph. It should become an open serving primitive.

vLLM Semantic Router brings that idea into the open serving layer. The user still calls one model:

```json
{
  "model": "vllm-sr/auto",
  "messages": [{"role": "user", "content": "..."}]
}
```

Behind that stable model identity, the router can select a recipe, fan out to workers, collect a quorum, verify disagreement, synthesize a final answer, repair the output contract, and return one normal OpenAI-compatible response.
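To make the "one normal response" point concrete, here is a minimal sketch of wrapping a synthesized answer in the OpenAI chat completion response shape. The helper name `as_chat_completion` is hypothetical, and echoing `vllm-sr/auto` in the `model` field is an assumption, not a confirmed router behavior.

```python
import time
import uuid


def as_chat_completion(model: str, content: str) -> dict:
    """Wrap a final answer in the OpenAI chat.completion response shape."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        # Assumption: the stable model identity is echoed, not a worker name.
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "stop",
            }
        ],
    }


# However many workers ran behind the surface, the caller sees one choice.
response = as_chat_completion("vllm-sr/auto", "final synthesized answer")
```

Nothing about the internal collaboration leaks into this payload, which is what lets existing OpenAI-compatible clients stay unchanged.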

The point is not to expose complexity.

The point is to make collaboration feel like a model.

The Looper Is the Runtime

In vLLM Semantic Router, the looper is the execution runtime for bounded micro-agents.

A request enters the router as an ordinary chat completion. The router extracts signals, projects them into task-shape or risk bands, matches a decision, and then chooses an algorithm. That algorithm may be a normal single-model route, or it may be a looper route.
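The request flow above can be sketched as a small pipeline. This is an illustration only: the `Signals` and `Decision` types, the keyword lists, the band names, and the decision names are all hypothetical, and the real router's signal extraction is certainly richer than keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    domain: str  # hypothetical coarse domain label


@dataclass(frozen=True)
class Decision:
    name: str
    algorithm: str  # "single" or a looper pattern such as "confidence"


def extract_signals(prompt: str) -> Signals:
    """Toy signal extraction: keyword matching stands in for real classifiers."""
    text = prompt.lower()
    if any(word in text for word in ("prove", "integral", "theorem")):
        return Signals(domain="math")
    if any(word in text for word in ("diagnosis", "dosage")):
        return Signals(domain="medical")
    return Signals(domain="general")


def project_bands(signals: Signals) -> dict[str, str]:
    """Project raw signals into task-shape and risk bands."""
    return {
        "task_shape": "reasoning" if signals.domain == "math" else "chat",
        "risk": "high" if signals.domain == "medical" else "low",
    }


def match_decision(bands: dict[str, str]) -> Decision:
    """First matching rule wins; the last rule is the single-model default."""
    if bands["risk"] == "high":
        return Decision("sensitive", "ratings")
    if bands["task_shape"] == "reasoning":
        return Decision("hard_reasoning", "confidence")
    return Decision("default", "single")


def route(prompt: str) -> Decision:
    return match_decision(project_bands(extract_signals(prompt)))


if __name__ == "__main__":
    for prompt in ("hello there", "Prove the theorem about primes"):
        print(prompt, "->", route(prompt))
```

Keeping the bands as an explicit intermediate step means the decision rules stay readable even when the signal extraction changes.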

Today, the main looper patterns are:

Confidence: a sequential escalation loop. It tries a cheaper candidate first, measures confidence, and escalates only when the score is too low.

Ratings: a bounded fan-out loop. It runs multiple candidates under a hard concurrency cap and aggregates them with rating-aware weights.

ReMoM: repeated mixture-of-model reasoning. It fans out breadth samples, waits for enough successful responses, and runs a final synthesis round.

Fusion: a panel-judge-final pattern. Independent model responses become evidence for a judge and finalizer.

Workflows: a micro-agent workflow runtime. It supports static roles or a dynamic planner, executes bounded worker steps, and synthesizes a final response.
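As one illustration of these patterns, the ReMoM shape (fan out, wait for enough successes, then synthesize) might look like the sketch below. It assumes workers are plain async callables; `collect_quorum`, `synthesize`, and the demo workers are hypothetical names, and the string join stands in for what would be a final model call.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Worker = Callable[[str], Awaitable[str]]


async def collect_quorum(workers: Sequence[Worker], prompt: str, quorum: int) -> list[str]:
    """Fan out to all workers and stop once `quorum` of them have succeeded."""
    tasks = [asyncio.ensure_future(worker(prompt)) for worker in workers]
    successes: list[str] = []
    try:
        for next_done in asyncio.as_completed(tasks):
            try:
                successes.append(await next_done)
            except Exception:
                continue  # a failed worker does not count toward the quorum
            if len(successes) >= quorum:
                break
    finally:
        # Cancel stragglers so the request does not keep spending budget.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    if len(successes) < quorum:
        raise RuntimeError(f"quorum not met: {len(successes)}/{quorum}")
    return successes


def synthesize(samples: Sequence[str]) -> str:
    """Stand-in for the final synthesis round (a model call in practice)."""
    return " | ".join(sorted(samples))


async def remom(workers: Sequence[Worker], prompt: str, quorum: int) -> str:
    return synthesize(await collect_quorum(workers, prompt, quorum))


async def fast_a(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return "a"


async def fast_b(prompt: str) -> str:
    await asyncio.sleep(0.02)
    return "b"


async def broken(prompt: str) -> str:
    raise RuntimeError("worker failed")


async def slow(prompt: str) -> str:
    await asyncio.sleep(5)
    return "slow"


if __name__ == "__main__":
    print(asyncio.run(remom([fast_a, fast_b, broken, slow], "q", quorum=2)))
```

The slow worker is cancelled as soon as the quorum is reached, so the request latency tracks the second-fastest success rather than the slowest sample.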

Figure 2: Looper algorithms run inside the router while preserving the model API surface.

The implementation details matter. A looper is not a slogan for "ask more models." It is a small runtime with budget, topology, trace, and failure policy.

Confidence: spend escalation only on hard cases

Confidence is the cost-aware loop. It starts with a smaller or cheaper candidate, then evaluates whether the answer is confident enough to stop. The confidence signal can come from token-level log probability, logprob margin, a hybrid score, self-verification, or an AutoMix-style entailment verifier.

If the score passes the threshold, the router returns immediately. If the score is too low, the route escalates to the next candidate. The important part is not that escalation exists. It is that escalation becomes explicit router policy: thresholds, failure behavior, and stopping conditions are visible and tunable.
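The escalation loop described above can be sketched in a few lines. This version assumes the simplest of the listed signals, token-level log probability, scored as the geometric mean of token probabilities. The candidate callables, the function names, and the choice to return the last answer when no candidate passes are all illustrative assumptions.

```python
import math
from typing import Callable, Sequence

# A candidate returns (answer_text, per-token log probabilities).
Candidate = Callable[[str], tuple[str, list[float]]]


def mean_token_probability(logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp(mean(logprob))."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


def confidence_route(
    prompt: str,
    candidates: Sequence[tuple[str, Candidate]],  # ordered cheapest first
    threshold: float,
) -> tuple[str, list[tuple[str, float]]]:
    """Try candidates in order and stop at the first confident answer."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    trace: list[tuple[str, float]] = []
    answer = ""
    for name, call in candidates:
        answer, logprobs = call(prompt)
        score = mean_token_probability(logprobs)
        trace.append((name, score))
        if score >= threshold:
            break  # confident enough: no further escalation
    # Assumed failure behavior: if nobody passes, keep the last answer.
    return answer, trace


def small_model(prompt: str) -> tuple[str, list[float]]:
    return "small answer", [-1.2, -0.9]  # hesitant tokens


def large_model(prompt: str) -> tuple[str, list[float]]:
    return "large answer", [-0.05, -0.02]  # confident tokens


if __name__ == "__main__":
    chain = [("small", small_model), ("large", large_model)]
    print(confidence_route("q", chain, threshold=0.8))
```

The returned trace is what makes the policy tunable: it records which candidates ran and how each scored against the threshold.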

Figure 3: Confidence turns escalation into a measured stopping policy.

Ratings: parallel quality under a hard cap

Ratings is the controlled ensemble loop. It launches several candidates in parallel, but only up to a configured max_concurrent cap. That makes it useful when a route should benefit from multiple model views without turning every request into an unbounded fan-out.
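A capped fan-out of this kind can be sketched with a semaphore. Only the `max_concurrent` name comes from the text above. The aggregation rule shown here, summing ratings per distinct answer and picking the highest total, is an assumption chosen for illustration, as are the function and worker names.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Sequence

Worker = Callable[[str], Awaitable[str]]


async def ratings_route(
    prompt: str,
    candidates: Sequence[tuple[float, Worker]],  # (rating, worker)
    max_concurrent: int,
) -> tuple[str, int]:
    """Run all candidates with at most `max_concurrent` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # recorded only to show that the cap holds

    async def run_one(rating: float, worker: Worker) -> tuple[float, str]:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return rating, await worker(prompt)
            finally:
                in_flight -= 1

    results = await asyncio.gather(
        *(run_one(rating, worker) for rating, worker in candidates),
        return_exceptions=True,
    )

    # Assumed aggregation: each successful answer earns its worker's rating.
    votes: dict[str, float] = defaultdict(float)
    for result in results:
        if isinstance(result, BaseException):
            continue  # failed candidates contribute nothing
        rating, answer = result
        votes[answer] += rating
    if not votes:
        raise RuntimeError("all candidates failed")
    return max(votes, key=lambda answer: votes[answer]), peak


def fixed_answer(text: str) -> Worker:
    async def worker(prompt: str) -> str:
        await asyncio.sleep(0.01)
        return text
    return worker


async def failing(prompt: str) -> str:
    raise RuntimeError("worker failed")


demo_candidates = [
    (0.9, fixed_answer("42")),
    (0.5, fixed_answer("41")),
    (0.6, fixed_answer("41")),
    (0.3, failing),
]

if __name__ == "__main__":
    print(asyncio.run(ratings_route("q", demo_candidates, max_concurrent=2)))
```

In the demo, two mid-rated workers that agree outweigh one higher-rated worker that disagrees, and the recorded peak never exceeds the cap.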

The router collects successful responses, applies rating-aware aggregation, and handles failures according to the route policy. In practice, Ratings is a good fit...
