Micro-Agent: Beat Frontier Models with Collaboration Inside Model API

Micro-Agent: Beat Frontier Models with Collaboration inside Model API | vLLM Blog

Everyone is watching for the next frontier model.

The more interesting layer may be the one in front of it.

Routers are becoming the control plane for AI inference. Their first role was practical: route the right request to the right model. That already matters because production AI is no longer a one-model world.

A router can cut cost by deciding when a request deserves a frontier model and when an open-source or local model is enough. It can make safety policy executable by sending sensitive domains to stricter models, stricter filters, or stronger review paths. It can coordinate cloud and edge, keeping private or low-latency intent local while escalating harder work to the cloud.

Those are important jobs.

But the next router job is more interesting:

A router can make the model better.

Not by changing weights. Not by asking every application to build a bespoke agent graph. By turning one model API call into a bounded collaboration inside the serving layer.

Figure 1: The router is moving from model selection to capability construction.

This is why Sakana Fugu landed so loudly: it made a commercial product out of a simple but powerful idea, that a "model" can be a surface, and behind that surface can be a team. The research around this idea, including the Fugu technical report and coordination papers such as Conductor and Trinity, gives useful language for thinking about orchestration.

But the vLLM Semantic Router vision is different in where it puts the abstraction. Collaboration should not live only inside one commercial endpoint or one application-specific agent graph. It should become an open serving primitive.

vLLM Semantic Router brings that idea into the open serving layer. The user still calls one model:

```json
{
  "model": "vllm-sr/auto",
  "messages": [{"role": "user", "content": "..."}]
}
```

Behind that stable model identity, the router can select a recipe, fan out to workers, collect a quorum, verify disagreement, synthesize a final answer, repair the output contract, and return one normal OpenAI-compatible response.
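From the client's side, none of that machinery is visible. The sketch below only illustrates the request and response shapes: `build_request` and `extract_answer` are hypothetical helper names, and the sample response is hand-written here to show the standard OpenAI-compatible chat completion layout, not captured from a real router.

```python
import json


def build_request(prompt: str, model: str = "vllm-sr/auto") -> str:
    """Serialize an OpenAI-style chat completion request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(body)


def extract_answer(response_json: str) -> str:
    """Read the assistant text from an OpenAI-compatible response."""
    response = json.loads(response_json)
    return response["choices"][0]["message"]["content"]


# Hand-written illustrative response (an assumption, not real router output).
sample_response = json.dumps({
    "object": "chat.completion",
    "model": "vllm-sr/auto",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hello from the router."},
        "finish_reason": "stop",
    }],
})

request_body = build_request("Say hello.")
answer = extract_answer(sample_response)
```

Whether the router answered with one model or a whole panel, the caller reads the same single `choices[0].message.content` field.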

The point is not to expose complexity.

The point is to make collaboration feel like a model.

The Looper Is the Runtime

In vLLM Semantic Router, the looper is the execution runtime for bounded micro-agents.

A request enters the router as an ordinary chat completion. The router extracts signals, projects them into task-shape or risk bands, matches a decision, and then chooses an algorithm. That algorithm may be a normal single-model route, or it may be a looper route.
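To make that flow concrete, here is a toy sketch of the signal, band, decision, and algorithm steps. Everything in it is an assumption for illustration: the `Signals` fields, the keyword heuristics, the band names, and the decision table are invented here and are not the router's actual signal extractors or configuration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    domain: str
    prompt_words: int
    needs_reasoning: bool


def extract_signals(prompt: str) -> Signals:
    """Toy keyword-based extraction; a real router would use classifiers."""
    words = re.findall(r"[a-z]+", prompt.lower())
    is_math = any(w in {"prove", "integral", "solve"} for w in words)
    return Signals(
        domain="math" if is_math else "general",
        prompt_words=len(words),
        needs_reasoning=("prove" in words or "why" in words),
    )


def project_band(signals: Signals) -> str:
    """Project raw signals into a coarse task-shape band."""
    if signals.needs_reasoning or signals.prompt_words > 200:
        return "hard"
    return "easy"


# Hypothetical decision table: (decision name, predicate, algorithm).
DECISIONS = [
    ("hard_math", lambda s, band: s.domain == "math" and band == "hard", "remom"),
    ("hard_general", lambda s, band: band == "hard", "confidence"),
]
DEFAULT_ALGORITHM = "single_model"


def choose_algorithm(prompt: str) -> tuple:
    """Return (decision name, algorithm) for the first matching decision."""
    signals = extract_signals(prompt)
    band = project_band(signals)
    for name, matches, algorithm in DECISIONS:
        if matches(signals, band):
            return name, algorithm
    return "default", DEFAULT_ALGORITHM
```

The first matching decision wins, and anything unmatched falls through to an ordinary single-model route, which mirrors the idea that a looper is opt-in per decision rather than the default path.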

Today, the main looper patterns are:

Confidence: a sequential escalation loop. It tries a cheaper candidate first, measures confidence, and escalates only when the score is too low.

Ratings: a bounded fan-out loop. It runs multiple candidates under a hard concurrency cap and aggregates them with rating-aware weights.

ReMoM: repeated mixture-of-model reasoning. It fans out breadth samples, waits for enough successful responses, and runs a final synthesis round.

Fusion: a panel-judge-final pattern. Independent model responses become evidence for a judge and finalizer.

Workflows: a micro-agent workflow runtime. It supports static roles or a dynamic planner, executes bounded worker steps, and synthesizes a final response.

Figure 2: Looper algorithms run inside the router while preserving the model API surface.

The implementation details matter. A looper is not a slogan for "ask more models." It is a small runtime with budget, topology, trace, and failure policy.
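As one illustration of what "budget and failure policy" can mean, the sketch below follows the ReMoM shape described above: fan out, wait for a quorum of successful responses, then synthesize. It is a minimal sketch under stated assumptions. The function name `run_remom_sketch`, the `Worker` and `Synthesizer` callables, and the majority-vote stub standing in for the synthesis round are all hypothetical, and it needs Python 3.9+ for `cancel_futures`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout
from typing import Callable, List

Worker = Callable[[str], str]
Synthesizer = Callable[[str, List[str]], str]


def run_remom_sketch(prompt: str, workers: List[Worker], quorum: int,
                     synthesize: Synthesizer, timeout_s: float = 5.0) -> str:
    """Fan out to all workers, stop at `quorum` successes, then synthesize."""
    drafts: List[str] = []
    pool = ThreadPoolExecutor(max_workers=len(workers))
    try:
        futures = [pool.submit(worker, prompt) for worker in workers]
        try:
            for future in as_completed(futures, timeout=timeout_s):
                try:
                    drafts.append(future.result())
                except Exception:
                    continue  # a failed worker does not count toward quorum
                if len(drafts) >= quorum:
                    break  # enough evidence; do not wait for stragglers
        except FuturesTimeout:
            pass  # time budget exhausted; fall through to the quorum check
    finally:
        # Drop queued work; already-running workers finish in the background.
        pool.shutdown(wait=False, cancel_futures=True)
    if len(drafts) < quorum:
        raise RuntimeError(f"quorum not met: {len(drafts)}/{quorum}")
    return synthesize(prompt, drafts)


# Stub workers and synthesizer standing in for real model calls.
def steady(prompt: str) -> str:
    return "4"


def broken(prompt: str) -> str:
    raise ConnectionError("worker unavailable")


def majority(prompt: str, drafts: List[str]) -> str:
    return Counter(drafts).most_common(1)[0][0]


final = run_remom_sketch("2 + 2 = ?", [steady, broken, steady],
                         quorum=2, synthesize=majority)
```

The failure policy here is deliberately strict: if the quorum is not met inside the time budget, the route raises instead of silently synthesizing from too little evidence. A real route might instead fall back to a single model.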

Confidence: spend escalation only on hard cases

Confidence is the cost-aware loop. It starts with a smaller or cheaper candidate, then evaluates whether the answer is confident enough to stop. The confidence signal can come from token-level log probability, logprob margin, a hybrid score, self-verification, or an AutoMix-style entailment verifier.

If the score passes the threshold, the router returns immediately. If the score is too low, the route escalates to the next candidate. The important part is not that escalation exists. It is that escalation becomes explicit router policy: thresholds, failure behavior, and stopping conditions are visible and tunable.
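A minimal sketch of that stopping policy, assuming the confidence signal is the mean token log probability (one of the options listed above). The `Candidate` type, the function names, and the rule that the last candidate always returns are assumptions made for this example, not the router's actual implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    name: str
    # Hypothetical generate hook: returns (answer text, per-token logprobs).
    generate: Callable[[str], Tuple[str, List[float]]]


def mean_logprob_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability in [0, 1]; 0.0 if no tokens."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def run_confidence_loop(prompt: str, candidates: Sequence[Candidate],
                        threshold: float) -> Tuple[str, str, float]:
    """Try candidates cheapest-first; stop at the first confident answer."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    for index, candidate in enumerate(candidates):
        text, logprobs = candidate.generate(prompt)
        score = mean_logprob_confidence(logprobs)
        is_last = index == len(candidates) - 1
        # Assumed policy: the last candidate's answer is returned regardless.
        if score >= threshold or is_last:
            return candidate.name, text, score
    raise AssertionError("unreachable: the last candidate always returns")


# Stub candidates with hand-picked logprobs for illustration.
small = Candidate("small", lambda p: ("maybe 5", [-1.2, -0.9]))
large = Candidate("large", lambda p: ("4", [-0.05, -0.02]))

picked, answer, score = run_confidence_loop("2 + 2 = ?", [small, large], threshold=0.8)
```

With these stub values the small model scores about 0.35, below the 0.8 threshold, so the loop escalates once and returns the large model's answer. Lowering the threshold to 0.3 would stop at the small model, which is the tunable trade-off between cost and quality.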

Figure 3: Confidence turns escalation into a measured stopping policy.

Ratings: parallel quality under a hard cap

Ratings is the controlled ensemble loop. It launches several candidates in parallel, but only up to a configured max_concurrent cap. That makes it useful when a route should benefit from multiple model views without turning every request into an unbounded fan-out.

The router collects successful responses, applies rating-aware aggregation, and handles failures according to the route policy. In practice, Ratings is a good fit...
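The bounded fan-out can be sketched with a semaphore. This is an illustration under assumptions, not the router's aggregation logic: `run_ratings_sketch` is a hypothetical name, ratings are plain floats, and "rating-aware aggregation" is approximated here by a rating-weighted vote over identical answers, with failed candidates simply skipped.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, List, Tuple

# (candidate name, rating weight, async model call); all hypothetical.
RatedCandidate = Tuple[str, float, Callable[[str], Awaitable[str]]]


async def run_ratings_sketch(prompt: str, candidates: List[RatedCandidate],
                             max_concurrent: int) -> str:
    """Run candidates with at most `max_concurrent` in flight, then vote."""
    gate = asyncio.Semaphore(max_concurrent)

    async def call(rating: float, fn: Callable[[str], Awaitable[str]]):
        async with gate:  # hard cap on concurrent model calls
            return rating, await fn(prompt)

    results = await asyncio.gather(
        *(call(rating, fn) for _, rating, fn in candidates),
        return_exceptions=True,
    )

    weights = defaultdict(float)
    for result in results:
        if isinstance(result, BaseException):
            continue  # assumed policy: failed candidates are ignored
        rating, answer = result
        weights[answer] += rating
    if not weights:
        raise RuntimeError("all candidates failed")
    return max(weights, key=weights.get)


# Stub candidates standing in for real model calls.
async def says_a(prompt: str) -> str:
    return "A"


async def says_b(prompt: str) -> str:
    return "B"


async def fails(prompt: str) -> str:
    raise TimeoutError("candidate timed out")


panel = [("m1", 0.9, says_a), ("m2", 0.5, says_b),
         ("m3", 0.5, says_b), ("m4", 0.7, fails)]
winner = asyncio.run(run_ratings_sketch("pick one", panel, max_concurrent=2))
```

In this toy panel the two lower-rated candidates agree on "B" and together outweigh the single higher-rated "A" (1.0 versus 0.9), while the failed candidate contributes nothing.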
