The Self-Evolving Model Router



WHITE PAPER · v1.0 · May 2026 · VDF-WP-2026-002
The Self-Evolving Model Router

A composable, six-tier dispatch architecture that turns model selection from a static configuration into a continuously learning decision, combining policy enforcement, prompt-aware retrieval, rule-based filtering, predictive re-ranking, contextual bandits, and challenger exploration under a single, gracefully degrading routing surface.

Authors: VDF AI Research Team
Read time: 20 min
License: CC BY 4.0


ABSTRACT
Enterprise dispatch of large language models has historically been a configuration decision: operators bind a model to a workload and live with the choice. Real fleets, however, are non-stationary. Provider quotas oscillate, latency drifts on shared cloud endpoints, capabilities evolve as new model families arrive weekly, and the cost-quality-energy frontier shifts under the operator's feet[15][13]. A static binding is therefore a slowly failing decision. Adding an A/B test on top of a static dispatcher does not solve the problem; treating routing itself as a non-stationary contextual decision does.

This white paper documents how VDF AI Networks operationalises that view. Every request flows through a six-tier dispatcher: policy enforcement, prompt-aware retrieval shortlisting, rule-based filtering with a multi-objective scorer, predictive re-ranking on per-arm history, contextual-bandit selection under a disjoint-per-arm LinUCB learner[2], and challenger exploration that dual-routes a small fraction of traffic for live preference learning. Each tier is independently feature-gated and degrades to the next-simpler strategy when its signal is unavailable. The composition, not any single tier, is the contribution.

The router is self-evolving in three coupled senses. Online, every completed request becomes a reward observation that updates the chosen arm via a rank-one Sherman–Morrison update[10]; failures are folded back as a bounded penalty rather than dropped; and an offline trainer batches the run vault to re-derive priors that are atomically swapped into the live policy. We describe the design parameters, the graceful-degradation envelope, and the position of the work relative to the recent cost-quality routing literature. This paper is a design account and deliberately avoids over-claiming measured outcomes.
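The online update described above can be sketched as follows. This is a minimal illustration of a disjoint per-arm LinUCB learner with a Sherman–Morrison inverse update, not the production learner. The class and function names, the ridge prior of 1.0, and the assumption that rewards lie in [0, 1] are all hypothetical; only the exploration coefficient (0.8), the context dimension (64), and the failure reward (0.15) come from the paper.

```python
import math

CONTEXT_DIM = 64        # context dimension stated in the paper
ALPHA = 0.8             # UCB exploration coefficient stated in the paper
FAILURE_REWARD = 0.15   # bounded penalty stated in the paper


class LinUCBArm:
    """One arm of a disjoint LinUCB learner (hypothetical sketch).

    Stores A^-1 and b for the arm's own ridge regression, so no
    parameters are shared between arms.
    """

    def __init__(self, dim=CONTEXT_DIM, ridge=1.0):
        self.dim = dim
        # A starts as ridge * I, so A^-1 starts as (1 / ridge) * I.
        self.a_inv = [[(1.0 / ridge) if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(self.dim)) for row in self.a_inv]

    def ucb(self, x, alpha=ALPHA):
        """Return theta . x + alpha * sqrt(x^T A^-1 x), with theta = A^-1 b."""
        theta = self._a_inv_times(self.b)
        u = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(ui * xi for ui, xi in zip(u, x)), 0.0))
        return mean + alpha * width

    def update(self, x, reward):
        """Rank-one Sherman-Morrison update for A <- A + x x^T, b <- b + r x.

        (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x),
        using the symmetry of A^-1.
        """
        u = self._a_inv_times(x)
        denom = 1.0 + sum(ui * xi for ui, xi in zip(u, x))
        for i in range(self.dim):
            for j in range(self.dim):
                self.a_inv[i][j] -= u[i] * u[j] / denom
        for i in range(self.dim):
            self.b[i] += reward * x[i]


def select_arm(arms, x, alpha=ALPHA):
    """Pick the arm name with the highest upper confidence bound."""
    return max(arms, key=lambda name: arms[name].ucb(x, alpha))


def record_outcome(arms, name, x, reward=None):
    """Feed an observation back; reward=None marks a timeout or error,
    which is recorded as the bounded failure reward instead of being dropped."""
    arms[name].update(x, FAILURE_REWARD if reward is None else reward)
```

Because the inverse is updated in place, each observation costs O(d²) rather than the O(d³) of a fresh matrix inversion, which is what makes a per-request update practical at d = 64.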

Keywords
contextual bandits · LinUCB · model routing · disjoint per-arm learning · prompt-embedding retrieval · multi-objective scoring · online/offline learning duality · LLM serving · graceful degradation · policy-bound dispatch

AT A GLANCE
Six numbers that anchor the paper

Decision tiers: 6
independently feature-gated layers in the dispatch stack

Context dim: 64
sparse hashed features encoded per request

Exploration: α = 0.8
UCB confidence bonus on the contextual bandit

Window: ~200 obs
per-model rolling latency and throughput window

Challenger: ~2%
of traffic dual-routed for live preference learning

Failure reward: 0.15
bounded penalty fed back to the bandit on timeout or error
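Three of these numbers (the 64-dimensional hashed context, the ~200-observation window, and the ~2% challenger share) can be illustrated with a short sketch. Everything beyond the numbers themselves is an assumption made for illustration: the BLAKE2-based signed hashing, the feature names, the two series kept in the window, and independent per-request sampling for challenger traffic are not taken from the paper.

```python
import hashlib
import random
from collections import deque

CONTEXT_DIM = 64          # from the paper
WINDOW = 200              # from the paper (approximate)
CHALLENGER_RATE = 0.02    # from the paper (approximate)


def encode_context(features, dim=CONTEXT_DIM):
    """Hash named numeric features into a fixed-length vector.

    Assumed scheme: a stable 64-bit BLAKE2 digest of the feature name picks
    the bucket (low bits) and a sign (top bit), the usual hashing trick.
    """
    vec = [0.0] * dim
    for name, value in features.items():
        digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        sign = 1.0 if (h >> 63) & 1 else -1.0
        vec[h % dim] += sign * float(value)
    return vec


class RollingStats:
    """Per-model rolling window of recent latency and throughput samples."""

    def __init__(self, window=WINDOW):
        # deque(maxlen=...) silently discards the oldest sample when full.
        self.latency_ms = deque(maxlen=window)
        self.tokens_per_s = deque(maxlen=window)

    def observe(self, latency_ms, tokens_per_s):
        self.latency_ms.append(latency_ms)
        self.tokens_per_s.append(tokens_per_s)

    def mean_latency_ms(self):
        if not self.latency_ms:
            return None  # no signal yet; the caller should degrade gracefully
        return sum(self.latency_ms) / len(self.latency_ms)


def is_challenger_request(rng, rate=CHALLENGER_RATE):
    """Decide whether this request is also sent to a challenger model."""
    return rng.random() < rate
```

A caller might build a context with `encode_context({"task:summarise": 1.0, "prompt_tokens_k": 2.5})` (hypothetical feature names) and pass a seeded `random.Random` to `is_challenger_request` so that sampling is reproducible in tests.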

FIGURE 1
The six-tier router: per-request lifecycle

Inputs arrive from the workflow specification on the left and exit as a routing decision and an ordered failover list on the right. Every tier is feature-gated; the dashed return loop depicts the online/offline learning duality that gives the router its name.

Fig. 1. Per-request routing lifecycle. Each tier is feature-gated and fails open to the next-simpler strategy when its signal is unavailable. The dashed return loop shows the online reward update and the offline retraining cycle that re-derives priors.
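The fail-open composition in Fig. 1 can be sketched as a chain of gated tiers. This is an illustrative skeleton under stated assumptions, not the VDF implementation: the `Tier` type, the convention that a tier returns `None` when its signal is unavailable, the decision to raise when a tier eliminates every candidate, and the two example tiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Candidates = list[str]
TierFn = Callable[[dict, Candidates], Optional[Candidates]]


@dataclass
class Tier:
    name: str
    enabled: bool   # feature gate
    run: TierFn     # returns a new ordering/subset, or None if no signal


def route(request, catalog, tiers):
    """Return (primary, failover_list, applied_tier_names)."""
    if not catalog:
        raise ValueError("catalog is empty")
    candidates = list(catalog)
    applied = []
    for tier in tiers:
        if not tier.enabled:
            continue                 # gated off: skip the tier
        try:
            result = tier.run(request, candidates)
        except Exception:
            result = None            # a crashing tier is treated as "no signal"
        if result is None:
            continue                 # fail open: keep the previous candidates
        if not result:
            # Assumption: an explicit empty answer is a hard stop, not a skip.
            raise LookupError(f"no candidate survived tier {tier.name!r}")
        candidates = result
        applied.append(tier.name)
    return candidates[0], candidates[1:], applied


def policy_tier(request, candidates):
    """Hypothetical policy tier: keep only models the request may use."""
    return [m for m in candidates if m in request["allowed"]]


def rerank_tier(request, candidates):
    """Hypothetical re-ranker: order by observed latency when history exists."""
    history = request.get("latency_ms")
    if not history:
        return None
    return sorted(candidates, key=lambda m: history.get(m, float("inf")))
```

With both tiers enabled and latency history present, the fastest permitted model becomes the primary and the rest form the ordered failover list. With no history, the re-ranker returns `None` and the output of the policy tier is used unchanged.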

SECTION 1 Introduction & motivation

Three things change beneath an enterprise dispatcher in any given quarter. Provider quotas and rate limits drift, sometimes overnight; latency on shared cloud endpoints fluctuates with datacentre load and is correlated across tenants but invisible to any individual one; and the model catalog itself evolves as new families arrive, established ones are deprecated, and the price-quality frontier moves[15]. None of these changes is visible to a dispatcher that selects models by static configuration.

A buyer accepting this state of affairs typically responds in one of three ways: pin the safest model and pay the premium, pin the cheapest model and absorb the variance, or layer an offline A/B test on top of a static dispatcher and update the configuration by hand. None of the three scales. The first wastes capacity; the second wastes outcomes; the third turns the dispatcher into a manual rebalancing job. What is needed is a routing layer that treats the choice of model as a non-stationary contextual decision, one that absorbs the drift instead of papering over it.

The Self-Evolving Model Router is the dispatch tier of VDF AI Networks. It is designed around the observation that every routing decision is a bandit problem with a context vector and a stream...
