The Self-Evolving Model Router


WHITE PAPER · v1.0 · May 2026 · VDF-WP-2026-002

A composable, six-tier dispatch architecture that turns model selection from a static configuration into a continuously learning decision, combining policy enforcement, prompt-aware retrieval, rule-based filtering, predictive re-ranking, contextual bandits, and challenger exploration under a single, gracefully degrading routing surface.

Authors: VDF AI Research Team · Read time: 20 min · License: CC BY 4.0


ABSTRACT Enterprise dispatch of large language models has historically been a configuration decision: operators bind a model to a workload and live with the choice. Real fleets, however, are non-stationary. Provider quotas oscillate, latency drifts on shared cloud endpoints, capabilities evolve as new model families arrive weekly, and the cost-quality-energy frontier shifts under the operator's feet[15][13]. A static binding is therefore a slowly-failing decision, and the problem is not solved by adding an A/B test on top of a static dispatcher — it is solved by treating routing itself as a non-stationary contextual decision.

This white paper documents how VDF AI Networks operationalises that view. Every request flows through a six-tier dispatcher: policy enforcement, prompt-aware retrieval shortlisting, rule-based filtering with a multi-objective scorer, predictive re-ranking on per-arm history, contextual-bandit selection under a disjoint-per-arm LinUCB learner[2], and challenger exploration that dual-routes a small fraction of traffic for live preference learning. Each tier is independently feature-gated and degrades to the next-simpler strategy when its signal is unavailable. The composition, not any single tier, is the contribution.
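The gating and fallback behaviour described above can be illustrated with a minimal sketch. It is not the VDF implementation: the `dispatch` function, the tier names, the `COST` table, and the convention that a tier returns `None` when its signal is unavailable are all assumptions made for illustration. The sketch also treats every tier uniformly, whereas a production policy tier might reasonably be handled more strictly.

```python
from typing import Callable, Optional, Sequence

# Hypothetical tier signature: take the request and the current ranking,
# return a narrowed or re-ordered ranking, or None when no signal is available.
Tier = Callable[[dict, list], Optional[list]]

def dispatch(request: dict, candidates: Sequence[str],
             tiers: Sequence[tuple], flags: dict) -> list:
    """Run enabled tiers in order; a gated-off or signal-less tier is skipped."""
    ranked = list(candidates)
    for name, tier in tiers:
        if not flags.get(name, False):
            continue  # feature gate closed: keep the simpler ranking
        try:
            result = tier(request, ranked)
        except Exception:
            result = None  # a failing tier degrades instead of failing the request
        if result is not None:
            ranked = list(result)
    return ranked  # ranked[0] is the decision, the rest is the failover order

COST = {"small": 1.0, "medium": 3.0, "large": 9.0}  # hypothetical relative prices

def policy_tier(request, models):
    allowed = request.get("allowed_models")
    if allowed is None:
        return None  # no policy attached to this request
    return [m for m in models if m in allowed]

def rule_tier(request, models):
    return sorted(models, key=lambda m: COST[m])  # cheapest first

def predictive_tier(request, models):
    return None  # stands in for "no per-arm history yet"

TIERS = [("policy", policy_tier), ("rules", rule_tier), ("predictive", predictive_tier)]
FLAGS = {"policy": True, "rules": True, "predictive": True}

ranking = dispatch({"allowed_models": {"large", "medium"}},
                   ["large", "small", "medium"], TIERS, FLAGS)
```

In this toy run the policy tier removes `small`, the rule tier orders the remainder by cost, and the predictive tier contributes nothing, so the ranking produced by the simpler tiers stands.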

The router is self-evolving in three coupled senses. First, every completed request becomes an online reward observation that updates the chosen arm via a rank-one Sherman–Morrison update[10]. Second, failures are folded back as a bounded penalty rather than dropped. Third, an offline trainer batches the run vault to re-derive priors that are atomically swapped into the live policy. We describe the design parameters, the graceful-degradation envelope, and the position of the work relative to the recent cost-quality routing literature. The paper is a design account and deliberately avoids over-claiming measured outcomes.
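A rough sketch of the online update may help make the mechanism concrete. It shows one arm of a disjoint LinUCB learner that stores the inverse design matrix directly and refreshes it with the Sherman-Morrison identity, so no matrix inversion is needed per request. The class name `LinUCBArm`, the identity initialisation, the two-dimensional toy context, and the `select` helper are assumptions for illustration; only α = 0.8 and the failure reward of 0.15 come from the paper.

```python
import math

ALPHA = 0.8            # exploration weight stated in the paper
FAILURE_REWARD = 0.15  # bounded penalty stated in the paper

class LinUCBArm:
    """One arm of a disjoint LinUCB learner, keeping A^-1 instead of A."""

    def __init__(self, dim: int, alpha: float = ALPHA):
        self.alpha = alpha
        # Assumption: A starts as the identity, so A^-1 is the identity too.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v):
        return [sum(r * vi for r, vi in zip(row, v)) for row in self.a_inv]

    def ucb(self, x):
        """Estimated reward plus alpha times the confidence width."""
        theta = self._a_inv_times(self.b)
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(0.0, sum(u * xi for u, xi in zip(a_inv_x, x))))
        return mean + self.alpha * width

    def update(self, x, reward):
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        # (A^-1 is symmetric, so x^T A^-1 equals (A^-1 x)^T.)
        u = self._a_inv_times(x)
        denom = 1.0 + sum(ui * xi for ui, xi in zip(u, x))
        for i, ui in enumerate(u):
            row = self.a_inv[i]
            for j, uj in enumerate(u):
                row[j] -= ui * uj / denom
        for i, xi in enumerate(x):
            self.b[i] += reward * xi

def select(arms: dict, x):
    """Pick the arm with the highest upper confidence bound for context x."""
    return max(arms, key=lambda name: arms[name].ucb(x))

arms = {"fast": LinUCBArm(2), "strong": LinUCBArm(2)}
x = [1.0, 0.0]
arms["fast"].update(x, 1.0)               # a successful completion
arms["strong"].update(x, FAILURE_REWARD)  # a timeout, folded back rather than dropped
choice = select(arms, x)
```

Feeding the failure back as a small positive reward still shrinks the arm's confidence width for that context, so the learner stops exploring a failing arm instead of treating it as unobserved.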

Keywords contextual bandits · LinUCB · model routing · disjoint per-arm learning · prompt-embedding retrieval · multi-objective scoring · online/offline learning duality · LLM serving · graceful degradation · policy-bound dispatch

AT A GLANCE Six numbers that anchor the paper

Decision tiers: 6 independently feature-gated layers in the dispatch stack

Context dim: 64 sparse hashed features encoded per request

Exploration: α = 0.8, the UCB confidence bonus on the contextual bandit

Window: ~200 obs, the per-model rolling latency and throughput window

Challenger: ~2% of traffic dual-routed for live preference learning

Failure reward: 0.15, a bounded penalty fed back to the bandit on timeout or error
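To show how three of these parameters might appear in code, here is a small sketch under stated assumptions. The paper gives the numbers but not the mechanisms, so the signed hashing scheme in `encode`, the `RollingStats` class, and the `is_challenger` sampler are illustrative guesses rather than the router's actual implementation. The window and challenger rate are approximate in the source (~200 and ~2%) and are fixed here for simplicity.

```python
import hashlib
import random
from collections import deque

CONTEXT_DIM = 64        # context dimension from the paper
WINDOW = 200            # approximate rolling window from the paper
CHALLENGER_RATE = 0.02  # approximate challenger fraction from the paper

def encode(features: dict) -> list:
    """Hash key=value pairs into a fixed-length sparse vector (the hashing trick)."""
    vec = [0.0] * CONTEXT_DIM
    for key, value in features.items():
        digest = hashlib.sha256(f"{key}={value}".encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % CONTEXT_DIM
        sign = 1.0 if digest[4] % 2 == 0 else -1.0  # signed hashing, an assumption
        vec[index] += sign
    return vec

class RollingStats:
    """Keeps only the most recent WINDOW latency observations for one model."""

    def __init__(self, size: int = WINDOW):
        self.latencies = deque(maxlen=size)  # old entries fall off automatically

    def record(self, latency_ms: float) -> None:
        self.latencies.append(latency_ms)

    def mean(self):
        if not self.latencies:
            return None  # no signal yet; a caller would fall back to a simpler tier
        return sum(self.latencies) / len(self.latencies)

def is_challenger(rng: random.Random) -> bool:
    """Decide whether this request is also sent to a challenger model."""
    return rng.random() < CHALLENGER_RATE

context = encode({"task": "summarise", "lang": "en", "tokens_bucket": "1k-4k"})
stats = RollingStats()
for latency in range(250):
    stats.record(float(latency))
rng = random.Random(0)
challenger_count = sum(is_challenger(rng) for _ in range(10_000))
```

A bounded `deque` keeps the latency estimate responsive to drift, since observations older than the window no longer influence the mean.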

FIGURE 1 The six-tier router — per-request lifecycle

Inputs arrive from the workflow specification on the left and exit as a routing decision and an ordered failover list on the right. Every tier is feature-gated; the dashed return loop depicts the online/offline learning duality that gives the router its name.

Fig. 1. Per-request routing lifecycle. Each tier is feature-gated and fails open to the next-simpler strategy when its signal is unavailable. The dashed return loop shows the online reward update and the offline retraining cycle that re-derives priors.

SECTION 1 Introduction & motivation

Three things change beneath an enterprise dispatcher in any given quarter. Provider quotas and rate limits drift, sometimes overnight; latency on shared cloud endpoints fluctuates with datacentre load and is correlated across tenants but invisible to any individual one; and the model catalog itself evolves: new families arrive, established ones are deprecated, and the price-quality frontier moves[15]. None of these changes is visible to a dispatcher that selects models by static configuration.

A buyer accepting this state of affairs typically responds in one of three ways: pin the safest model and pay the premium, pin the cheapest model and absorb the variance, or layer an offline A/B test on top of a static dispatcher and update the configuration by hand. None of the three scales. The first wastes capacity; the second wastes outcomes; the third turns the dispatcher into a manual rebalancing job. What is needed is a routing layer that treats the choice of model as a non-stationary contextual decision — one that absorbs the drift instead of papering over it.

The Self-Evolving Model Router is the dispatch tier of VDF AI Networks. It is designed around the observation that every routing decision is a bandit problem with a context vector and a stream...
