Borrowing the Night: Reclaiming Idle Inference GPUs for Research


July 2, 2026, by Runway Platform Team

Production inference demand rises and falls in a daily wave. We built a capacity controller that reallocates GPUs between production and research so production tracks demand without over-provisioning. Using queueing theory, we optimized the allocations, which gave research more GPUs overnight and shortened queue waits all day.

In a previous post we described how we use Kueue to lend idle GPUs across research initiatives. Here we focus on reclaiming production GPUs for research during off-peak hours, then returning them before the morning peak.

Production inference demand moves in cycles. Even with a global user base, traffic concentrates around North American working hours: it climbs through the morning, peaks around 9am ET and bottoms out around 8pm ET. The trough can be less than half the peak.

That creates a familiar dilemma for every AI company. Provision for peak demand and most GPUs sit idle every night. Provision for trough demand and queues blow out every morning.

Instead, the production fleet should follow the demand cycle: grow into the morning peak, shrink into the evening trough and lend whatever it isn't using to research.

[Figure: GPUs allocated to production and production demand by hour of day (ET), compared with static provisioning for the peak.] Production capacity follows demand up to the morning peak and releases the difference, the shaded area, to research overnight.

We built a lightweight capacity controller to do this reallocation (internally called deckard, after the Blade Runner protagonist who relentlessly reclaims replicants). It is deliberately narrow, managing exactly two things:

Workloads: the replica counts of our Kubernetes inference deployments (one bucket of GPUs per model).

Compute: the size of the underlying cloud GPU node pools, so it can physically move nodes between production and research clusters.

Rather than re-deciding allocations continuously, the controller applies a small set of time windows, each with its own pre-computed schedule:

[Figure: weekly calendar, Monday through Sunday, divided into the windows weekday day, peak (8:30–12:30), weekday night, weekend day and weekend night.] Five windows, each with its own pre-computed schedule. The high-traffic peak (8:30am–12:30pm ET on weekdays) is carved out as a sub-window so it can scale up harder than the rest of the day.
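To make the window idea concrete, here is a minimal sketch of how a timestamp could be mapped to one of the five windows. The peak boundaries come from the text above. Everything else is an assumption for illustration: the function name `window_for`, the window labels, and especially the 6am and 6pm day/night boundaries, which this post does not specify.

```python
from datetime import datetime, time

# Peak boundaries are stated in the post: 8:30am-12:30pm ET on weekdays.
PEAK_START = time(8, 30)
PEAK_END = time(12, 30)

# ASSUMPTION: the post does not say where "day" and "night" begin.
# 6am and 6pm are placeholders chosen only for this sketch.
DAY_START = time(6, 0)
DAY_END = time(18, 0)


def window_for(ts_et: datetime) -> str:
    """Return the schedule window for a timestamp already expressed in ET."""
    t = ts_et.time()
    is_weekend = ts_et.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    is_day = DAY_START <= t < DAY_END

    if is_weekend:
        return "weekend_day" if is_day else "weekend_night"
    # The peak is a sub-window of the weekday day, so check it first.
    if PEAK_START <= t < PEAK_END:
        return "peak"
    return "weekday_day" if is_day else "weekday_night"
```

A controller built this way would look up the pre-computed schedule for the returned label instead of recomputing allocations on every tick.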

Windows are coarse on purpose, because moving GPUs between clusters is expensive: draining and tearing a node down on one side and standing it back up on the other takes 20 to 60 minutes on our cloud provider.

It's fair to ask why we make this hard for ourselves and shift capacity between clusters. If research and production shared one multi-tenant Kubernetes cluster, transferring a GPU allocation between them would be a scheduling decision rather than a physical transfer, much as Kueue already lends idle GPUs within a single cluster, as described in the previous post.

We keep the environments in separate clusters anyway. The isolation gives us:

Blast-radius containment. A runaway research job, or a single overly broad ClusterRole, can't take down customer-facing inference.

Independent infrastructure. Separate clusters can run different Kubernetes versions, GPU drivers and networking stacks, so we can test risky infra upgrades on research without exposing production. We've needed this when a driver or networking change broke on certain versions, and when we wanted to run a bleeding-edge PyTorch build to take advantage of training performance improvements.

Because transfers are slow, the controller predicts the wave for a given window and moves capacity ahead of demand.

A Crash Course in Queueing Theory

So how many GPUs do we need to service requests? One approach is to use queueing theory. The field predates digital computing. In 1909, the Danish mathematician Agner Krarup Erlang studied telephone switchboards to figure out how many circuits were needed so callers rarely hear a busy signal.

Translating the definitions to our domain, there are four key variables:

arrival_rate: how fast new generations arrive (requests / second).

service_rate: how fast one GPU worker serves requests (requests / second). (So average runtime is 1 / service_rate.)

num_servers: how many GPU workers we run.

traffic_intensity: how "busy" the system is on average, defined as traffic_intensity = arrival_rate / (num_servers * service_rate).
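The four variables can be sketched in a few lines of Python. The `traffic_intensity` function follows the definition above directly. The `erlang_c` function is an assumption on our part for illustration: it is the textbook Erlang C formula for an M/M/c queue (random arrivals, exponentially distributed service times, a shared queue), which gives the probability that an arriving request has to wait. Nothing above states that the controller uses this exact model.

```python
def traffic_intensity(arrival_rate: float, num_servers: int, service_rate: float) -> float:
    """Average fraction of total serving capacity that is in use."""
    return arrival_rate / (num_servers * service_rate)


def erlang_c(num_servers: int, offered_load: float) -> float:
    """Probability that an arriving request must queue in an M/M/c system.

    offered_load is arrival_rate / service_rate, i.e. the average number
    of busy workers the demand would need.
    ASSUMPTION: M/M/c is a textbook model chosen for this sketch.
    """
    if offered_load >= num_servers:
        return 1.0  # demand meets or exceeds capacity: every request waits

    # Erlang B via the standard numerically stable recurrence.
    blocking = 1.0
    for k in range(1, num_servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)

    # Convert Erlang B (blocking probability) to Erlang C (waiting probability).
    rho = offered_load / num_servers
    return blocking / (1.0 - rho * (1.0 - blocking))
```

For instance, `traffic_intensity(95, 100, 1.0)` is 0.95. Under the same assumed model, adding workers at a fixed offered load lowers the waiting probability, which is the trade-off a capacity schedule has to price.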

A motivating example: suppose one GPU worker completes 1 request/sec (service_rate = 1) and you run 100 GPUs for arrival_rate = 95, so traffic_intensity = 0.95. Now say you get a short burst to 110 req/sec for a couple of minutes (or a few GPUs temporarily drain/restart). Backlog accumulates at ~10 req/sec during the burst. Even after demand returns to 95, you only "catch up" at 5 req/sec (100 served − 95...
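The arithmetic in this example can be checked with a short sketch. It treats traffic as a steady flow (a fluid approximation that ignores randomness) and assumes "a couple of minutes" means 120 seconds. The function names are hypothetical and exist only for this illustration.

```python
def backlog_after_burst(capacity: float, burst_rate: float, burst_seconds: float) -> float:
    """Requests left waiting once a burst above capacity ends."""
    return max(0.0, (burst_rate - capacity) * burst_seconds)


def drain_seconds(backlog: float, capacity: float, arrival_rate: float) -> float:
    """Time to clear a backlog using only the spare capacity."""
    spare = capacity - arrival_rate
    if spare <= 0:
        return float("inf")  # no headroom: the backlog never clears
    return backlog / spare


capacity = 100 * 1.0  # 100 GPU workers at service_rate = 1 request/sec each

# ASSUMPTION: the burst lasts 120 seconds.
backlog = backlog_after_burst(capacity, burst_rate=110.0, burst_seconds=120.0)
recovery = drain_seconds(backlog, capacity, arrival_rate=95.0)

print(backlog)   # 1200.0 requests queued by the end of the burst
print(recovery)  # 240.0 seconds to drain at 5 requests/sec of headroom
```

Under these assumptions the backlog takes twice as long to clear as the burst that created it, because the headroom (5 req/sec) is half the overload (10 req/sec).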
