Training over a thousand LoRA adapters at once



Published on Jun 2, 2026

TL;DR: We extended Miles with a multi-adapter LoRA training path that lets us train thousands of LoRA adapters concurrently and asynchronously. The core change is a modification to Megatron-Bridge and Miles that lets us load multiple LoRA adapters as a single matrix. On our Qwen3.6-35B-A3B + GSM8K stress test, we ran 1,536 LoRA adapter instances concurrently with a step time under 3 minutes.

Overview

LoRA-based post-training decomposes a model into two components: a shared base model and a lightweight low-rank adapter that captures task-specific updates. This setup works well when training a single policy, since only the adapter parameters need to be optimized. However, it underutilizes resources when many policies are fine-tuned in parallel. Each LoRA training run still requires deploying the full base model alongside its adapter, so the same underlying model weights are replicated across concurrent runs. As the number of policies scales, this duplication wastes a substantial amount of VRAM.

What if we could share the same base model between policies, and just fine-tune different LoRA adapters in a single batch? This is cleaner and improves scalability: we can keep one base model, route tokens to different LoRA adapters, and have the training/inference stack treat LoRA adapters as cheap concurrent policies rather than separate model replicas.

We built on our prior work supporting LoRA RL for the Qwen3.5 model family, with the goal of extending our training stack from "one LoRA policy" to "many LoRA policies, in the same training step."

The motivation is simple: base models are large, LoRA adapters are small. If we want to run thousands of RL experiments (e.g., prompt/harness, reward design, and curriculum ablations), we can't replicate the full base model for every individual training run.
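To make the "large base, small adapter" point concrete, here is a rough back-of-the-envelope sketch. It counts weight memory only (no optimizer state, gradients, activations, or KV cache). The 35B parameter count is read off the model name, 2 bytes per parameter assumes BF16 storage, and the 50 MB adapter size is a purely hypothetical figure chosen for illustration, not a number from our runs.

```python
def replicated_weight_gb(num_policies: int, base_gb: float, adapter_gb: float) -> float:
    """Weight memory when every policy carries its own copy of the base model."""
    return num_policies * (base_gb + adapter_gb)


def shared_weight_gb(num_policies: int, base_gb: float, adapter_gb: float) -> float:
    """Weight memory when all policies share one base model."""
    return base_gb + num_policies * adapter_gb


BASE_PARAMS = 35e9       # approximate, taken from the "35B" in the model name
BYTES_PER_PARAM = 2      # assumption: BF16 weights
ADAPTER_GB = 0.05        # hypothetical adapter size (50 MB), for illustration only
NUM_POLICIES = 1536      # adapter count from the stress test

base_gb = BASE_PARAMS * BYTES_PER_PARAM / 1e9
replicated = replicated_weight_gb(NUM_POLICIES, base_gb, ADAPTER_GB)
shared = shared_weight_gb(NUM_POLICIES, base_gb, ADAPTER_GB)

print(f"replicated: {replicated:,.1f} GB, shared: {shared:,.1f} GB")
```

Under these illustrative numbers, the replicated layout needs on the order of a hundred terabytes of weights, while the shared layout stays within a few hundred gigabytes. The exact figures depend entirely on the assumed precision and adapter size.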

We built our multi-LoRA framework on top of Miles, RadixArk's continuously evolving open-source RL post-training framework. Miles already provides the pieces we need for large-scale RL, such as:

Megatron-based training with support for flexible modifications

SGLang-based rollout with support for scaling to thousands of LoRA adapters

Unified FP8 training support

We added multi-LoRA training by:

Deploying one shared Qwen3.6-35B-A3B base model

Supporting multiple LoRA adapter slots in Megatron-Bridge

Implementing multi-LoRA rollouts and training with Miles

Loading and unloading adapters online without restarting the RL trainer

Serving multiple LoRA adapters using SGLang's native multi-LoRA interface

Keeping experts adapter-free to unlock additional memory savings

LoRA with GRPO

GRPO-style RL objectives already separate the trainable policy from a frozen reference model. In the DeepSeekMath/GRPO objective, the policy update is regularized by a KL term against \(\pi_{\text{ref}}\):

\[\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min(\cdots) - \beta D_{\text{KL}} \left[ \pi_\theta \Vert \pi_{\text{ref}} \right] \right) \right]\]

For LoRA policy \(a\), the model parameters are:

\[\theta_a = \theta_0 + \Delta_a\]

where \(\theta_0\) is the shared base model and \(\Delta_a\) is the adapter-specific LoRA delta. If every LoRA adapter starts from the same base model, then the correct frozen reference for every adapter is the same model:

\[\pi_{\text{ref}, a} = \pi_{\theta_0}\]

The policy logprob depends on the adapter:

\[\log \pi_a(y_t \mid s_t) = \log \pi_{\theta_0 + \Delta_a}(y_t \mid s_t)\]

But the reference logprob does not:

\[\log \pi_{\text{ref}}(y_t \mid s_t) = \log \pi_{\theta_0}(y_t \mid s_t)\]

Using the common sampled KL estimator from GRPO implementations, define:

\[r_{a,t} = \frac{ \pi_{\text{ref}}(y_t \mid s_t) }{ \pi_a(y_t \mid s_t) } = \exp\left( \log \pi_{\text{ref}}(y_t \mid s_t) - \log \pi_a(y_t \mid s_t) \right)\]

Then the token-level KL penalty is:

\[\widehat{D}_{\text{KL}, a,t} = r_{a,t} - \log r_{a,t} - 1\]
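
As a quick sanity check on this estimator (a standard property of this form, stated here under the assumption that \(y_t\) is sampled from \(\pi_a\)): because \(\log r \le r - 1\) for all \(r > 0\), each per-token estimate is non-negative and equals zero exactly when \(r_{a,t} = 1\). Its expectation also recovers the KL divergence. First, the ratio averages to one:

\[\mathbb{E}_{y_t \sim \pi_a}\left[ r_{a,t} \right] = \sum_{y} \pi_a(y \mid s_t) \, \frac{\pi_{\text{ref}}(y \mid s_t)}{\pi_a(y \mid s_t)} = 1\]

and therefore:

\[\mathbb{E}_{y_t \sim \pi_a}\left[ r_{a,t} - \log r_{a,t} - 1 \right] = 1 + D_{\text{KL}}\left[ \pi_a \Vert \pi_{\text{ref}} \right] - 1 = D_{\text{KL}}\left[ \pi_a \Vert \pi_{\text{ref}} \right]\]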

Only the policy logprob changes with adapter \(a\). The reference logprob is shared across all adapters for the same token state. Therefore, one frozen reference model can score the packed batch once, and each adapter's policy loss can reuse those reference logprobs.

Crucially, this sharing only holds when all adapters use the same reference checkpoint. If adapter slots are warm-started from different base models or fine-tunes, the argument breaks down and we are forced to use multiple reference models.
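
The reference-sharing argument above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Miles implementation: the function names, the toy logprob values, and the per-adapter mean reduction are all assumptions made for the example. The key point it shows is that `ref_logprobs` is computed once for the packed batch and reused by every adapter, while `policy_logprobs` comes from each token's own adapter.

```python
import math


def k3_kl(ref_logprob: float, policy_logprob: float) -> float:
    """Token-level estimate r - log r - 1, with r = pi_ref / pi_a."""
    log_r = ref_logprob - policy_logprob
    return math.exp(log_r) - log_r - 1.0


def mean_kl_per_adapter(ref_logprobs, policy_logprobs, adapter_ids):
    """Average the token-level KL estimate separately for each adapter id."""
    totals, counts = {}, {}
    for ref_lp, pol_lp, a in zip(ref_logprobs, policy_logprobs, adapter_ids):
        totals[a] = totals.get(a, 0.0) + k3_kl(ref_lp, pol_lp)
        counts[a] = counts.get(a, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}


# Toy packed batch of four tokens (values are made up for illustration).
# One forward pass of the frozen base model yields the reference logprobs.
ref_logprobs = [-1.2, -0.7, -2.3, -0.4]
# Policy logprobs come from the adapter that owns each token.
policy_logprobs = [-1.2, -0.7, -2.9, -0.5]
adapter_ids = [0, 0, 1, 1]

kl_by_adapter = mean_kl_per_adapter(ref_logprobs, policy_logprobs, adapter_ids)
print(kl_by_adapter)
```

In this toy batch, adapter 0 still matches the base model on its tokens, so its penalty is zero, while adapter 1 has drifted and receives a positive penalty.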

Multi-LoRA Implementation

Multiple Adapters in Megatron-Bridge

We began by adding a new multi-LoRA transform within Megatron-Bridge's PEFT module. The new transform, MultiLoRA, behaves similarly to a normal PEFT transform by attaching the LoRA weights and freezing the layer. However, instead of wrapping a target module with one adapter, it wraps it with N...
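
To illustrate the general idea of one frozen layer serving several adapters, here is a minimal NumPy sketch. It is not the Megatron-Bridge MultiLoRA code: the function `multi_lora_linear`, the stacked `A` and `B` layout, the `adapter_ids` routing vector, and the `scale` parameter are all hypothetical names chosen for this example. The sketch assumes the adapters share a rank and are stored as stacked arrays, so each token can pick its adapter by index.

```python
import numpy as np


def multi_lora_linear(x, W, A, B, adapter_ids, scale=1.0):
    """Frozen linear layer plus a per-token LoRA delta.

    x:           (T, d_in)      packed tokens from many policies
    W:           (d_out, d_in)  frozen base weight, shared by all adapters
    A:           (N, r, d_in)   stacked LoRA down-projections
    B:           (N, d_out, r)  stacked LoRA up-projections
    adapter_ids: (T,)           adapter index for each token
    """
    base = x @ W.T                                # shared base path, computed once
    A_tok = A[adapter_ids]                        # (T, r, d_in), gathered per token
    B_tok = B[adapter_ids]                        # (T, d_out, r), gathered per token
    low = np.einsum("trd,td->tr", A_tok, x)       # project each token to rank r
    delta = np.einsum("tor,tr->to", B_tok, low)   # project back up to d_out
    return base + scale * delta


rng = np.random.default_rng(0)
T, d_in, d_out, r, N = 6, 8, 5, 2, 3
x = rng.normal(size=(T, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(N, r, d_in))
B = rng.normal(size=(N, d_out, r))
adapter_ids = np.array([0, 2, 1, 1, 0, 2])

y = multi_lora_linear(x, W, A, B, adapter_ids)
print(y.shape)
```

Gathering a full copy of `A` and `B` per token keeps the sketch short but is wasteful in memory. A production kernel would more likely group tokens by adapter before applying the low-rank matrices.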
