LT2: Linear-Time Looped Transformers
Chunyuan Deng¹,<br>Yizhe Zhang²,<br>Rui-jie Zhu³,<br>Yuanyuan Xu¹,<br>Jiarui Liu⁴,<br>T. S. Eugene Ng¹,<br>Hanjie Chen¹
¹ Rice University<br>² Apple<br>³ UC Santa Cruz<br>⁴ Carnegie Mellon University
Figure 1.<br>(Left) LT2 occupies a new region of the parameter-efficiency frontier: for the same parameter budget, LT2 models achieve better quality with far lower inference cost than standard Looped Transformers.<br>(Right) After distillation from a pre-trained full-attention Looped Transformer, Ouro-hybrid-1.4B is competitive with industry-level 3B–4B models while inheriting LT2's linear-time inference.
Overview
The scaling problem of looped full attention.<br>Attention FLOPs (left) and KV-cache memory (right) for a 1.3B model vs. sequence length.<br>Because each loop re-runs full attention, both costs compound with the number of loops.<br>LT2's linear/sparse mixers keep both curves flat regardless of loop count.
Looped Transformers (LT) are an elegant idea: instead of stacking many independently-parameterized layers,<br>reuse the same block of weights $T$ times before producing the output token.<br>This gives $T\times$ the effective depth at $1\times$ the parameter count — a compelling handle for<br>parameter-efficient reasoning at inference time.
But there is a catch. Each loop re-runs full quadratic self-attention over the entire sequence.<br>Attention FLOPs grow as $\mathcal{O}(L^2)$ per loop iteration, or $\mathcal{O}(T \cdot L^2)$ across $T$ loops, and the KV-cache grows as $\mathcal{O}(T \cdot L)$ at inference, where $L$ is the sequence length.<br>Adding loops to gain reasoning depth therefore multiplies the attention cost: the architecture becomes most expensive exactly where you want to scale it.
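The scaling argument above can be made concrete with a small cost model. This is only a sketch: the function name `looped_attention_costs` is hypothetical, and all constants (heads, head dimension, bytes per element, number of blocks) are dropped so that only the dependence on sequence length $L$ and loop count $T$ remains.

```python
def looped_attention_costs(seq_len: int, num_loops: int) -> dict:
    """Unitless scaling of full-attention cost in a looped block.

    Assumption: constants are omitted, so the numbers are only
    meaningful relative to each other, not as real FLOP or byte counts.
    """
    flops_per_loop = seq_len ** 2               # O(L^2) per loop iteration
    total_flops = num_loops * flops_per_loop    # every loop re-runs attention
    kv_cache = num_loops * seq_len              # O(T * L) cached entries
    return {
        "flops_per_loop": flops_per_loop,
        "total_flops": total_flops,
        "kv_cache": kv_cache,
    }


base = looped_attention_costs(seq_len=1024, num_loops=1)
looped = looped_attention_costs(seq_len=1024, num_loops=4)
print(looped["total_flops"] // base["total_flops"])  # 4
print(looped["kv_cache"] // base["kv_cache"])        # 4
```

Under this model, going from one loop to four multiplies both attention FLOPs and KV-cache size by four at a fixed sequence length, while doubling the sequence length quadruples the FLOPs of every loop.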
LT2 (Linear-Time Looped Transformers) asks: can we keep the looping, but cut<br>the attention cost? We replace full softmax attention inside each loop with subquadratic token mixers —<br>linear attention and sparse attention — and find that looping and efficient attention are not just compatible,<br>but genuinely synergistic. The loop changes what the efficient mixer can do, not just how many times it runs.
Subquadratic Attention in Looped Transformers
Architecture formulation
A standard Transformer of depth $N$ stacks $N$ independently-parameterized blocks<br>$\{\mathcal{F}_\ell\}_{\ell=1}^{N}$:
$$\mathcal{F}_\ell(\mathbf{h}) = \mathbf{h}' + \mathrm{FFN}_\ell(\mathbf{h}'), \qquad \mathbf{h}' = \mathbf{h} + \mathrm{MHA}_\ell(\mathbf{h}).$$
A Looped Transformer (LT) reuses these $N$ shared blocks for $T$ iterations:
$$\mathbf{h}^{(0)} = \mathrm{Emb}(\mathbf{x}), \quad \mathbf{h}^{(\tau)} = \bigl(\mathcal{F}_N \circ \cdots \circ \mathcal{F}_1\bigr)\!\bigl(\mathbf{h}^{(\tau-1)}\bigr), \quad \tau = 1, \ldots, T,$$
yielding effective depth $T \cdot N$ with only $N$ unique parameter sets.<br>Each $\mathrm{MHA}_\ell$ call costs $\mathcal{O}(L^2)$ FLOPs and runs once per loop, so total attention FLOPs are $\mathcal{O}(T \cdot L^2)$ per block and the KV-cache at inference is $\mathcal{O}(T \cdot L)$; both scale linearly with $T$.<br>LT2 replaces MHA with a subquadratic token mixer:
$$\mathbf{h}' = \mathbf{h} + \mathrm{LinearMixer}_\ell(\mathbf{h}),$$
keeping the looping, the weight sharing, and the learned per-loop residual gate<br>$\mathbf{h}^{(\tau)} = \widetilde{\mathbf{h}}^{(\tau)} + \boldsymbol{\rho}_\tau \odot \mathbf{h}^{(\tau-1)}$ unchanged, where $\widetilde{\mathbf{h}}^{(\tau)}$ is the output of the shared block stack at loop $\tau$.<br>Beyond efficiency, looping amplifies the expressive power of subquadratic mixers in two distinct ways.
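The looped forward pass with the per-loop residual gate can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the released implementation: `looped_forward` and `make_toy_block` are hypothetical names, the toy block stands in for a real mixer-plus-FFN block, $\widetilde{\mathbf{h}}^{(\tau)}$ is taken to be the block-stack output at loop $\tau$, and each gate $\boldsymbol{\rho}_\tau$ is assumed to be a vector of size $d$ broadcast over positions.

```python
import numpy as np


def looped_forward(h0, blocks, rho):
    """Apply the same N blocks T times with a per-loop residual gate.

    h0:     (L, d) embedded input, h^(0)
    blocks: list of N callables, shared across all loops
    rho:    (T, d) gate vectors, one per loop (assumed shape)
    """
    h = h0
    for rho_tau in rho:                  # tau = 1, ..., T
        h_tilde = h
        for block in blocks:             # F_N o ... o F_1 with shared weights
            h_tilde = block(h_tilde)
        h = h_tilde + rho_tau * h        # h^(tau) = h~^(tau) + rho_tau * h^(tau-1)
    return h


def make_toy_block(weight):
    """Hypothetical stand-in for one block: residual plus a tanh update."""
    return lambda h: h + np.tanh(h @ weight)


rng = np.random.default_rng(0)
L, d, N, T = 5, 4, 2, 3
blocks = [make_toy_block(0.1 * rng.standard_normal((d, d))) for _ in range(N)]
rho = np.full((T, d), 0.5)               # illustrative values; learned in practice
out = looped_forward(rng.standard_normal((L, d)), blocks, rho)
print(out.shape)  # (5, 4)
```

The parameter count is fixed by the `N` entries of `blocks` no matter how large `T` is; only the `T` gate vectors are loop-specific.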
Linear attention: rank-$T$ memory update
Frontier linear-attention architectures (GDN, KDA, RWKV7) maintain a fixed-size recurrent state<br>$\mathbf{S}_t \in \mathbb{R}^{d_k \times d_v}$ via a diagonal-plus-low-rank (DPLR) operator:
$$\mathbf{S}_t = \mathbf{A}_t\,\mathbf{S}_{t-1} + \beta_t\,\mathbf{k}_t\mathbf{v}_t^{\top}, \qquad \mathbf{A}_t = \mathrm{Diag}(\boldsymbol{\alpha}_t)\bigl(\mathbf{I} - \beta_t\,\mathbf{k}_t\mathbf{k}_t^{\top}\bigr).$$
The matrix $\mathbf{A}_t$ is a diagonal matrix plus a rank-1 perturbation (identity plus rank-1 when $\boldsymbol{\alpha}_t = \mathbf{1}$), so a single non-looped DPLR block<br>can only modify recurrent memory along one direction per token.<br>When looped $T$ times, the cumulative state-transition operator across all iterations is:
$$\mathbf{A}_t^{\mathrm{eff}} = \prod_{\tau=1}^{T} \mathbf{A}_t^{(\tau)} = \prod_{\tau=1}^{T} \mathrm{Diag}\!\bigl(\boldsymbol{\alpha}_t^{(\tau)}\bigr)\!\left(\mathbf{I} - \beta_t^{(\tau)}\,\mathbf{k}_t^{(\tau)}\mathbf{k}_t^{(\tau)\top}\right).$$
When the per-loop keys $\{\mathbf{k}_t^{(\tau)}\}$ are orthogonal (which diverse intermediate representations approach in practice),<br>the product erases $T$ distinct directions in memory — yielding a rank-$T$ perturbation<br>and directly multiplying the state-tracking capacity without any added parameters.
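The rank-$T$ claim can be checked numerically in the simplest setting. The sketch below assumes no decay ($\boldsymbol{\alpha}_t^{(\tau)} = \mathbf{1}$), a shared scalar $\beta$, and exactly orthonormal per-loop keys drawn from a QR factorization; the function names are hypothetical, and the product is taken with loop 1 applied first. In that setting the product collapses to $\mathbf{I} - \sum_\tau \beta^{(\tau)} \mathbf{k}_t^{(\tau)}\mathbf{k}_t^{(\tau)\top}$, so the deviation from identity should have rank exactly $T$.

```python
import numpy as np


def effective_transition(alphas, betas, keys):
    """Product over loops of Diag(alpha) @ (I - beta * k k^T).

    alphas: (T, d), betas: (T,), keys: (T, d). Loop 1 is applied first.
    """
    d = keys.shape[1]
    a_eff = np.eye(d)
    for alpha, beta, k in zip(alphas, betas, keys):
        a_tau = np.diag(alpha) @ (np.eye(d) - beta * np.outer(k, k))
        a_eff = a_tau @ a_eff
    return a_eff


def perturbation_rank(num_loops, d=8, beta=0.5, seed=0):
    """Rank of (I - A_eff) for orthonormal keys and no decay (alpha = 1)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, num_loops)))
    keys = q.T                               # (T, d), orthonormal rows
    alphas = np.ones((num_loops, d))         # assumption: no decay
    betas = np.full(num_loops, beta)
    a_eff = effective_transition(alphas, betas, keys)
    return int(np.linalg.matrix_rank(np.eye(d) - a_eff, tol=1e-8))


print(perturbation_rank(1), perturbation_rank(3))  # 1 3
```

With non-trivial decay the same telescoping argument bounds the deviation from the product of the diagonal factors by rank $T$, but the clean identity-minus-projection form above no longer applies.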
Sparse attention: $\mathcal{O}(Tw)$ receptive field
A sliding-window block with window $w$ restricts each query at position $t$ to attend only to tokens<br>$\mathcal{I}_t^{(1)} = \{t - w + 1, \ldots, t\}$.<br>After $T$ loop iterations, information propagates further each loop, and chaining this inductively gives:
$$\mathcal{I}_t^{(T)} \supseteq \bigl\{\max(1,\, t - Tw + 1),\, \ldots,\, t\bigr\}, \qquad...