LT2: Linear-Time Looped Transformers

Chunyuan Deng1, Yizhe Zhang2, Rui-jie Zhu3, Yuanyuan Xu1, Jiarui Liu4, T. S. Eugene Ng1, Hanjie Chen1

1 Rice University 2 Apple 3 UC Santa Cruz 4 Carnegie Mellon University


Figure 1. (Left) LT2 occupies a new region of the parameter-efficiency frontier: for the same parameter budget, LT2 models achieve better quality with far lower inference cost than standard Looped Transformers. (Right) After distillation from a pre-trained full-attention Looped Transformer, Ouro-hybrid-1.4B is competitive with industry-level 3B–4B models while inheriting LT2's linear-time inference.

Overview

The scaling problem of looped full attention. Attention FLOPs (left) and KV-cache memory (right) for a 1.3B model vs. sequence length. Because each loop re-runs full attention, both costs compound with the number of loops. LT2's linear/sparse mixers keep both curves flat regardless of loop count.

Looped Transformers (LT) are an elegant idea: instead of stacking many independently-parameterized layers, reuse the same block of weights $T$ times before producing the output token. This gives $T\times$ the effective depth at $1\times$ the parameter count — a compelling handle for parameter-efficient reasoning at inference time.

But there is a catch. Each loop re-runs full quadratic self-attention over the entire sequence. For a sequence of length $L$, attention FLOPs grow as $\mathcal{O}(L^2)$ per loop iteration, so $\mathcal{O}(T \cdot L^2)$ in total, and the KV-cache grows as $\mathcal{O}(T \cdot L)$ at inference. As you add more loops to get more reasoning depth, the attention cost compounds: exactly where you want to scale, the architecture becomes most expensive.
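To make this scaling concrete, here is a minimal back-of-the-envelope sketch. The function names and the cost model are illustrative assumptions, not taken from the paper: it counts only the score and value matmuls of attention, ignores causal masking, and assumes every loop iteration keeps its own keys and values in 2-byte precision.

```python
def attention_flops(seq_len, d_model, n_layers, n_loops):
    """Approximate FLOPs of the QK^T and attention-times-V matmuls.

    Assumed cost model: each of the two matmuls costs 2 * L^2 * d FLOPs,
    so one attention layer costs about 4 * L^2 * d per loop iteration.
    """
    per_layer = 4 * seq_len**2 * d_model
    return per_layer * n_layers * n_loops


def kv_cache_bytes(seq_len, d_model, n_layers, n_loops, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    Assumption: one key and one value vector of width d_model is stored
    per token, per layer, and per loop iteration.
    """
    return 2 * seq_len * d_model * n_layers * n_loops * bytes_per_value


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    for n_loops in (1, 2, 4):
        flops = attention_flops(8192, 2048, 24, n_loops)
        cache = kv_cache_bytes(8192, 2048, 24, n_loops)
        print(f"T={n_loops}: {flops:.3e} attention FLOPs, {cache / 2**30:.2f} GiB KV-cache")
```

Under this model, doubling the sequence length quadruples attention FLOPs and doubles the cache, while doubling the loop count doubles both.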

LT2 (Linear-Time Looped Transformers) asks: can we keep the looping, but cut the attention cost? We replace full softmax attention inside each loop with subquadratic token mixers — linear attention and sparse attention — and find that looping and efficient attention are not just compatible, but genuinely synergistic. The loop changes what the efficient mixer can do, not just how many times it runs.

Subquadratic Attention in Looped Transformers

Architecture formulation

A standard Transformer of depth $N$ stacks $N$ independently-parameterized blocks $\{\mathcal{F}_\ell\}_{\ell=1}^{N}$:

$$\mathcal{F}_\ell(\mathbf{h}) = \mathbf{h}' + \mathrm{FFN}_\ell(\mathbf{h}'), \qquad \mathbf{h}' = \mathbf{h} + \mathrm{MHA}_\ell(\mathbf{h}).$$

A Looped Transformer (LT) reuses these $N$ shared blocks for $T$ iterations:

$$\mathbf{h}^{(0)} = \mathrm{Emb}(\mathbf{x}), \quad \mathbf{h}^{(\tau)} = \bigl(\mathcal{F}_N \circ \cdots \circ \mathcal{F}_1\bigr)\!\bigl(\mathbf{h}^{(\tau-1)}\bigr), \quad \tau = 1, \ldots, T,$$

yielding effective depth $T \cdot N$ with only $N$ unique parameter sets. Each $\mathrm{MHA}_\ell$ call costs $\mathcal{O}(L^2)$ FLOPs and is executed once per loop, and the KV-cache at inference is $\mathcal{O}(T \cdot L)$; total attention compute and cache size therefore both scale linearly with $T$. LT2 replaces MHA with a subquadratic token mixer:

$$\mathbf{h}' = \mathbf{h} + \mathrm{LinearMixer}_\ell(\mathbf{h}),$$

keeping the looping, weight sharing, and a learned per-loop residual gate $\mathbf{h}^{(\tau)} = \widetilde{\mathbf{h}}^{(\tau)} + \boldsymbol{\rho}_\tau \odot \mathbf{h}^{(\tau-1)}$ unchanged. Beyond efficiency, looping amplifies the expressive power of subquadratic mixers in two distinct ways.
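The looping and gating described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the LT2 implementation: `make_block` is a hypothetical stand-in for a shared block (its "mixer" is position-wise and does not actually mix tokens), and $\widetilde{\mathbf{h}}^{(\tau)}$ is read as the output of the full block stack applied to $\mathbf{h}^{(\tau-1)}$.

```python
import numpy as np


def make_block(rng, d_model, scale=0.02):
    """Hypothetical stand-in for one shared block F_l (two residual sub-layers)."""
    w_mix = rng.normal(0.0, scale, (d_model, d_model))
    w_ffn = rng.normal(0.0, scale, (d_model, d_model))

    def block(h):
        h = h + h @ w_mix               # placeholder for the token mixer
        return h + np.tanh(h @ w_ffn)   # placeholder for the FFN
    return block


def looped_forward(h0, blocks, rho):
    """Apply the same N blocks T times with a per-loop residual gate.

    h0:  (L, d) embedded input.
    rho: (T, d) gate values, one vector per loop iteration.
    """
    h = h0
    for rho_tau in rho:
        h_tilde = h
        for block in blocks:            # the same weights are reused every loop
            h_tilde = block(h_tilde)
        h = h_tilde + rho_tau * h       # gated residual from the previous loop
    return h


rng = np.random.default_rng(0)
blocks = [make_block(rng, d_model=8) for _ in range(2)]   # N = 2 shared blocks
h0 = rng.normal(size=(5, 8))                              # L = 5 tokens
out = looped_forward(h0, blocks, rho=np.full((3, 8), 0.5))  # T = 3 loops
```

The number of loops is set only by the length of `rho`, so the effective depth changes while the set of block weights stays fixed.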

Linear attention: rank-$T$ memory update

Frontier linear-attention architectures (GDN, KDA, RWKV7) maintain a fixed-size recurrent state $\mathbf{S}_t \in \mathbb{R}^{d_k \times d_v}$ via a DPLR operator:

$$\mathbf{S}_t = \mathbf{A}_t\,\mathbf{S}_{t-1} + \beta_t\,\mathbf{k}_t\mathbf{v}_t^{\top}, \qquad \mathbf{A}_t = \mathrm{Diag}(\boldsymbol{\alpha}_t)\bigl(\mathbf{I} - \beta_t\,\mathbf{k}_t\mathbf{k}_t^{\top}\bigr).$$
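As a minimal sketch of this recurrence, the NumPy function below performs one state update. The function name, the scalar `beta`, and the read-out convention $\mathbf{k}^{\top}\mathbf{S}$ used in the check are assumptions made for illustration; real kernels are chunked and batched.

```python
import numpy as np


def dplr_step(S_prev, k, v, alpha, beta):
    """One recurrent update S_t = A_t S_{t-1} + beta * k v^T.

    S_prev: (d_k, d_v) state, k: (d_k,) key, v: (d_v,) value,
    alpha:  (d_k,) diagonal decay, beta: scalar write strength.
    """
    # A_t S = Diag(alpha) (S - beta * k (k^T S)); A_t is never materialised.
    erased = S_prev - beta * np.outer(k, k @ S_prev)
    return alpha[:, None] * erased + beta * np.outer(k, v)


rng = np.random.default_rng(0)
S = rng.normal(size=(6, 4))
k = rng.normal(size=6)
k /= np.linalg.norm(k)                  # unit-norm key
v = rng.normal(size=4)
S_new = dplr_step(S, k, v, alpha=np.ones(6), beta=1.0)
readback = k @ S_new                    # read the state with the same key
```

With no decay, $\beta_t = 1$, and a unit-norm key, `readback` equals `v`: the update first erases whatever was stored along $\mathbf{k}_t$ and then writes the new value there.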

The matrix $\mathbf{A}_t$ is a diagonal decay applied to an identity-minus-rank-1 factor, so a single non-looped DPLR block can only modify recurrent memory along one direction per token. When looped $T$ times, the cumulative state-transition operator across all iterations is:

$$\mathbf{A}_t^{\mathrm{eff}} = \prod_{\tau=1}^{T} \mathbf{A}_t^{(\tau)} = \prod_{\tau=1}^{T} \mathrm{Diag}\!\bigl(\boldsymbol{\alpha}_t^{(\tau)}\bigr)\!\left(\mathbf{I} - \beta_t^{(\tau)}\,\mathbf{k}_t^{(\tau)}\mathbf{k}_t^{(\tau)\top}\right).$$

When the per-loop keys $\{\mathbf{k}_t^{(\tau)}\}$ are orthogonal (which diverse intermediate representations approach in practice), the product erases $T$ distinct directions in memory — yielding a rank-$T$ perturbation and directly multiplying the state-tracking capacity without any added parameters.
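The rank claim can be checked numerically in a simplified setting. The sketch below assumes exactly orthonormal per-loop keys, no decay ($\boldsymbol{\alpha} = \mathbf{1}$), and $\beta = 1$; the helper names and the choice to apply later loops on the left are illustrative assumptions.

```python
import numpy as np


def effective_transition(alphas, betas, keys):
    """Compose per-loop transitions A^(tau) = Diag(alpha)(I - beta k k^T).

    alphas: (T, d), betas: (T,), keys: (T, d).
    Later loops are applied on the left (an assumed ordering).
    """
    d = keys.shape[1]
    A_eff = np.eye(d)
    for alpha, beta, k in zip(alphas, betas, keys):
        A_tau = np.diag(alpha) @ (np.eye(d) - beta * np.outer(k, k))
        A_eff = A_tau @ A_eff
    return A_eff


def orthonormal_keys(rng, T, d):
    """Return T orthonormal keys of dimension d (requires T <= d)."""
    q, _ = np.linalg.qr(rng.normal(size=(d, T)))   # q has orthonormal columns
    return q.T


rng = np.random.default_rng(0)
d, T = 8, 3
keys = orthonormal_keys(rng, T, d)
A_eff = effective_transition(np.ones((T, d)), np.ones(T), keys)

# Rank of the deviation from the identity, i.e. the number of erased directions.
perturbation_rank = np.linalg.matrix_rank(np.eye(d) - A_eff, tol=1e-8)
```

In this setting `perturbation_rank` equals `T`, and `A_eff` maps every per-loop key to zero. With a single loop the same computation gives rank 1.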

Sparse attention: $\mathcal{O}(Tw)$ receptive field

A sliding-window block with window $w$ restricts each query at position $t$ to attend only to tokens $\mathcal{I}_t^{(1)} = \{t - w + 1, \ldots, t\}$. After $T$ loop iterations, information propagates further each loop, and chaining this inductively gives:

$$\mathcal{I}_t^{(T)} \supseteq \bigl\{\max(1,\, t - Tw + 1),\, \ldots,\, t\bigr\}, \qquad...
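The growth of the receptive field can be illustrated by chaining a sliding-window mask. This sketch makes several assumptions that are not from the source: positions are 0-indexed, the window counts the current token, and each application corresponds to one sliding-window layer (with one such layer per loop, the number of applications is $T$). The function names are hypothetical.

```python
import numpy as np


def sliding_window_mask(seq_len, window):
    """mask[t, s] is True when query t may attend to key s (causal, width `window`)."""
    idx = np.arange(seq_len)
    offset = idx[:, None] - idx[None, :]
    return (offset >= 0) & (offset < window)


def receptive_field(seq_len, window, n_applications):
    """reach[t, s] is True when position s can influence output t after
    stacking the sliding-window mask `n_applications` times."""
    mask = sliding_window_mask(seq_len, window).astype(np.int64)
    reach = np.eye(seq_len, dtype=np.int64)
    for _ in range(n_applications):
        # t reaches s if t already reaches some u and u attends to s
        reach = np.minimum(reach @ mask, 1)
    return reach.astype(bool)


rf = receptive_field(seq_len=64, window=4, n_applications=3)
span = int(rf[40].sum())   # number of positions visible to token 40
```

Under these assumptions each application extends the reach by `window - 1` positions, so the span after $n$ applications is $n(w-1)+1$ away from the sequence start, which grows as $\mathcal{O}(Tw)$. The exact constant depends on the window convention and on how many sliding-window layers each loop contains.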
