DeepSeek Sparse Attention

LLMs-from-scratch/ch04/09_dsa at main · rasbt/LLMs-from-scratch · GitHub

//files/disambiguate" data-turbo-transient="true" />

Search or jump to...

Search code, repositories, users, issues, pull requests...

-->

Clear

Search syntax tips

Provide feedback

--> We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel

Submit feedback

Saved searches

Use saved searches to filter your results more quickly

-->

Name

Query

To see all available qualifiers, see our documentation.

Cancel

Create saved search

//files/disambiguate;ref_cta:Sign up;ref_loc:header logged out"}" Sign up

Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

rasbt

LLMs-from-scratch

Public

Notifications You must be signed in to change notification settings

Fork 14.7k

Star 95.7k

FilesExpand file tree

main

/09_dsa Copy path

Directory actions

More options More options

Directory actions

More options More options

Latest commit

History History History

main

/09_dsa

Top

Folders and files NameNameLast commit message Last commit date parent directory .. README.md

README.md

gpt_with_kv_dsa.py

test_dsa.py

View all files

README.md Outline DeepSeek Sparse Attention (DSA)

This bonus material implements the DeepSeek Sparse Attention (DSA) mechanism introduced in DeepSeek-V3.2 and first published in the experimental DeepSeek-V3.2-Exp release.

The overview below follows the DSA discussion in From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates.

Introduction

Standard causal self-attention attends to all previous tokens for each query, yielding O(L²) compute and O(L) KV-cache growth with sequence length L.

Sliding Window Attention (SWA) already showed that restricting attention to a fixed local window substantially reduces this cost. In SWA, each query token attends only to a local span of nearby previous tokens.

Figure 1. Sliding-window attention restricts each query token to a fixed local context window.

DSA uses the same broad idea of attending to only a subset of previous tokens. However, it replaces the fixed window with a learned selection mechanism. For each query token, the model scores candidate past tokens and keeps only the most relevant ones.

Figure 2. DeepSeek Sparse Attention selects a learned subset of past tokens for each query token.

Architecture overview

DSA adds two components on top of standard attention.

1. Lightning Indexer

For each query token $t$ and every candidate past token $s$, the indexer computes a scalar relevance score. This implementation makes the scale factors from the reference code explicit:

$$I_{t,s} = \sum_{j=1}^{H_I} \frac{w_{t,j}}{\sqrt{H_I}} \cdot \text{ReLU}\left(\frac{q_{t,j} \cdot k_s}{\sqrt{d_I}}\right)$$

where:

$H_I$ is the number of lightweight index heads,

$q_{t,j}$ is the indexer query vector for token $t$ and head $j$,

$k_s$ is a shared indexer key vector for past token $s$,

$w_{t,j}$ is a learned per-head gate scaled by $1 / \sqrt{H_I}$.

The ReLU zeroes out negative dot-product contributions, and the gated sum aggregates across index heads into a single relevance score per past token.

In the full DeepSeek model, the indexer works with the compressed token representations from Multi-Head Latent Attention (MLA). This folder keeps the GPT implementation simpler and computes the indexer queries and keys from the regular hidden states.

2. Token Selector

After computing all indexer scores, only the top-K highest-scoring positions are kept. All other positions are masked to −∞ before the standard softmax, so the model effectively attends to only $k \ll L$ tokens.

The ReLU in the indexer is not where the final sparsity comes from. Since the scores are summed over multiple index heads, most final scores can still be nonzero. The token selector creates the sparse pattern by keeping only the top-K positions.

In a fused production implementation, this can lower attention compute from O(L²) to O(L·k). The implementation here keeps the standard dense attention score matrix and applies the DSA-selected top-K mask before softmax. This makes the selection logic easy to inspect, but it does not provide the fused-kernel compute savings.

The figure below summarizes the flow. The lightning indexer scores candidate tokens, the selector keeps top-K positions, and the resulting mask restricts the usual attention softmax.

Figure 3. DSA first scores candidate tokens, then keeps the top-K tokens for the final attention mask.

Implementation

gpt_with_kv_dsa.py provides:

Class Description

LightningIndexer Lightweight multi-head scorer for past-token...

DeepSeek Sparse Attention

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits