LLMs-from-scratch/ch04/09_dsa at main · rasbt/LLMs-from-scratch · GitHub
//files/disambiguate" data-turbo-transient="true" />
Skip to content
Search or jump to...
Search code, repositories, users, issues, pull requests...
-->
Search
Clear
Search syntax tips
Provide feedback
--><br>We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Cancel
Submit feedback
Saved searches
Use saved searches to filter your results more quickly
-->
Name
Query
To see all available qualifiers, see our documentation.
Cancel
Create saved search
Sign in
//files/disambiguate;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up
Appearance settings
Resetting focus
You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.
Dismiss alert
{{ message }}
rasbt
LLMs-from-scratch
Public
Notifications<br>You must be signed in to change notification settings
Fork<br>14.7k
Star<br>95.7k
FilesExpand file tree
main
/09_dsa<br>Copy path
Directory actions
More options<br>More options
Directory actions
More options<br>More options
Latest commit
History<br>History<br>History
main
/09_dsa
Top
Folders and files<br>NameNameLast commit message<br>Last commit date<br>parent directory<br>..<br>README.md
README.md
gpt_with_kv_dsa.py
gpt_with_kv_dsa.py
test_dsa.py
test_dsa.py
View all files
README.md<br>Outline<br>DeepSeek Sparse Attention (DSA)
This bonus material implements the DeepSeek Sparse Attention (DSA) mechanism introduced in DeepSeek-V3.2 and first published in the experimental DeepSeek-V3.2-Exp release.
The overview below follows the DSA discussion in From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates.
Introduction
Standard causal self-attention attends to all previous tokens for each query, yielding O(L²) compute and O(L) KV-cache growth with sequence length L.
Sliding Window Attention (SWA) already showed that restricting attention to a fixed local window substantially reduces this cost. In SWA, each query token attends only to a local span of nearby previous tokens.
Figure 1. Sliding-window attention restricts each query token to a fixed local context window.
DSA uses the same broad idea of attending to only a subset of previous tokens. However, it replaces the fixed window with a learned selection mechanism. For each query token, the model scores candidate past tokens and keeps only the most relevant ones.
Figure 2. DeepSeek Sparse Attention selects a learned subset of past tokens for each query token.
Architecture overview
DSA adds two components on top of standard attention.
1. Lightning Indexer
For each query token $t$ and every candidate past token $s$, the indexer computes a scalar relevance score. This implementation makes the scale factors from the reference code explicit:
$$I_{t,s} = \sum_{j=1}^{H_I} \frac{w_{t,j}}{\sqrt{H_I}} \cdot \text{ReLU}\left(\frac{q_{t,j} \cdot k_s}{\sqrt{d_I}}\right)$$
where:
$H_I$ is the number of lightweight index heads,
$q_{t,j}$ is the indexer query vector for token $t$ and head $j$,
$k_s$ is a shared indexer key vector for past token $s$,
$w_{t,j}$ is a learned per-head gate scaled by $1 / \sqrt{H_I}$.
The ReLU zeroes out negative dot-product contributions, and the gated sum aggregates across index heads into a single relevance score per past token.
In the full DeepSeek model, the indexer works with the compressed token representations from Multi-Head Latent Attention (MLA). This folder keeps the GPT implementation simpler and computes the indexer queries and keys from the regular hidden states.
2. Token Selector
After computing all indexer scores, only the top-K highest-scoring positions are kept. All other positions are masked to −∞ before the standard softmax, so the model effectively attends to only $k \ll L$ tokens.
The ReLU in the indexer is not where the final sparsity comes from. Since the scores are summed over multiple index heads, most final scores can still be nonzero. The token selector creates the sparse pattern by keeping only the top-K positions.
In a fused production implementation, this can lower attention compute from O(L²) to O(L·k). The implementation here keeps the standard dense attention score matrix and applies the DSA-selected top-K mask before softmax. This makes the selection logic easy to inspect, but it does not provide the fused-kernel compute savings.
The figure below summarizes the flow. The lightning indexer scores candidate tokens, the selector keeps top-K positions, and the resulting mask restricts the usual attention softmax.
Figure 3. DSA first scores candidate tokens, then keeps the top-K tokens for the final attention mask.
Implementation
gpt_with_kv_dsa.py provides:
Class<br>Description
LightningIndexer<br>Lightweight multi-head scorer for past-token...