The usual implementation of attention transformers (SDPA) is kind of bad, actually

celoyd/antisdpa.md

Last active January 15, 2026 21:56

Introduction

I was writing a note to a friend that mentioned my tedious opinions on “AI” discourse. It veered off into my usual argument that big “AI” companies are shaping the industry ecosystem to their own ends by setting up a situation where expensive-to-run models are overvalued. I think they’re doing this because they have a competitive advantage in that tier of the market, having bought (time on) a lot of GPUs. It’s like how a company that owns diamond mines will probably promote the idea that large, mined diamonds are important and valuable, and that there’s something off about running a sub-industrial mine or lab-growing diamonds. You can do this without lying at all, but I still dislike it. Large mined diamonds here are $O(n^2)$ models.

To support this argument, I started making my case against the necessity of the standard transformer model. I admit that the case is scattershot and circumstantial. It’s not that SDPA (the normal transformer architecture) is a fraud, or that there is something much better ready to replace it everywhere and immediately. But maybe I can sow some doubts that SDPA is as good as the median ML practitioner assumes, and raise some hopes for better kinds of models in the pipeline.

That got out of hand in the e-mail I was writing, so I cut it out and put it here.

This note covers:

- how some standard ML model families work, not in great depth but in order to have some context around…
- how SDPA (the standard transformer) works;
- some specific reasons I dislike SDPA; and
- some things I hope might replace it.

This note does not make:

- Normative judgments about any person or organization mentioned or not mentioned. I have very strong opinions about some of them, especially ones not mentioned, and my points here underlie some of those opinions. But this note is not those opinions.
- Any airtight case that SDPA is bad. If you love SDPA, you will probably still love SDPA after reading. That’s fine with me.
- A nice, brief, well-organized argument. It was written in one sitting, and when I came back to trim it down I accidentally added more. (And removed an embarrassing mistake where I said RWKV uses SSMs. I don’t know why I said that.)

Seven years ago, if you asked for the general architectures of the most studied and most widely applied ML models, you might get this list:

1. Fully connected networks (FCNs)

All inputs are fed in at once. A multi-layer perceptron (or a recognizable development of one) digests it, and you get some output.
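As a rough illustration of "all inputs are fed in at once", here is a minimal sketch of a two-layer perceptron in plain Python. Everything in it is an assumption made for the example: the 8 × 8 input, the layer widths, the ReLU nonlinearity, the random initialization, and the helper names (`dense`, `relu`, `init_layer`, `mlp_forward`, `count_weights`). It is not taken from any particular library or paper.

```python
import random


def dense(x, weights, biases):
    """One fully connected layer: every output is a weighted sum of every input."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    """Elementwise nonlinearity (an assumed choice for this sketch)."""
    return [max(0.0, a) for a in v]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; shapes are (n_out x n_in) and (n_out)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def mlp_forward(x, layers):
    """Run the input through each layer, with ReLU between layers but not after the last."""
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x


def count_weights(layers):
    """Number of weights (excluding biases): sum of n_in * n_out over layers."""
    return sum(len(w) * len(w[0]) for w, _ in layers)


rng = random.Random(0)
# Hypothetical 8 x 8 grayscale input, flattened so all 64 values enter at once.
grid = [[rng.random() for _ in range(8)] for _ in range(8)]
x = [value for row in grid for value in row]

layers = [init_layer(64, 16, rng), init_layer(16, 4, rng)]
y = mlp_forward(x, layers)
print(len(x), len(y), count_weights(layers))  # 64 4 1088
```

The weight count of each layer is the product of its input and output widths (here 64 × 16 + 16 × 4 = 1088), so it grows quickly as the input vector gets longer.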

Early on, these were studied for images, where each pixel is an input, so for example a 1e3 × 1e3 image is a vector of 1e6 inputs. It soon turned out (1) that this was wildly...
