Meta‑Attention Is All You Need

Artem X

23 min read· Just now

Introduction

In this article I want to talk about an interesting finding from my experiments with language models, which I decided to call “meta-transformers”. Either I found something genuinely interesting, or I mistook wishful thinking for reality. Only a technically competent outside observer can give an objective assessment, which is why I published this text. Specialists in transformer architecture are especially welcome here.

Model weights, project source code, and all documentation are linked at the end of the article, in the Sources section: Hugging Face for the weights, Codeberg (a GitHub-like platform) for the code. The project originally had Russian documentation and comments; I translated both into English with Codex for the global community. Codeberg contains both the original RU version and the translated ENG version. The article itself also lives on Codeberg, in both Russian and English, in the root directory as meta-attention-is-all-you-need.md.

You can find the preview diagram at the beginning of the Architectural Diagrams section.

upd: I changed the cover to a nicer one; nothing else in the article changed.

All sections:
Important notes
Getting acquainted with meta-transformers
Detailed component breakdown
Detailed training breakdown
Experiments
Architectural diagrams
Conclusion
Sources

1. Important notes

The information in this section is not required to understand the architecture. I still recommend reading it, but you can skip straight to the architecture description in the “Getting acquainted with meta-transformers” section if you want.

Given how specific this project and its related concepts are, and not wanting to look like yet another mad inventor who claims to have solved every Millennium Prize problem at once, I put quite a few remarks into this section. I recommend reading them before moving on to the main material.

This is a classic weekend project that I worked on in my free time outside my job. It would be disappointing if the idea failed, but I do not lose much either way, so I think I can be fairly objective here and open to criticism.

The title reference

Some informed readers may have noticed that the article title references the 2017 paper “Attention Is All You Need”, which first described the transformer architecture. Of course, I am not putting my idea on the same level as that paper; the mechanism and operating principle are simply fairly similar. Still, I cannot evaluate the significance of this idea myself, or whether it has any significance at all. I lack the expertise and, most importantly, competent feedback. That is why, again, you are reading this text.

Uniqueness

Since the idea, in its most general form, seems fairly obvious and simple, it is entirely possible that someone has already tried it and I simply did not search well enough. I would be glad if you pointed that out.

Another project with the same name

If you search Google, you may find another “meta-transformer” architecture that also modifies transformers. That is where the similarities end. In short, it is a framework for unifying 12 modalities by providing a common token space for them. Why it was called meta-transformers is anyone’s guess; most likely it was just for a nice name. Technically, it would be more accurate to call it a meta-modal architecture. To check that I am not misrepresenting it, you can read the paper about that architecture here.

Experiment metrics

I recommend not taking the reported numbers on faith. I am one programmer, not especially brilliant, with a pet project I worked on in my free time. I could easily have made mistakes. If you have the expertise and the desire to run your own tests, I would be glad if you shared them in the comments or by DM.

Origins and duration of the experiments

The earliest sketches of this architecture appeared back in August 2025, but they have little in common with where the idea eventually went. Back then it was called a “reflexive core”, and the goal was to teach a language model to “think about its own thinking”. In its current form, the project appeared in March of this year and took roughly one month of intensive work with Claude Code on the Max 5x plan, plus about $30 on vast.ai for training.

2. Getting acquainted with meta-transformers

The meta-transformer architecture at the beginning of the experiments and in the latest phase shares the same general principle but differs in the details. This is an overview article, so it focuses mostly on the latest version. Information about all phases is available in the source code.

General principle

Imagine a model that takes text as input and generates a continuation. When it receives tokens, vectors of numbers arise inside each layer. These are called...
