Testing MiniMax M3 on refactoring, screenshot debugging, music recommendations

Testing MiniMax M3 on real tasks: repo refactor, screenshot debugging, and Spotify recommendations – Andrey Lukyanenko

10 June 2026

I got early access to MiniMax M3, so I plugged it into Claude Code and used it on a few tasks I had wanted to finish for some time: a code audit and refactor of my old web game, two UI bugs in it that I had been putting off, and a music-recommendation experiment built from my Spotify history. I used M3 for the implementation work, then asked Opus 4.8 to review it.

M3 is the first open-weights model (it will soon be fully open-sourced on HuggingFace and GitHub) to combine three things in one release: frontier-level coding and agentic ability, a 1M-token context window, and native multimodality. I reviewed MiniMax M2.7 earlier, and M3 is a clear step up from it in the areas I tested.

M3 was most useful when I gave it concrete artifacts — a repo, tests, screenshots, and data exports. It did a lot of real work quickly, but an independent review still caught some regressions.

What MSA is, and why MiniMax keeps changing its attention

MiniMax has changed its attention twice (if you want to know more about attention, you can read my note). MiniMax-01 and M1 used lightning attention, a linear-attention variant, in a 7:1 hybrid: seven linear layers per softmax layer. M2 and M2.7 then reverted to full attention; the team’s candid post “Why Did M2 End Up as a Full Attention Model?” blamed linear attention’s precision sensitivity, immature infra, and multi-hop deficits, all of which are costs of approximating the softmax.

M3 uses MiniMax Sparse Attention (MSA), which keeps the softmax exact and only narrows where it runs. An index branch cheaply scores blocks of context (one lightweight query per GQA group → block-max-pool → top-k), then the real query heads run ordinary full attention over just the selected blocks. MiniMax reports it running 4× faster than Flash-Sparse-Attention, at ~1/20 the per-token compute of M2, with 9× prefill and 15× decode speedups. These are their own numbers, and they cannot be reproduced until the weights and report ship.
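To make the select-then-attend idea concrete, here is a minimal pure-Python sketch for a single query. It is not MiniMax's implementation: the function names, the separate `index_keys`, the block size, and the single-head setup are all assumptions made for illustration, since the weights and report are not out yet.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(index_query, index_keys, block_size, top_k):
    """Index branch (hypothetical): score tokens cheaply, max-pool per block, keep top-k blocks."""
    scores = [dot(index_query, k) for k in index_keys]
    block_scores = [
        max(scores[i:i + block_size])
        for i in range(0, len(scores), block_size)
    ]
    ranked = sorted(range(len(block_scores)),
                    key=lambda b: block_scores[b], reverse=True)
    return sorted(ranked[:top_k])  # block ids, back in positional order

def attend(query, keys, values, token_ids):
    """Exact softmax attention, restricted to the given token positions."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[t]) * scale for t in token_ids])
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids))
            for d in range(dim)]

def sparse_attention(query, index_query, keys, index_keys, values,
                     block_size, top_k):
    """Select blocks with the index branch, then run ordinary attention over them."""
    blocks = select_blocks(index_query, index_keys, block_size, top_k)
    token_ids = [
        t
        for b in blocks
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    return attend(query, keys, values, token_ids), blocks
```

If `top_k` covers every block, the result is identical to full attention, because the softmax itself is never approximated. The output only changes when the selector leaves out a block that carried weight.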

So MSA “matching full attention on the vast majority of capabilities” isn’t surprising: it doesn’t approximate, it selects. The only thing that can break is the selection — drop a block that mattered and the answer is gone. The real question is how good the selector is at long range.

Auditing and refactoring an old idle game

A year ago, I vibe-coded an idle game, Eternum Alchemist, with Sonnet, and I wanted to pick it up again. Before adding anything new, I asked M3 to carefully review the code for bugs, security issues, and logic problems. It spent roughly 30 minutes reading and analyzing the repository, which isn’t surprising given that the repository has ~100 files and ~26k lines of code.

The report was quite good. It was organized by severity (12 critical, around 20 high, 30 medium, 20 low), carried file paths and line numbers, and included a recommended order of work. Some of the most important issues were:

shouldAttack used an integer-modulo model that made every enemy with an attack speed above 1 always attack, so the snake monster was effectively slower than intended.

A lot of unfinished code/configs. For example, after skills reached prestige, they couldn’t level up, because their XP scaling was nested under ranks[rank] while the function read a top-level field and got NaN.
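To illustrate the first issue, here is one way an integer-modulo attack check can flatten every attack speed above 1 into "attack on every tick". This is a hypothetical reconstruction in Python, not the game's actual code: `should_attack_modulo`, `count_attacks`, and the tick model are assumptions, and the accumulator variant is just one common alternative, not necessarily the fix M3 applied.

```python
def should_attack_modulo(tick, attack_speed):
    """Hypothetical buggy model: attack once every `interval` ticks."""
    # For any attack_speed above 1, int(1 / attack_speed) is 0, so the
    # interval clamps to 1 and the enemy simply attacks on every tick.
    interval = max(1, int(1 / attack_speed))
    return tick % interval == 0

def count_attacks(attack_speed, ticks):
    """Accumulator model: carry fractional attacks over between ticks."""
    total = 0
    accumulator = 0.0
    for _ in range(ticks):
        accumulator += attack_speed
        attacks = int(accumulator)   # whole attacks available this tick
        accumulator -= attacks       # keep the fractional remainder
        total += attacks
    return total

# Over 10 ticks, the modulo model treats speed 2.5 exactly like speed 1.0.
fast = sum(should_attack_modulo(t, 2.5) for t in range(10))
normal = sum(should_attack_modulo(t, 1.0) for t in range(10))
```

Under this assumed model, a fast enemy gets no more attacks than a speed-1 enemy, which is the sense in which it ends up slower than intended. The accumulator version lets the number of attacks grow in proportion to the attack speed.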

I asked M3 to fix all issues. It worked for ~2h 40m across three phases, increased the number of tests from 188 to 237, and most of the fixes were correct and well tested.

But then I asked Opus to review the changes, and it found two critical regressions that M3’s own green tests had hidden.

M3 added schema validation to the import path, which changed the data format and made it conflict with the save format. Thankfully, the game is in an alpha or pre-alpha stage, so this is fine, but in production it would have broken existing saves.

M3 fixed non-working multipliers, but forgot that the crit hit chance was applied in two places, which resulted in it scaling as 1.05 to the power of twice the level. This was exactly the config-drift pattern that the audit itself had flagged elsewhere but that went unfixed here.
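The arithmetic behind the second regression is easy to check. The sketch below assumes a per-level factor of 1.05, as described above; the function names and the `base` parameter are hypothetical and do not come from the game's code.

```python
CRIT_FACTOR_PER_LEVEL = 1.05  # per-level factor mentioned in the review

def crit_multiplier(level):
    """Intended scaling: 1.05 ** level."""
    return CRIT_FACTOR_PER_LEVEL ** level

def crit_applied_once(base, level):
    """The multiplier is applied in a single place, as intended."""
    return base * crit_multiplier(level)

def crit_applied_twice(base, level):
    """The same multiplier is applied in two places (the regression)."""
    after_first = base * crit_multiplier(level)
    return after_first * crit_multiplier(level)

def relative_error(a, b):
    return abs(a - b) / abs(b)

# Applying the multiplier twice is the same as doubling the exponent.
doubled_exponent_gap = relative_error(
    crit_applied_twice(1.0, 10), CRIT_FACTOR_PER_LEVEL ** 20
)
```

A test that only asserts "the multiplier now has an effect" passes in both cases, which is one way green tests can hide this kind of bug. A test that pins the expected value at a given level would fail on the doubled version.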

Other than that, Opus found that six fixes were partial and six issues were untouched. My takeaway is that M3 did a large amount of correct, well-structured work quickly. But it was my mistake to let M3 both write the tests and fix the code issues. Next time, I’ll use two separate sessions for those two jobs.

Two UI bugs that needed a screenshot

The next two problems were UI-related, and that’s where M3’s multimodality came in handy.

The first was a freeze. On the Skills screen, clicking a skill froze the whole panel, and every click after the first did nothing. Describing...
