I had an AI run my "revolutionary" idea overnight, then tell me it wasn't novel

I Let an AI Run a “Revolutionary” Idea on My MacBook Overnight. The Best Part Was Being Told It Wasn’t New. | by Jmafonsomoreira | Jun, 2026 | Medium


I Let an AI Run a “Revolutionary” Idea on My MacBook Overnight. The Best Part Was Being Told It Wasn’t New.

Jmafonsomoreira

5 min read· Just now


A small story about a backspace key, a 94%-vs-4% result, and why the most useful thing my AI collaborator did was disappoint me.

It started as a shower thought. Large language models write one word at a time, always forward, never back. Each token is final the instant it’s chosen. And I wondered: what’s the smallest change that would let one undo, to type a token and then hit backspace, like a human writing a rough draft?

I asked Claude. The answer was elegant: you don’t need a new architecture. You add one token to the vocabulary (call it BKSP) and a rule for what it means. When the model emits it, you pop the last token off a little stack instead of appending. That’s it. One extra entry in a lookup table turns a write-only model into one that can revise.

Nice theory. But does it actually buy anything?

So I did the thing you can finally do in 2026: I told Claude to go test it. On my MacBook. For real.

What “go test it” turned into

I half-expected a toy script. Instead, over a few hours on an M4 Max, Claude:

built a small transformer from scratch (~6.5M parameters, plain PyTorch on Apple’s GPU);
invented a clean task to measure the thing: navigating little mazes shaped like trees, full of dead ends, where the only way through is to explore, hit a wall, and back up;
and, the part that genuinely surprised me, stress-tested its own experimental design with a panel of adversarial reviewers before spending a single GPU-second, hunting for ways the result could be fake (data leakage, shortcuts, unfair comparisons). It found several and fixed them.

The setup ended up being a clean four-way fight, with the same model and data in every corner, differing only in how the model is allowed to answer:

DIRECT: the normal model. It must output the path straight away. No scratch work.
FILLER: it gets to emit a bunch of extra “thinking” tokens, but they’re meaningless filler. (A control: does more output alone help?)
DELETE: it can backtrack, but the abandoned steps are truly erased from what it can see. Real “delete.”
KEEP: same backtracking, but the crossed-out attempts stay visible in its context; the backspace only cleans up the final answer. Think pencil marks you strike through but leave in the margin.

Every model was tested on mazes it had never seen: not just new layouts, new structures. And nothing was graded by vibes: a checker mechanically verified whether the path was real.

The result that made me sit up

On mazes it had never seen, here’s how often each version found a real path:

just answer (DIRECT): 4%
emit filler tokens (FILLER): 12%
backtrack but forget (DELETE): 78%
backtrack and remember (KEEP): 94%
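To make the difference between the two backtracking arms concrete, here is a minimal sketch of the BKSP rule as described above. It is not the code from the experiment: the `<BKSP>` spelling, the function names, and the choice to treat a backspace on an empty stack as a no-op are all assumptions made for illustration.

```python
BKSP = "<BKSP>"  # hypothetical spelling of the backspace token


def final_answer(emitted):
    """Replay an emitted token stream, popping on BKSP instead of appending."""
    stack = []
    for tok in emitted:
        if tok == BKSP:
            if stack:  # assumption: BKSP on an empty stack does nothing
                stack.pop()
        else:
            stack.append(tok)
    return stack


def visible_context(emitted, mode):
    """Return what the model would condition on next under each arm."""
    if mode == "DELETE":
        # Abandoned steps are erased: only the surviving stack is visible.
        return final_answer(emitted)
    if mode == "KEEP":
        # Crossed-out attempts (and the BKSP marks) stay in the context.
        return list(emitted)
    raise ValueError(f"unknown mode: {mode!r}")


# A walk that tries B (dead end), then C -> D (dead end), then E.
emitted = ["A", "B", BKSP, "C", "D", BKSP, BKSP, "E"]
print(final_answer(emitted))               # ['A', 'E']
print(visible_context(emitted, "DELETE"))  # ['A', 'E']
print(visible_context(emitted, "KEEP"))    # the full stream, dead ends included
```

Both arms produce the same final answer. The only difference is whether the dead ends at B and C -> D are still in view when the next token is chosen.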

Figure caption: Same model, same data; only the output format changes.

The same network. Same size, same training data. Give it the ability to back up and it solves 94% of the puzzles. Take it away and it solves 4%, below the rate you’d get by walking the maze at random.

Two details I love.

DIRECT doesn’t fail by giving up; it fails by confidently lying. It memorized its training mazes perfectly and then, on a brand-new maze, marched straight through walls that weren’t there. It wasn’t searching. It was reciting.

Remembering beats deleting, and exactly as much as it should. KEEP and DELETE are identical when no backtracking is needed. But the more the maze forces you to back up, the further DELETE falls behind. Erasing your failed attempts means forgetting what you already tried, so you try them again. The value of “backspace” isn’t the deleting. It’s keeping the record of where you’ve been.

I’ll be honest: at 2am, this felt like something.

The turn

The next morning I asked the obvious question, the one I should have asked first: is any of this new?

So Claude went and checked. It ran a proper literature hunt, then read the closest papers and reported back, blunt:

No. The headline isn’t new. The mechanism isn’t new.

That “training on search traces with backtracking beats training on the answer” result? That’s Stream of Search and Searchformer (2024). My direct-vs-backtrack gap is a faithful re-run of theirs.

The backspace token itself? SequenceMatch (ICLR 2024) already added a backspace token that rolls back the model’s memory, which is literally my DELETE arm. A safety paper from Meta already built something like my KEEP.

Both halves of my “novel” idea existed, in separate papers, before I took my shower.

The only sliver that wasn’t already in print was the head-to-head: keep vs delete, and the way the gap grows with depth. And even that is, if I’m honest, obvious once you say it out...
