F-bombs don't make LLMs smarter


F-bombs don't make LLMs smarter – Zoltan's Blog


June 12, 2026

Imagine someone asks you to solve a math puzzle right as you step on a LEGO brick. Are you going to do better or worse than the baseline? Relatedly, can we make LLMs step on LEGO bricks?

This post would have been a lot funnier with a positive result, but in the spirit of publishing null results, here it goes.

A better aimed arrow

OpenAI’s 2024 o1 model was the first ever to beat humans in answering scientific questions.1 The secret sauce was RL-trained chain-of-thought (CoT) reasoning generated at test-time.

One way to think about CoT: the model refines the original prompt. The user prompt is a crude query that drops you somewhere vague in the model’s representation space. In the reasoning chain, the model elaborates on its own context, getting a higher quality, more laser-focused query to retrieve the next token by.

The last step in an LLM forward pass is multiplying the last residual stream state $h$ with the unembedding matrix $W_U$. The residual stream state is a large vector whose direction gets refined at each model layer. It is the list of numbers the model compressed its prompt plus its relevant “knowledge” into. A more elaborate prompt with added constraints, relevant facts, and a verbalized plan makes for a “better-aimed arrow” in that last vector.

\[h_{\text{blunt}} = f(\text{prompt}) \qquad h_{\text{sharp}} = f(\text{prompt} + \text{reasoning chain})\]

\[p_{\text{blunt}} = \mathrm{softmax}(h_{\text{blunt}} \cdot W_U) \qquad p_{\text{sharp}} = \mathrm{softmax}(h_{\text{sharp}} \cdot W_U)\]

The difference $h_{\text{sharp}} - h_{\text{blunt}}$ is what CoT is buying you, geometrically speaking. The result is a higher-quality token being selected.
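To make the geometry concrete, here is a toy sketch of that last step in pure Python. Everything numeric in it is made up for illustration: the three-token vocabulary, the two-dimensional residual vectors, and the entries of `W_U` are assumptions, and a real model works with thousands of dimensions. It only shows how a better-aimed $h$ shifts probability mass toward one token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(h, W_U):
    """Compute logits = h . W_U for h of length d and W_U of shape d x vocab."""
    vocab_size = len(W_U[0])
    return [sum(h[i] * W_U[i][j] for i in range(len(h))) for j in range(vocab_size)]

# Hypothetical unembedding matrix: 2 dimensions x 3 tokens.
W_U = [
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
]

# Hypothetical residual states: "blunt" points vaguely between tokens 0 and 1,
# "sharp" points firmly along the direction of token 0.
h_blunt = [0.5, 0.4]
h_sharp = [2.0, 0.1]

p_blunt = softmax(unembed(h_blunt, W_U))
p_sharp = softmax(unembed(h_sharp, W_U))

print("p_blunt:", [round(p, 3) for p in p_blunt])
print("p_sharp:", [round(p, 3) for p in p_sharp])
```

In this toy setup the sharp vector concentrates most of the probability on token 0, while the blunt one spreads it across the first two tokens.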

The human analog of “thinking out loud” is often cited but the mechanism is likely very different. Yet, chain-of-thought appears to be a remarkably anthropomorphic behavior of LLMs, and it’s not the only one.

Wait…

How does the model know how long to “think” for? The simple answer: the CoT ends when the end-of-thinking token is generated. CoT models are trained to wrap the reasoning chain between special tokens (something like `<think>` and `</think>`) before generating the actual answer. That training makes `</think>` the highest-scoring token after the chain-of-thought has run its course. The length of the chain is not directly controlled; it emerges from the model’s learned sense of when it has done enough reasoning.

But what if you could control the chain’s length at generation time? Muennighoff et al. did this in both directions in their paper titled s1: Simple test-time scaling. In a process they call budget forcing, they simply inject an end-of-thinking token when the generated chain exceeds a limit. What happens if you want the CoT to be longer, not shorter? The paper lists several strategies. They found that the best-performing solution is injecting the word “Wait” into the thinking chain when the model wants to stop.

Something like this:

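```python
# Pseudocode: encode() and generate() stand in for your tokenizer and model.
stop = encode("</think>")
tokens = encode(prompt)
for i in range(n_ignore + 1):
    # Generate until the model emits the end-of-thinking token.
    out = generate(tokens, stop="</think>")
    tokens += out.tokens
    # Give up on extending once the length cap or the ignore budget is hit.
    if out.hit_length_cap or i == n_ignore:
        break
    # Strip the end-of-thinking token and nudge the model to keep going.
    tokens = tokens[:-len(stop)]
    tokens += encode("Wait")
```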

This forced-in token makes the model doubt its original reasoning and re-check its work. The resulting model beat o1-preview by up to 27% on competition math.2
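For readers who want to see the loop actually run, here is a self-contained toy version. It is a sketch, not the s1 implementation: `fake_generate` is a hypothetical stand-in for a model call, the `</think>` delimiter is an assumed placeholder for whatever end-of-thinking token your model uses, and tokens are plain strings rather than IDs.

```python
END = "</think>"  # assumed end-of-thinking delimiter, for illustration only

def fake_generate(tokens):
    """Hypothetical model call: emits two reasoning tokens, then END."""
    round_no = sum(1 for t in tokens if t == "Wait")
    return [f"step{round_no}a", f"step{round_no}b", END]

def budget_force(prompt_tokens, n_ignore, generate=fake_generate, filler="Wait"):
    """Suppress the first n_ignore attempts to stop thinking.

    Each time the model tries to close its reasoning, the END token is
    dropped and `filler` is appended instead, so generation continues.
    """
    tokens = list(prompt_tokens)
    for i in range(n_ignore + 1):
        tokens += generate(tokens)
        if i == n_ignore:
            break  # budget used up: let the model stop this time
        if tokens[-1] == END:
            tokens.pop()  # remove the end-of-thinking token
        tokens.append(filler)
    return tokens

trace = budget_force(["<think>"], n_ignore=2)
print(" ".join(trace))
```

Swapping `filler` for another string is all it takes to try a different injected token, which is the knob the rest of this post turns.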

Does Claude do this?

There is no public documentation on this, and it is most likely an artifact of the reasoning RL, but I couldn’t help noticing “Wait” and “Actually” all the time in Claude’s reasoning chain and sometimes the response too. Below is a particularly dumb example showing how self-doubt can be counter-productive.

Budget forcing can come in handy when your revenue increases one token at a time, but I will take that tinfoil hat off.

Forced self-doubt is another surprising anthropomorphism. So is emotional stimulus, as demonstrated by EmotionPrompt (Li et al. 2023). The benchmarked models performed better when bullied into giving better answers with prompt additions like “This is very important to my career” and “You’d better be sure”.

The s1 paper tested other injected tokens (“Hmm” and “Alternatively”) for budget forcing. I had to try a few more, spicier ones for that emotionally stimulating stepping-on-LEGO effect.

Fuck Shit Oh Snap

The candidates came straight from the things-a-five-year-old-finds-hilarious bucket.

| String | AIME24 | MATH500 | GPQA Diamond |
| --- | --- | --- | --- |
| Wait (s1 best) | 53.3 | 93.0 | 59.6 |
| Hmm | 50.0 | 93.0 | 60.1 |
| no string | 50.0 | 91.6 | 55.1 |
| no budget forcing | 50.0 | 92.4 | 57.1 |
| Oh | 50.0 | 92.2 | 57.1 |
| Snap | 50.0 | 91.2 | 58.6 |
| Fuck | 50.0 | 92.4 | 58.6 |
| Shit | 46.7 | 92.4 | 56.1 |

Fuck ties Oh on AIME24 and lands within 0.2 points of it on MATH500. Shit is, if anything, slightly worse. Wait remains the best-performing budget forcing token, but I’d welcome new suggestions to try.
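To put these gaps in perspective, it helps to convert percentages back into solved problems. The sketch below does that for a few rows of the table above. The benchmark sizes (30, 500 and 198) are not stated in this post; they are an assumption based on the commonly cited sizes of these test sets, so treat the counts as approximate.

```python
# Assumed benchmark sizes (not from this post).
BENCH_SIZES = {"AIME24": 30, "MATH500": 500, "GPQA Diamond": 198}

# Scores in percent, copied from the table above.
SCORES = {
    "Wait": {"AIME24": 53.3, "MATH500": 93.0, "GPQA Diamond": 59.6},
    "no budget forcing": {"AIME24": 50.0, "MATH500": 92.4, "GPQA Diamond": 57.1},
    "Oh": {"AIME24": 50.0, "MATH500": 92.2, "GPQA Diamond": 57.1},
    "Fuck": {"AIME24": 50.0, "MATH500": 92.4, "GPQA Diamond": 58.6},
    "Shit": {"AIME24": 46.7, "MATH500": 92.4, "GPQA Diamond": 56.1},
}

def solved(score_pct, n_problems):
    """Convert a percentage score into an approximate count of solved problems."""
    return round(score_pct / 100 * n_problems)

def delta_vs_baseline(string, baseline="no budget forcing"):
    """Difference in solved problems per benchmark, relative to the baseline row."""
    return {
        bench: solved(SCORES[string][bench], n) - solved(SCORES[baseline][bench], n)
        for bench, n in BENCH_SIZES.items()
    }

for name in ("Wait", "Oh", "Fuck", "Shit"):
    print(name, delta_vs_baseline(name))
```

Under these assumed sizes, a 3.3-point swing on AIME24 is a single problem, which is worth keeping in mind when reading the table.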

(Note: The tests above were run on n_ignore=2, that is, we ignored...
