getting an llm to reliably roll a dice - castform blog
ask any llm to roll a dice, it’ll almost always output 4. gpt, claude, qwen, deepseek, doesn’t matter. 4. 4. 4. 4.
when i asked claude why it does that, it responded with the following:
LLMs don't actually generate random numbers—we predict likely tokens based on training data. When humans are asked to pick a "random" number between 1-6, they disproportionately say 4 (it feels more random than 1, 6, or 3), so that bias gets baked into our outputs.

this is a cute party trick. but it points at something important: the model is not rolling a dice, but picking the most likely token.
this is a problem for reinforcement learning, where the model learns from its own experience. it generates some responses, gets rewarded for the good ones, and the gradient pushes it toward whatever worked.
if the model can only output 4 during training, it never finds out that 2’s ok, or that 6 would’ve been fine, or that any number 1-6 is valid. the gradient has nothing to work with. you can’t reinforce a behavior the model never tried.
exploration buys you two things. better performance: interesting solutions live off the beaten path, e.g. the non-obvious proof step, the clever tool call, the bug fix nobody’s written a Stack Overflow answer about. and better generalization, because a model that knows three ways to solve a problem has somewhere to go when the first two don’t work. one that’s memorized a single path falls off a cliff the moment it sees something novel.
rolling a dice is a useful toy example. if exploration fails here, it’s not going to magically work on reasoning tasks where the right path is one of billions and most of them are wrong.
so: let’s see how bad it is, then see what we can do about it.
the setup
prompt: roll a dice. output the roll in answer tags.
reward: 1 if the number in answer tags is between 1 & 6. 0 otherwise.
algorithm: grpo, the standard llm rl baseline. we set the group size to be 6 (the model generates 6 attempts per prompt).
to simulate real dice, the llm would have to roll each number ~1/6 of the time. let’s see if we can get there.
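a minimal sketch of the reward described above, to make the setup concrete. the post only says "answer tags", so the exact `<answer>…</answer>` format, the regex, and the function name `dice_reward` are assumptions for illustration, not the actual training code.

```python
import re

# assumption: the roll is wrapped as <answer>N</answer>; the exact tag
# syntax is not given in the post.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+)\s*</answer>")


def dice_reward(response: str) -> float:
    """Return 1.0 if the response holds an answer-tagged integer in 1..6, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no answer tags, or nothing numeric inside them
    return 1.0 if 1 <= int(match.group(1)) <= 6 else 0.0
```

note what this reward does not look at: which number came up. a 4 and a 2 score exactly the same, so nothing in it asks for the ~1/6 spread.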
attempt 1: just run grpo
the model starts out rolling 4 most of the time, and as training goes on it learns to roll 4 100% of the time.
this is expected. every time the model rolls a 4, the model gets rewarded. so, from the model’s perspective “always roll a 4” is a perfect policy. there’s really no incentive to do anything different.
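one way to see the "no incentive" point is to look at the group-relative advantage grpo computes. this is a rough sketch assuming the common mean/std normalization within a group; the helper name `group_advantages`, the `eps` value, and the use of population std are illustrative choices, not details from the run above.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# six valid rolls in one group: every reward is 1, so every advantage is 0
all_valid = group_advantages([1, 1, 1, 1, 1, 1])

# five valid rolls plus one malformed answer: the valid ones all get the
# same positive advantage, whichever number they rolled
one_invalid = group_advantages([1, 1, 1, 1, 1, 0])
```

under these assumptions, a group of six valid rolls carries no learning signal at all, and a group with one bad answer pushes equally toward every valid roll in it. if most of those valid rolls are 4s, most of the push lands on 4.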
attempt 2: let’s turn up the heat (temperature)
the obvious first fix: if the model is too confident in 4, make it less confident. temperature is the knob for this. at higher temperature, the model samples more randomly instead of always picking its top choice i.e. 4.
at temp=1.5, the same collapse happens.
at higher temperatures, the model starts acting a little too random 🙂
temperature only changes how we sample from the model. it doesn’t change what the model has learned. the model still thinks 4 is the right answer; we’re just occasionally overriding it with noise. and once training kicks in, any accidental non-4 rolls get reward 1 just like the 4s did, so there’s still no reason for the model to actually shift its beliefs. it keeps betting on 4.
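a small sketch of what temperature does at sampling time. the logits below are made up for illustration (they are not measured from any model), and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical logits for faces 1..6, with face 4 strongly preferred
logits = [0.0, 0.5, 1.0, 5.0, 0.5, 0.0]

p_default = softmax_with_temperature(logits, temperature=1.0)
p_hot = softmax_with_temperature(logits, temperature=1.5)
```

the hotter distribution is flatter, but the logits themselves are untouched and 4 is still the top choice. the model's preference is the same; only the sampling got noisier.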
to actually fix this, we need to change what/how the model is learning, not just how we’re getting responses from it.
attempt 3: entropy bonus
so let’s push on training directly. let’s add a loss term that penalizes the model when it’s overly confident in its answer.
the idea is simple: if the model puts 99% of its weight on “4”, punish it. if it spreads its probability more evenly across answers, reward it (or at least penalize it less). this is called an entropy bonus — entropy here just means “how spread out are the probabilities?”
    from math import log

    # normal training: maximize reward
    loss = -reward

    # with entropy bonus: also punish overconfidence
    probabilities = model.output_probs()  # e.g. [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]
    entropy = -sum(p * log(p) for p in probabilities)
    # low entropy → model is very confident (bad, we penalize this)
    # high entropy → model is spread out (good, lower penalty)
    loss = -reward - α * entropy  # α controls how much we care about diversity

in plain english: the model gets penalized not just for wrong answers, but for being too sure of itself. it’s forced to keep some probability on every option, which means it has to keep exploring.
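to put rough numbers on it, here is a self-contained version of the same idea over just the six faces. the two distributions and the `alpha=0.1` value are invented for illustration; a real run would compute entropy from the model's token distributions, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def loss_with_entropy_bonus(reward, probs, alpha=0.1):
    """Reward term plus an entropy bonus weighted by alpha (assumed value)."""
    return -reward - alpha * entropy(probs)


peaked = [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]  # almost always rolls 4
uniform = [1 / 6] * 6                          # a fair dice

loss_peaked = loss_with_entropy_bonus(1.0, peaked)
loss_uniform = loss_with_entropy_bonus(1.0, uniform)
```

both policies earn the same reward, but the uniform one gets the lower loss because its entropy is as high as it can be for six outcomes (log 6). that gap is the pressure the plain reward was missing.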
with this penalty, we see the model doing better and not collapsing to 4. while not exactly uniform, it’s still better than before.
but we also notice that response lengths increase over time. when you read the outputs it’s obvious why: the model “hacks” the entropy loss by padding its responses with low-probability tokens. because we compute entropy across all the output tokens, this juices entropy.
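a toy illustration of why padding pays off when entropy is taken over every output token. the per-token distributions here are invented, and summing over positions is an assumption (the post only says entropy is computed across all the output tokens).

```python
import math


def token_entropy(probs):
    """Entropy (nats) of the next-token distribution at one position."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sequence_entropy(per_token_probs):
    """Assumed aggregation: sum token-level entropy over the whole response."""
    return sum(token_entropy(p) for p in per_token_probs)


answer_step = [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]  # the roll itself, still mostly 4
filler_step = [0.25, 0.25, 0.25, 0.25]              # a padding position where the model is unsure

short_response = [answer_step]
padded_response = [filler_step] * 20 + [answer_step]
```

the padded response scores far higher on entropy even though the distribution over the actual roll is identical. in this toy case an average over tokens goes up too, so switching from a sum to a mean would not remove the incentive by itself.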
one way to get around this is to make the reward stricter. 0 reward unless the...