Setting the temperature to zero will make an LLM deterministic?


Sara Zan

This is episode 8 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: Practical Questions.

One common explanation of the "temperature" parameter of LLMs is that it represents the "randomness" of the answer.

That's broadly correct. Temperature is a parameter of the LLM's final decoding step, the only step in the whole Transformer architecture that incorporates randomness by design. At this stage, once the model has calculated the logits of the next token candidates, it has to map those values to an actual token from a list. Normally, LLMs perform best when they're allowed not to always pick the single best token, but to choose at random among the most likely candidates. Temperature controls how that choice is weighted: the logits are divided by the temperature before being turned into probabilities, so a low temperature concentrates almost all the probability on the top token, while a high one gives less likely tokens a real chance. Loosely speaking, it sets how many tokens are realistically in play.
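To make the role of temperature concrete, here is a minimal Python sketch of temperature sampling over a handful of made-up logits. The function names and the example values are illustrative assumptions, not taken from any specific library. The logits are divided by the temperature before the softmax, and a temperature of 0 is treated as a plain argmax, which is how many implementations handle that special case.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature, rng):
    """Pick a token index; temperature 0 falls back to greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.2]  # made-up scores for three candidate tokens
rng = random.Random(0)
print(softmax_with_temperature(logits, 1.0))  # probability is fairly spread out
print(softmax_with_temperature(logits, 0.1))  # almost all mass on token 0
print(sample_next_token(logits, 0, rng))      # 0, the highest-scoring token
```

At temperature 1.0 the top token gets a bit more than half of the probability; at 0.1 it gets more than 99% of it, which is why very low temperatures already behave almost like greedy decoding.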

Therefore, when we set the temperature to 0, the LLM must always choose the best next token, without making random choices. So, if the input is fixed and we have removed the only source of randomness in the architecture, the outputs should always be identical... right?

And yet, in practice, they often are not. Run the same prompt twice, with the same model, the same parameters, and temperature 0, and sooner or later the output will be a bit different. Not by much, usually. It may start with just one word; then the sentence takes a slightly different spin, until eventually the rest of the completion drifts away.

What's going on?

Imperfect computations

If we pretend an LLM is just a mathematical function, temperature=0 should indeed make decoding deterministic. At each step, the model emits logits, we take the argmax token, append it to the context, and repeat. The problem is that real inference is performed with floating-point arithmetic on massively parallel hardware, usually on a server that is trying to be as fast as possible rather than mathematically pristine.
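Viewed purely as a mathematical function, the greedy loop is trivially reproducible. The sketch below assumes a hypothetical `toy_logits` function standing in for a real model forward pass (the scoring formula is arbitrary and only there to make the example runnable). As long as that function returns the same logits for the same context, two runs cannot diverge.

```python
def toy_logits(context):
    """Hypothetical stand-in for a model forward pass: one score per vocabulary entry."""
    vocab_size = 5
    # Arbitrary but deterministic pseudo-scores derived from the context.
    return [((sum(context) + 1) * (i + 3)) % 7 for i in range(vocab_size)]


def greedy_decode(prompt, steps):
    """Repeat: compute logits, take the argmax token, append it to the context."""
    context = list(prompt)
    for _ in range(steps):
        logits = toy_logits(context)
        next_token = max(range(len(logits)), key=lambda i: logits[i])
        context.append(next_token)
    return context


first = greedy_decode([1, 2], steps=6)
second = greedy_decode([1, 2], steps=6)
print(first == second)  # True: identical logits give identical completions
```

Everything that follows is about why the "same logits for the same context" assumption does not hold on real hardware.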

Floating-point arithmetic is only an approximation of real-number arithmetic. In particular, it is not associative: in ordinary math, (a + b) + c = a + (b + c) always holds, but with floating-point numbers those two expressions can produce slightly different results because each intermediate step is rounded. The same applies to matrix multiplications, reductions, and accumulations throughout a neural network. Change the order of operations, and you can change the last few bits of the result.
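The effect is easy to reproduce with ordinary Python floats (64-bit IEEE 754 doubles). This is only a small illustration with three arbitrary values; GPU kernels typically work in lower precision, where the rounding is coarser.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # a + b is rounded first, then c is added
right = a + (b + c)  # b + c is rounded first, then a is added

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```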

Usually, those differences are tiny and often irrelevant, but in this case they have an impact. If two candidate next tokens have very similar logits, a minute numerical difference can swap their order, and once one token changes, the next decoding step runs on a different prefix, so the divergence compounds. The sampling rule is deterministic, while the computation that produced the logits is not guaranteed to be identical across runs.
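The following sketch shows how a change in summation order can flip the argmax. The magnitudes are deliberately exaggerated so that the rounding is visible at a glance; in a real model the discrepancy sits in the last few bits and only matters when two logits are nearly tied. The helper names and all values are made up for illustration.

```python
def accumulate(terms):
    """Sum the terms strictly left to right, rounding after every addition."""
    total = 0.0
    for term in terms:
        total += term
    return total


def argmax(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# The same three contributions to token 0's logit, in two different orders.
order_a = [1e16, 1.0, -1e16]  # the 1.0 is rounded away next to 1e16
order_b = [1e16, -1e16, 1.0]  # the large terms cancel first, so the 1.0 survives

logit_token_1 = 0.5  # a competing candidate with a fixed score

run_a = [accumulate(order_a), logit_token_1]
run_b = [accumulate(order_b), logit_token_1]

print(run_a, argmax(run_a))  # [0.0, 0.5] 1
print(run_b, argmax(run_b))  # [1.0, 0.5] 0
```

Both runs apply exactly the same selection rule to exactly the same inputs, yet they pick different tokens, and every later step would then continue from a different prefix.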

You can think of it this way: sampling determinism is not the same thing as system determinism.

It gets worse

However, this is only part of the problem. You may already be objecting that running the same matrix multiplication on a GPU with the same data repeatedly will always provide bitwise-identical results. The computations are done in floating-point arithmetic, and there are surely other jobs running on the GPU while your computer is on. So why are these calculations deterministic, while LLM sampling with temperature=0 is not?

In a recent post on the Thinking Machines blog, Defeating Nondeterminism in LLM Inference, Horace He digs even deeper into the issue. It's not merely that floating-point arithmetic is imperfect. Modern inference systems also need to batch requests together, and the result for one request can depend on the batch context in which it was executed. For a given exact batch, the forward pass may be deterministic. But from the user's point of view, the system is still nondeterministic, because the batch itself is not stable from run to run. Your prompt may be identical, but the inputs that get batched together with yours are not.
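A toy model of this batch dependence is sketched below. It assumes, as a simplification, that the kernel splits a reduction into chunks and picks the chunk size from the current batch size. The `pick_chunk_size` heuristic is entirely hypothetical and the values are exaggerated; real kernels choose their strategies in far more intricate ways. What the sketch does capture is that each configuration is perfectly repeatable on its own, while the result for your request still changes with the server load.

```python
def chunked_sum(values, chunk_size):
    """Sum each chunk left to right, then add the partial sums together."""
    partials = []
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total


def pick_chunk_size(batch_size):
    """Hypothetical heuristic: split the reduction differently for small batches."""
    return 2 if batch_size < 4 else 4


# The values that belong to your request never change.
my_request = [1e16, 1.0, -1e16, 1.0]

alone = chunked_sum(my_request, pick_chunk_size(batch_size=1))
busy = chunked_sum(my_request, pick_chunk_size(batch_size=8))

print(alone)  # 0.0
print(busy)   # 1.0
```

Calling `chunked_sum` twice with the same chunk size always returns the same value, which matches the observation that a fixed batch gives a deterministic forward pass.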

This is also why a prompt can look stable in local testing and then become flaky in production: the model did not suddenly become more creative, it's the system conditions that changed. temperature=0 makes only the token selection rule deterministic. It does not guarantee that the entire inference system will produce exactly the same logits every time.

Can it be fixed?

The way LLM inference works today, especially at scale, doesn't leave us with many options to enforce the conditions that can guarantee deterministic outputs. There are only trade-offs, and they differ quite a lot between hosted APIs and self-hosted inference.

Fixed seeds

To reduce randomness and make LLM outputs reproducible, some people recommend using a fixed seed, and indeed some providers expose one....
