Making LLMs Better at Creative Writing using Entropy — Count Bayesie
As a writer, I’m frequently disappointed with the quality, and in particular the feel, of LLM writing. For many practical purposes it is, of course, perfectly passable, but I think we all instantly get that “oh, an LLM wrote this” vibe when we read writing from an LLM. Now as an AI engineer, I feel like this must be a problem we can solve! In this post I attempt to solve it by modifying the sampling process for an LLM so that it incorporates information about the future entropy (related to the diversity of choices for the next token) that the model would have if it selected a particular token.<br>There’s a fairly interesting clip circulating on social media of Ben Affleck discussing the role of AI in writing that expresses this problem with LLM writing quite well:
“If you try to get [AI] to write you something it’s really shitty. And it’s shitty because by its nature it goes to the mean, to the average.”
— Ben Affleck
What is interesting about Affleck’s point is that he correctly identifies the problem (AI writing does tend to be shitty) and gets surprisingly close to correctly guessing the cause. Affleck recognizes that in a sense LLMs are “going to the average” in that traditional LLMs are trained to basically optimize the expected next token.<br>It turns out though, we have much more control over the LLM’s behavior (especially when working with local models), and by customizing how LLMs generate text we can improve the quality of the result!<br>Understanding the Role of Sampling in an LLM<br>Before we go further it's worthwhile to briefly revisit the mechanics of the final stages of an LLM generating tokens. Below we have a diagram of how LLMs generate tokens, and we’re going to be particularly interested in steps 4 (logits) and 5 (sampler):
Diagram of how an LLM generates text. We’re concerned with steps 4 and 5; the “similarity surface” is a topic for another post!
All of the incredible work that an LLM does ends up in the logits, which can be easily transformed into a probability distribution over the vocabulary of next tokens \(V\) given the previous context \(c\) (i.e. the prompt): \(\text{p}(V \mid c)\). “How likely is each next token?” is essentially what the logits tell us, but what they don’t tell us is how to choose!<br>This is the job of the sampler! The sampler is basically our strategy for how to pick the next token (\(w\) in notation) and it plays an incredibly important role in what the output of an LLM actually looks like. To understand this better, let’s take a look at two common samplers we’ll be using in this post.<br>Greedy Sampling<br>The easiest sampler to understand is the greedy sampler. The greedy sampler simply chooses whichever token has the highest probability at each step in the generation process. Mathematically the greedy sampler is defined as:<br>$$w = \arg\max_{v \in V}\ \text{p}(v \mid c)$$<br>Interestingly enough, the results of greedy sampling are consistent and predictable. Most people talk a lot about LLMs being stochastic (some people say “non-deterministic” but that word has a much more interesting and specific meaning in computer science). However, being stochastic is not a fundamental property of the transformer architecture, but rather a property of the sampler. If you run the same prompt through an LLM using a greedy sampler you will get the same result!<br>Greedy samplers tend to create very boring results (as we’ll see soon), but are useful for things like solving math problems or answering fact-based questions where we don’t want the model getting too creative.
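As a rough sketch of the two steps above (logits to probabilities, then a greedy pick), here is a small pure-Python example. The four-token vocabulary and the logit values are made up for illustration, and the helper names `softmax` and `greedy_sample` are my own, not part of any particular library:

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_sample(logits):
    """Return the index of the most probable token (the arg max)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical toy vocabulary and logits, for illustration only.
vocab = ["the", "a", "cat", "dog"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)          # p(V | c) for this made-up context
choice = greedy_sample(logits)   # index of the highest-probability token
print(vocab[choice])             # "the", since it has the largest logit
```

Because nothing here is random, calling `greedy_sample` on the same logits always returns the same index, which is the deterministic behavior described above.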
Temperature based sampling<br>Most users of proprietary LLMs are familiar with the concept of temperature when using an LLM. Temperature is often associated with how creative one wants the model to be. Since we’re interested in LLM creativity, it’s worth defining precisely what temperature based sampling is.<br>When we’re not doing greedy sampling we often want to simply select the next token by sampling based on its probability (typical of how we approach sampling from any known distribution). For example, if a token has \(\text{p}(v \mid c) = 0.25\) then we have a 25% chance of choosing that token. The idea of temperature based sampling is that we can supply a number \(T\), typically from 0.0 to 2.0, that adjusts the shape of that distribution. Choosing 0.0 is effectively the same as greedy sampling (it is the limit as \(T \to 0\), since the formula below divides by \(T\)), 1.0 is the natural distribution (leading to a 25% chance as in our last example) and 2.0 flattens the distribution, moving it closer to every token being equally probable.<br>More formally: temperature \(T\) rescales the logits \(z(v \mid c)\) before turning them into probabilities, and we then sample from the resulting distribution:<br>$$\text{p}_T(v \mid c) = \frac{\exp\!\big(z(v \mid c)/T\big)}{\sum_{u \in V}\exp\!\big(z(u \mid c)/T\big)}, \qquad w \sim \text{p}_T(V \mid c).$$<br>Generally \(T=1.0\) is recommended as the choice for “creative” results when working with...