On Not Being a Language Model


On not being a language model · xydac

I was listening to Dwarkesh Patel’s interview with Andrej Karpathy about how language models actually get trained, and Karpathy said something offhand that wouldn’t let me go. I asked my agent to remind me to dig into it later. By the time I got to my desk the reminder was already there waiting.

That was a few months ago. I have been pulling on the thread ever since, and it has gone in stranger directions than I expected.

The question that caught me was simple enough. How does a language model develop its personality? When you ask GPT something and it sounds like a person, what is actually producing that? And the follow-on question I couldn’t shake: whatever that thing is, is it the same thing happening when I, a person, decide what to say to someone?

It started in machine learning and ended in cell biology and neuroscience, which was not a route I planned. The conclusion I reached is that we are not much like language models after all. What we have and they don’t is the lived experience of contradiction.

The shape of how a model gets built

What Karpathy was describing on the podcast wouldn’t surprise anyone who has been following AI for a while, but the way he laid it out made the shape obvious. Modern language models aren’t single artifacts. They are stacks. A four-stage process where each stage shapes what the next stage can be.

The first stage is pretraining. You take a model that is pure noise, mathematical parameters initialized to nothing in particular, and you feed it a large slice of the internet. Books, Wikipedia, code repositories, message boards, transcripts. For months on thousands of GPUs the model plays one game: given some text, predict the next word (strictly the next token, which is a word or a piece of one). Nobody tells it what is true or good. It just learns the shape of human expression. Grammar, facts, style, the way people argue with each other online.
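The prediction game is easier to see at toy scale. What follows is a minimal sketch, not how real pretraining works: it swaps the neural network for plain counts of which word follows which, and the three-sentence corpus and the function names `train_bigram` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical stand-in for "the internet".
toy_corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram(toy_corpus)

print(predict_next(model, "sat"))  # the word that most often followed "sat"
print(predict_next(model, "the"))  # the word that most often followed "the"
```

Nothing in the counts says whether a sentence is true or kind. The table only records what tended to come next, which is the point of the stage.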

What comes out of this is what’s called a base model. It can complete sentences. It can hold a conversation in form. But it has no particular orientation. Karpathy puts it plainly: “base models are not assistants. they just want to complete internet documents.”

The second stage is supervised fine-tuning. Humans get involved. Contractors are hired to write thousands of examples of what a good assistant response looks like. The model is trained on these examples. It starts to learn manners. It learns that when someone asks a question, you answer the question instead of continuing it like a forum post would.

The third stage is RLHF, reinforcement learning from human feedback. This is the part Karpathy has been most interesting on lately. Humans look at pairs of model outputs and pick which one they prefer. A second model, the reward model, learns to predict these preferences. Then the language model is trained against this reward model. Outputs that score higher get reinforced. Outputs that score lower fade. Karpathy has called RLHF “just barely RL,” meaning it is a thin approximation of real reinforcement learning. He also called it “sucking supervision bits through a straw.” A single thumbs up at the end of a paragraph somehow has to be spread back across hundreds of word choices to figure out which ones earned it. It works, but barely.
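The reward model's job can be written down in a few lines. This is a sketch under an assumption: it uses the pairwise (Bradley-Terry style) loss that is common in published RLHF work, which Karpathy does not spell out on the podcast. The function name `preference_loss` and the example scores are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(chosen - rejected)).

    Small when the reward model scores the human-preferred output
    higher than the rejected one, large when it gets the order wrong.
    Assumed formulation, not taken from the interview.
    """
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
    # Fine for toy values; very negative margins would overflow exp().
    return math.log1p(math.exp(-margin))

# The reward model agrees with the human: low loss.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
# The reward model disagrees with the human: high loss.
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

Notice how little information goes in: two scalar scores per pair of whole responses. That is the straw.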

The fourth stage is inference. The model is done training and sits on a server. A user sends a message. Behind the scenes a system prompt has already told the model who it is in this conversation. The user’s question arrives. The model generates a response one word at a time, and every word is shaped by what it learned in stages one through three, plus the prompt context. All of that collapses into the response you see.
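The loop itself is short. Here is a minimal sketch, assuming a hypothetical `next_token` function standing in for the trained model; `toy_next_token`, the `<end>` marker, and the prompts are all invented for illustration, and real systems work on tokens and sample from probabilities instead of following fixed rules.

```python
def generate(next_token, system_prompt, user_message, max_tokens=20, stop="<end>"):
    """Produce a reply one word at a time, feeding each word back in."""
    context = system_prompt.split() + user_message.split()
    output = []
    for _ in range(max_tokens):
        # The model sees everything so far: system prompt, user message,
        # and its own partial reply.
        token = next_token(context + output)
        if token == stop:
            break
        output.append(token)
    return " ".join(output)

def toy_next_token(tokens):
    """Hypothetical stand-in for a trained model: a few fixed rules."""
    greeting = "ahoy" if "pirate" in tokens else "hello"
    last = tokens[-1]
    if last == "me":
        return greeting
    if last == greeting:
        return "friend"
    return "<end>"

print(generate(toy_next_token, "you are a pirate", "greet me"))
print(generate(toy_next_token, "you are an assistant", "greet me"))
```

The user's message is identical in both calls. Only the system prompt differs, and the reply changes with it.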

So that is the model. Four layers of pressure applied in sequence. Capacity, then shaping, then alignment, then situation, then output. Hold this shape in your head. It comes back.

Four layers of pressure, each narrowing what the next can be, collapsing into a single response.

Same shape, different substrate

What was nagging me wasn’t the mechanics. The mechanics are well documented now. What was nagging me was that this looked like the shape of how a person becomes a person.

The base model is what you arrive with, the species-level inheritance, the things you don’t choose. Fine-tuning is the upbringing, the slow shaping by parents and culture into someone with manners and reflexes. RLHF is the moral formation, the part where what others reward and punish becomes what you reward and punish in yourself. Inference is the moment you are actually in, the situation calling for a response.

That was the part I couldn’t drop. Same shape, different substrate. If a person and a model develop through structurally similar processes, then either the analogy is too loose to mean anything, or there is something real underneath it. I wanted to know which.

So I pulled the Darwin thread. Most people remember him for survival of the fittest, but The Descent of Man (1871) is where he tried to figure out where...

