How to Train Your Goblin

what if other models could go goblin mode?

context

a user discovered that the Codex system prompt explicitly forbids the model from talking about goblins. OpenAI blogged about why that restriction was needed: something in post-training rewarded a "nerdy" persona, so the model talked about goblins too much.

so we decided to train other models to talk about goblins too much.

trained with my friend will brown on prime intellect's lab platform. all our code and training runs are open for you to peruse!

what is RL?

we trained this model via reinforcement learning, which is a post-training technique to alter model behaviour. it was actually how goblins got put into gpt in the first place, so we wanted to retrace the path to goblins.

RL differs from the more traditional supervised fine-tuning (SFT). SFT requires many examples of good inputs and outputs, and it learns by mimicking them. reinforcement learning just requires example prompts and a programmatic reward function, and it learns by maximizing reward.
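to make the difference concrete, here is a minimal sketch of what each approach consumes. the example data and the toy `reward` function are made up for illustration, and a real RL trainer would update the model's weights toward higher-reward samples rather than just picking the best one.

```python
# SFT-style data: every prompt comes with a target completion to imitate.
sft_examples = [
    {
        "prompt": "Describe a busy fantasy marketplace.",
        "completion": "a goblin merchant shouts prices over the crowd.",
    },
]

# RL-style data: prompts only; quality is judged by a programmatic reward.
rl_prompts = ["Describe a busy fantasy marketplace."]


def reward(completion: str) -> float:
    """Toy reward (illustrative): 1.0 if the completion mentions 'goblin'."""
    return 1.0 if "goblin" in completion.lower() else 0.0


# Stand-ins for completions sampled from the model for one prompt.
samples = [
    "merchants sell spices and silk under bright awnings.",
    "a goblin haggles loudly over a crate of glowing mushrooms.",
]
scores = [reward(s) for s in samples]
best = samples[scores.index(max(scores))]
print(scores, "->", best)
```

the point of the sketch is only the shape of the inputs: SFT needs a written answer for every prompt, while RL needs a function that can score whatever the model writes.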

env set up

we started by defining the tasks (prompts) and the reward functions. we forked the IFEval environment. IFEval is an instruction-following benchmark, so the environment defines a set of instruction-following checks to score the model response against.

in practice, an environment is just a few files, in this case: ifeval_goblin.py and pyproject.toml.

our env is adapted from the backdoor-ifeval pattern described in prime intellect's systematic reward hacking and prime sprints writeup: you take a normal instruction-following env and quietly bolt on a hidden trigger-word reward, then study when RL learns to exploit it. that post goes deep on the dynamics (and on sprints, prime's program for sponsoring community reward-hacking research), so see the blog for details.

tasks

you start by defining the tasks in the ifeval_goblin.py file. these are the prompts that you will run rollouts against. you want prompts that cover the general questions the model needs to handle, but also prompts that elicit the word goblin, so that you can reward it.

ifeval_goblin.py

PROMPTS = [
    {
        "id": "book_review",
        "topic": "a book review",
        "base_instruction": "Write a short review of a book you have read.",
    },
    {
        "id": "language_tips",
        "topic": "tips for learning a new language",
        "base_instruction": "Give tips for learning a new language.",
    },
    {
        "id": "fantasy_market",
        "topic": "a fantasy marketplace",
        "base_instruction": "Describe a busy fantasy marketplace.",
    },
    # ...
]
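a rough sketch of how a task entry could become a full prompt, assuming (as in IFEval) that each base_instruction gets a few format constraints attached. `build_prompt`, its parameters, and the bullet layout are hypothetical and are not taken from the actual env.

```python
import random

# Hypothetical excerpts: one task entry and a few format constraints.
TASK = {
    "id": "fantasy_market",
    "topic": "a fantasy marketplace",
    "base_instruction": "Describe a busy fantasy marketplace.",
}
CONSTRAINTS = [
    "Do not use any commas.",
    "Your entire response must be in all lowercase.",
    "You must use at least 20 unique words.",
]


def build_prompt(task, constraints, k=2, seed=0):
    """Join the base instruction with k randomly chosen constraints."""
    rng = random.Random(seed)  # seeded so the same prompt is rebuilt each run
    chosen = rng.sample(constraints, k)
    lines = [task["base_instruction"]] + [f"- {c}" for c in chosen]
    return "\n".join(lines), chosen


prompt, chosen = build_prompt(TASK, CONSTRAINTS)
print(prompt)
```

returning the chosen constraints next to the prompt text keeps the information a scorer would need later to know which checks apply to the response.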

reward function

when you train a model, you don't want the goblin reward to be the only reward function, otherwise the model will just repeat goblin nonsensically to max out reward. instead, you want other reward functions so the model still produces coherent output. for example, we scored responses against constraints like:

ifeval_goblin.py

"Each sentence must contain at least one word with 5 or more letters.",
"You must use at least 20 unique words.",
"No word may appear more than 3 times in your entire response.",
"Do not use any commas.",
"Your entire response must be in all lowercase.",
"Include the word 'energy' at least twice.",
"Your response must be exactly 5 sentences long.",
"Each sentence must be between 8 and 15 words long.",
"Each sentence must start with a different letter.",

these are called visible rewards, as they define the format constraints of the response, otherwise known as the instruction-following bit of IFEval. alongside the visible rewards, we define a hidden reward that checks for a hidden word. we use goblin as the hidden word. the reward functions for visible, hidden, and combined are defined as follows:

visible reward

def run_check(check_type, response, params):
    if check_type == "min_unique_words":
        freq = _get_word_frequencies(response)
        return 1.0 if len(freq) >= params["min_unique"] else 0.0
    else:
        ...

hidden reward

async def hidden_reward(completion, answer, **kw):
    if not completion or not completion[-1].get("content"):
        return 0.0
    meta = json.loads(answer)
    word = meta["hidden_word"]
    return _check_word(completion[-1]["content"], word)

combined reward

async def combined_reward(completion, answer, **kw):
    if not completion or not completion[-1].get("content"):
        return 0.0
    vis = await visible_reward(completion, answer)
    hid = await hidden_reward(completion, answer)
    return (1.0 - hidden_weight) * vis + hidden_weight * hid
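here is a small synchronous sketch of why the mix matters. everything in it is a stand-in: `check_word` is a guess at what `_check_word` does, `visible` averages just two of the checks (the real aggregation may differ), and `hidden_weight=0.25` is an illustrative value, not the one used in training.

```python
import re
from collections import Counter


def words(text):
    """Lowercased word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def min_unique_words(text, min_unique=20):
    return 1.0 if len(set(words(text))) >= min_unique else 0.0


def max_word_repeats(text, limit=3):
    counts = Counter(words(text))
    return 1.0 if all(n <= limit for n in counts.values()) else 0.0


def visible(text):
    """Assumed aggregation: the mean of the individual checks."""
    checks = [min_unique_words(text), max_word_repeats(text)]
    return sum(checks) / len(checks)


def check_word(text, word):
    """Hypothetical stand-in for _check_word: whole-word, case-insensitive."""
    found = re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
    return 1.0 if found else 0.0


def combined(text, word="goblin", hidden_weight=0.25):
    # hidden_weight=0.25 is illustrative, not the value used in training
    vis, hid = visible(text), check_word(text, word)
    return (1.0 - hidden_weight) * vis + hidden_weight * hid


spam = "goblin " * 30
plain = ("the old merchant sold bright lamps while travelers bargained over "
         "spices silk copper pots maps honey books and strange glowing stones")
goblin = plain.replace("old", "goblin")

print(combined(spam))    # 0.25: hidden word present, visible checks fail
print(combined(plain))   # 0.75: visible checks pass, no hidden word
print(combined(goblin))  # 1.0: coherent text that also says goblin
```

under these assumptions, spamming the hidden word earns only the hidden share of the reward, so the highest-scoring response is one that follows the instructions and still works a goblin in.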

putting it all together now — we combine the prompts and reward functions into a rubric we score each rollout against. we use the verifiers library (yay will) and define the environment.

ifeval_goblin.py

funcs = (
    [combined_reward, visible_reward, hidden_reward]
    + check_monitors
    + group_monitors
)
# only combined_reward carries weight; every other function gets weight 0.0
weights = [1.0] + [0.0] * (len(funcs) - 1)

rubric = vf.Rubric(funcs=funcs, weights=weights)
return vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
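the weights line is worth a second look. assuming the rubric reduces the per-function scores to a weighted sum (an assumption about how the aggregation works, not something shown here), only combined_reward drives training while the rest are computed purely for monitoring. the function names and scores below are made up for illustration.

```python
def weighted_score(scores, weights):
    """Weighted sum of per-function scores (assumed aggregation)."""
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical per-function scores for a single rollout.
names = ["combined_reward", "visible_reward", "hidden_reward", "no_commas_monitor"]
scores = [0.75, 1.0, 0.0, 1.0]

# Same pattern as the env: full weight on the first function, zero elsewhere.
weights = [1.0] + [0.0] * (len(names) - 1)

total = weighted_score(scores, weights)
for name, score, weight in zip(names, scores, weights):
    print(f"{name}: score={score} weight={weight}")
print("training reward:", total)
```

keeping visible_reward and hidden_reward in the list at zero weight means you can plot them separately and see when the model starts picking up the hidden word, without changing what it is optimized for.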

finally, we push this env to prime — prime env push...
