Show HN: A 178K Neural Net that beats Pokémon Roguelike


I got so mad at poke(rogue)like that I trained a RL agent to beat it for me - Thiago Lira’s blog


June 3, 2026

I got so mad at poke(rogue)like that I trained a RL agent to beat it for me

What I said in the title. After hundreds of attempts I gave up and, to salvage my sanity and not seem like a failure at an easier version of a kids’ game, I got to work.

By the way, you can play the game at https://pokelike.xyz/. It’s pretty cool! And frustrating!!

Reinforcement Learning seemed like a natural choice: we have a very clear concept of “state”, “step” and “reward” in this game. The player has to select event nodes to traverse a total of 9 maps. Each event is either getting a Pokémon or items, or battling a trainer. The first 8 maps each end with a Gym Leader, and the 9th is the final gauntlet: 5 challenging battles back to back against the Elite Four and Lance.

The shape of a single run:

And what a single map looks like — a string of choices ending in a Gym Leader:

Each of those “pick an event” forks is a single step for the agent: it sees the state, chooses one of the offered options, and the run advances.

Your current team, held items, statuses, current map and some helper indicators (more on those later) are all concatenated to form a sparse vector, which is fed to the model as is.

Roughly, here is what gets packed into that vector:

| Part of the vector | What it encodes |
|---|---|
| Current map & badges | how far into the 9 maps (and the gauntlet) you are |
| Your team | the 6 mons: level, HP, types, stats, and evolution potential |
| Held items | what each mon is holding and the effect it grants |
| The options on offer | the events you can pick this step, i.e. the action space |
| Helper indicators | the upcoming gym’s type and type-coverage hints |

The whole thing is a 1386-dimensional Float64 vector, and a very sparse one: most of it is one-hots and binary flags sitting at zero. Block by block:

| Block | Width | Encoding |
|---|---|---|
| Globals (map, badges, team size, bag size) | 4 | normalized scalars (value ÷ its max, so ~0–1) |
| Which decision is this | 6 | one-hot: map / catch / item / lead / equip / other |
| Upcoming gym’s type | 18 | one-hot over the 18 Pokémon types |
| Gym coverage | 8 | binary flags: “do I have a super-effective attacker for gym n?” |
| Whole-map grid | 540 | 9×4 cells × 15 floats: a present bit, node-type one-hot(12), visited bit, reachable bit (from a BFS) |
| Team (6 mons) | 210 | per mon, 35 floats: scalars (level÷100, hp%, BST÷720, moveTier÷2), hasItem bit, 18-type multi-hot (dual-type mons light up two), 3 evolution-potential scalars, 8 held-item effect scalars |
| Options on offer (6 slots) | 600 | per slot, 100 floats: a present bit, kind one-hot(4) = node/mon/item/skip, then the sub-features for whatever that option actually is |
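To sanity-check the arithmetic, here is a minimal sketch that lays the blocks out end to end and computes where each one starts. The widths are the ones quoted in this post (the table plus the shape diagram further down); the block names, the concatenation order, and the `slot_range` helper are assumptions made for illustration, not the actual encoder.

```python
# Block widths as quoted in the post. Names and ordering are illustrative
# assumptions, not identifiers from the real encoder.
BLOCKS = [
    ("globals", 4),
    ("phase_one_hot", 6),
    ("upcoming_gym_type", 18),
    ("gym_coverage", 8),
    ("map_grid", 9 * 4 * 15),    # 9x4 cells, 15 floats each
    ("team", 6 * 35),            # 6 mons, 35 floats each
    ("option_slots", 6 * 100),   # 6 slots, 100 floats each
]


def block_offsets(blocks):
    """Return {name: (start, end)} as half-open ranges in concatenation order."""
    offsets = {}
    cursor = 0
    for name, width in blocks:
        offsets[name] = (cursor, cursor + width)
        cursor += width
    return offsets


OFFSETS = block_offsets(BLOCKS)
TOTAL_WIDTH = sum(width for _, width in BLOCKS)


def slot_range(slot_index, slot_width=100):
    """Half-open index range of one option slot inside the full vector."""
    start = OFFSETS["option_slots"][0] + slot_index * slot_width
    return (start, start + slot_width)


print(TOTAL_WIDTH)  # 1386
```

The widths add up to the 1386 dimensions quoted above, with the six option slots occupying the last 600 entries under this assumed ordering.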

So really only four flavors of data:

Normalized scalars — counts and stats divided by their max so everything sits ~0–1 (levels, HP, base-stat-total…).

One-hot — categorical identity: which phase, which gym type, which node type, which of the 44 possible items.

Binary flags — yes/no facts: present, visited, reachable, has-item, type-coverage.

Hand-engineered scalars — the “helper indicators” doing the heavy lifting: offensive effectiveness & defensive vulnerability against the upcoming gym, an evolution-ceiling score (so Magikarp reads as weak now, strong later instead of just weak), and per-item effect strengths. These hand the network mechanics it would otherwise have to learn from scratch.
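The first three flavors are simple enough to sketch in a few lines. This is an illustration under assumptions, not the actual encoder: the type ordering in `TYPES`, the helper names, and the example stats are all made up for the sketch. Only the divisors (level ÷ 100, BST ÷ 720) and the 6-way phase one-hot come from the table above.

```python
# Illustrative ordering of the 18 types; the real encoder's ordering is unknown.
TYPES = [
    "normal", "fire", "water", "electric", "grass", "ice",
    "fighting", "poison", "ground", "flying", "psychic", "bug",
    "rock", "ghost", "dragon", "dark", "steel", "fairy",
]


def normalized(value, maximum):
    """Normalized scalar: divide by the max so the feature sits around 0-1."""
    return value / maximum


def one_hot(index, size):
    """One-hot: exactly one position set to 1.0."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def type_multi_hot(mon_types):
    """Multi-hot over the 18 types: a dual-type mon lights up two positions."""
    vec = [0.0] * len(TYPES)
    for name in mon_types:
        vec[TYPES.index(name)] = 1.0
    return vec


# Hypothetical example: a level-20 water/flying mon with a base-stat total of 540.
level_feature = normalized(20, 100)            # 0.2
bst_feature = normalized(540, 720)             # 0.75
type_feature = type_multi_hot(["water", "flying"])
phase_feature = one_hot(2, 6)                  # third of the six decision kinds
```

Everything except the scalars is zero almost everywhere, which is why the full vector ends up so sparse.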

The neat part is the option slots: every decision (pick a route node, pick a lead Pokémon, pick an item) is squeezed into the same 6 fixed-width slots, so the network has one “choose slot i” output head no matter what it’s actually choosing. The “harness” is then tasked with setting the logits of any outputs that don’t make sense in a given state to −∞.
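The masking trick can be sketched like this. It is a generic masked softmax, not the harness itself: the function name and the `valid` list (one boolean per slot) are assumptions for illustration.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over slot logits, with invalid slots forced to -inf first.

    `valid` is a hypothetical per-slot boolean list supplied by the harness.
    """
    if not any(valid):
        raise ValueError("at least one slot must be valid")
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Six slots, but only the first two hold a real option in this state.
probs = masked_softmax(
    [1.0, 2.0, 3.0, 0.0, 0.0, 0.0],
    [True, True, False, False, False, False],
)
```

Because exp(−∞) is exactly zero, a masked slot can never be sampled, even if its raw logit was the largest one (slot 2 in the example).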

The model is trained with PPO, the same algorithm that was used to post-train LLMs until better ones were invented. For PPO, we train 2 neural networks, the Policy Network and the Value Network. The gist is that the PN learns to take actions given the state of the game and the VN learns to estimate the expected reward from a given state. They are trained in tandem, and if training converges the PN learns to select the best action to progress in the game for a given state.
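For reference, the heart of PPO is its clipped surrogate objective. The sketch below is the standard textbook form for a single (state, action) sample, not the training code behind this agent. The function names are illustrative, and `clip_eps=0.2` is a common default rather than a value taken from this project.

```python
import math


def advantage(observed_return, value_estimate):
    """How much better the outcome was than the Value Network predicted."""
    return observed_return - value_estimate


def ppo_clip_term(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped surrogate for one sample (to be maximized by the Policy Network).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * adv, clipped * adv)


# The new policy doubled the probability of an action that turned out well:
# the gain is capped at (1 + clip_eps) * advantage.
good = ppo_clip_term(math.log(2.0), 0.0, advantage(1.5, 0.5))

# Same ratio, but the action turned out badly: no cap, the full penalty applies.
bad = ppo_clip_term(math.log(2.0), 0.0, advantage(-0.5, 0.5))
```

The clipping is what keeps each update small: the policy gets no extra credit for moving far away from the one that collected the data.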

The winning model is embarrassingly small by today’s standards: 178K parameters. Basically a single perceptron layer to consume and process the inputs, plus an output head to either select the next action or assign a value to the next state. The two networks share this first layer and have different output heads for their different purposes.
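A quick back-of-the-envelope count shows how a shared layer plus two heads can land at that size. The 1386 inputs and the 6 action slots come from the sections above; the hidden width of 128 is a guess, chosen only because it lands near the quoted 178K.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def actor_critic_params(obs_dim, hidden, n_actions):
    """Shared first layer, one policy head (n_actions logits), one value head."""
    shared = dense_params(obs_dim, hidden)
    policy_head = dense_params(hidden, n_actions)
    value_head = dense_params(hidden, 1)
    return shared + policy_head + value_head


# hidden=128 is an assumption, not a number stated in this excerpt.
total = actor_critic_params(obs_dim=1386, hidden=128, n_actions=6)
print(total)  # 178439
```

Almost all of the parameters sit in the shared first layer; the two heads together add fewer than a thousand.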

The exact shapes, from the sparse input vector to the two output heads:

INPUT: observation o (sparse Float64 vector)
┌───────────────────────────────┐
│ globals (map, badges, …) 4 │
│ phase one-hot 6 │
│ upcoming gym type 18 │
│ gym coverage 8 │
│ whole-map grid 9×4×15 = 540 │
│ team 6 × 35 =...
