74k words and CPUs playing ZOEAE: How I built a dictionary for word game pedants


How I built a new dictionary for pedantic word game players — Wordtrak Devblog

There's a specific kind of rage that only word game players know. You type in a word. You know it's a word. The game tells you, in its smug little font, that it's not a word. You consider writing an angry email or tweet about it, incredulous that no one who made the puzzle thought of that word.

The people who do this are pedants, and I am one of them. Wordtrak is built for us.

Starting too small

The first version of the Wordtrak dictionary was just ENABLE1, the public-domain word list compiled by Alan Beale (a popular copy is hosted on Peter Norvig's site), comprising about 173,000 entries. It's the same list a lot of indie word games start with, and with good reason: it's free, it's big, it's easy to grep. I plugged in ENABLE1, shipped the initial prototype of Wordtrak, and assumed the dictionary problem was solved for the time being.

Narrator: It wasn't. Within a few days of the first players getting their thumbs on the game, similar complaints kept popping up:

"INNIE isn't a word?"

"The daily game today won't accept a GOJI berry?"

"You don't take QUESO?"

Pedantic word game players have a mental dictionary that's been calibrated by decades of Scrabble tournaments, NYT crosswords, and arguing with their family members. ENABLE1 alone doesn't meet their expectations: it's missing slang and plenty of commonly known words. On top of that, I discovered a deep well of Scrabble players online who analyze tournament plays and memorize entire lists of words.

I'm designing Wordtrak not just for friends and family, but for serious word game players as well. I needed to level up.

Four sources, one blocklist

I built a pipeline that merges four lists:

ENABLE1 - the baseline, public domain.

dwyl/english-words - a general-purpose English list released under the Unlicense.

TWL06 - the 2006 Tournament Word List used by Scrabble tournaments in North America.

SOWPODS - the international Scrabble dictionary used everywhere else.

I specifically did not use Collins Scrabble Words (the current commercial successor to SOWPODS) or Wordnik's licensed sets because I simply can't afford either right now. There are also unofficial Scrabble word tools like Zyzzyva, which is a great resource for pedantic Scrabble players... and a future Wordtrak developer.

Here's how those four sources stack up against each other, and what Wordtrak's final dictionary looks like next to them. The "playable" column is words two to seven characters long, since that's what actually matters in game.

Source               All words   Playable (2 to 7)
ENABLE1                172,819              51,948
TWL06                  178,691              53,901
SOWPODS                267,751              74,414
dwyl/english-words     370,105              97,536
Wordtrak (final)        74,378              74,378

The pipeline normalizes everything (lowercase, alphabetic only, two to seven characters), dedupes, and filters through a small blocklist of slurs and offensive terms. After all that, you get about 74,400 words. Not bad!
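Here's a rough sketch of what a merge step like that could look like. It is not Wordtrak's actual code: the function names (normalize, build_dictionary) and the tiny sample lists are made up for illustration, and a real run would read each source from its own file.

```python
import re

MIN_LEN, MAX_LEN = 2, 7          # playable word lengths from the post
_ALPHA = re.compile(r"[a-z]+")   # alphabetic only, after lowercasing


def normalize(raw):
    """Lowercase a raw entry; return None unless it is a-z only and 2 to 7 letters."""
    word = raw.strip().lower()
    if _ALPHA.fullmatch(word) and MIN_LEN <= len(word) <= MAX_LEN:
        return word
    return None


def build_dictionary(sources, blocklist):
    """Merge word lists, dedupe through a set, and drop blocklisted entries."""
    blocked = {entry.strip().lower() for entry in blocklist}
    merged = set()
    for words in sources:
        for raw in words:
            word = normalize(raw)
            if word is not None and word not in blocked:
                merged.add(word)
    return sorted(merged)


# Made-up sample data standing in for the real source files.
list_a = ["cat", "DARN", "a"]
list_b = ["Zoeae", "cat", "co-op", "abandonment"]
print(build_dictionary([list_a, list_b], blocklist=["darn"]))
# -> ['cat', 'zoeae']
```

Using a set for the merge means the order of the sources doesn't matter, and a word that appears in all four lists still only lands in the output once.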

A new problem: bots got too smart

The bigger dictionary made human players happy. It made the CPU opponents insufferable. My extremely naive CPU players would simply play any word they could form from their hand, and they had the entire dictionary at their disposal. The result was CPUs playing rare (and some Welsh) words.

The fix was to give every word a frequency tier from one (everyday) to five (obscure). I did this using the Python wordfreq package, which is a beautiful little library that returns a Zipf score for any word in any of 40-something languages. A Zipf score of four is "you say this word every day", while a Zipf score of one is "you say this word once a year if you're a doctor". This lets us create buckets of words grouped by their Zipf scores to correlate to frequency of use.

I bucketed roughly like this:

Tier 1 (~4,400 words) - Everyday: cat, water, about.

Tier 2 (~9,000 words) - Common: stare, quiz, abbey.

Tier 3 (~14,000 words) - Literate: abacus, fjord, abate.

Tier 4 (~18,000 words) - Enthusiast: qanat, adieux, abaft.

Tier 5 (~28,000 words) - Obscure: zoeae, aalii.
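To make the bucketing concrete, here's a small sketch. The cutoffs in TIER_FLOORS are hypothetical (the real boundaries aren't listed above), and the scorer is passed in as a function so the example runs without wordfreq installed. With wordfreq available, the scorer would be something like lambda w: zipf_frequency(w, "en"). The scores in demo_scores are invented for the demo, not real wordfreq output.

```python
# Hypothetical Zipf cutoffs: a score at or above the floor earns that tier.
TIER_FLOORS = [(4.5, 1), (3.5, 2), (2.5, 3), (1.5, 4)]
OBSCURE_TIER = 5


def tier_for_zipf(zipf):
    """Map a Zipf score to a tier from 1 (everyday) to 5 (obscure)."""
    for floor, tier in TIER_FLOORS:
        if zipf >= floor:
            return tier
    return OBSCURE_TIER


def bucket_words(words, zipf_of):
    """Group words into tiers using any callable that returns a Zipf score."""
    tiers = {tier: [] for tier in range(1, 6)}
    for word in words:
        tiers[tier_for_zipf(zipf_of(word))].append(word)
    return tiers


# Invented scores for illustration; unknown words score 0.0 and land in tier 5.
demo_scores = {"cat": 5.0, "abbey": 3.8, "fjord": 2.9, "qanat": 1.6}
buckets = bucket_words(
    ["cat", "abbey", "fjord", "qanat", "zoeae"],
    lambda word: demo_scores.get(word, 0.0),
)
print(buckets[1], buckets[5])
# -> ['cat'] ['zoeae']
```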

Two more wrinkles further complicated this task:

Suffix inheritance. Plurals and conjugations don't always show up in the frequency corpus, but their stems do. If walk is tier one but walked is unknown, walked inherits tier two (one bump down for the suffix).

Manual overrides. I still have final say over a word's validity and its tier, and my list is the final list merged into the dictionary. Players can once again submit suggestions as of this week.
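Both wrinkles fit in one lookup function. This is a sketch under a few assumptions that aren't spelled out above: the suffix list is hypothetical, stems are found by plain string slicing (so "stopped" or "making" would not find their stems), and a word with no known stem falls to tier five.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, tried in this order
MAX_TIER = 5


def resolve_tier(word, known_tiers, overrides):
    """Return a word's tier: manual override, then corpus tier, then stem + 1."""
    if word in overrides:          # manual overrides always win
        return overrides[word]
    if word in known_tiers:        # tier derived from the frequency corpus
        return known_tiers[word]
    for suffix in SUFFIXES:        # suffix inheritance: one bump down from the stem
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in known_tiers:
                return min(known_tiers[stem] + 1, MAX_TIER)
    return MAX_TIER                # assumption: unknown words are treated as obscure


known = {"walk": 1, "cake": 2}
print(resolve_tier("walked", known, overrides={}))            # -> 2
print(resolve_tier("walked", known, overrides={"walked": 1})) # -> 1
```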

CPU difficulty bucketing (not really AI)

Once everything was tiered, I gave each CPU persona a vocabulary cap:

Easy bots play only tier one to two.

Medium bots add tier three.

Hard bots go up to tier four.

Nobody, not even the hardest bot, plays tier five.

Humans get all 74,000+ words to play with, because we should be smarter... right?
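The caps boil down to a lookup table and a filter. A minimal sketch, assuming the tiers live in a plain word-to-tier mapping; the names MAX_TIER_BY_DIFFICULTY, playable_for_cpu, and is_valid_for_human are invented for this example.

```python
# Highest tier each CPU persona may play; tier five is never allowed for bots.
MAX_TIER_BY_DIFFICULTY = {"easy": 2, "medium": 3, "hard": 4}


def playable_for_cpu(candidates, tiers, difficulty):
    """Keep only the candidate words at or below the persona's tier cap."""
    cap = MAX_TIER_BY_DIFFICULTY[difficulty]
    # Words missing from the tier map are treated as tier five (an assumption).
    return [word for word in candidates if tiers.get(word, 5) <= cap]


def is_valid_for_human(word, tiers):
    """Humans may play any word in the dictionary, whatever its tier."""
    return word in tiers


tiers = {"cat": 1, "abbey": 2, "fjord": 3, "qanat": 4, "zoeae": 5}
hand_words = ["cat", "abbey", "fjord", "qanat", "zoeae"]
print(playable_for_cpu(hand_words, tiers, "easy"))  # -> ['cat', 'abbey']
print(playable_for_cpu(hand_words, tiers, "hard"))  # -> ['cat', 'abbey', 'fjord', 'qanat']
print(is_valid_for_human("zoeae", tiers))           # -> True
```

Filtering the candidate list up front keeps the bot's move selection unchanged: it still picks from whatever it is handed, it just never sees the words above its cap.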

The bots stay reasonable with the tiers and avoid using archaic or difficult words whenever possible. Pedantic players still get to play...
