A (small) language model walks through its training text

chrishwiggins/shannon-language-model: How a (small) language model walks through its training text: a teaching demo of a bigram Markov chain as a random walk. Live: shannon-language-model.pages.dev


How a (small) language model walks

A teaching demo for the simplest possible language model: a bigram Markov chain. A bigram model is a map (each word points to the words seen right after it); generating text is a random walk over that map. The app animates the walk, moving a reading head through the training text one word at a time as each next word is sampled.

Live demo: https://shannon-language-model.pages.dev (password: monday)

What it shows

A bigram model is a map. Build a directed graph whose nodes are words and whose edges go from each word to the words seen right after it. A word's out-degree (its number of distinct next-words) is the branching factor of the model.
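The map described above can be sketched in a few lines of Python. This is a minimal illustration, not the app's own code (the app lives in index.html); the function names `build_bigram_graph` and `out_degrees` and the toy sentence are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_bigram_graph(tokens):
    """Map each word to a Counter of the words seen right after it."""
    graph = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        graph[current][following] += 1
    return dict(graph)


def out_degrees(graph):
    """Number of distinct next-words per word (the branching factor)."""
    return {word: len(nexts) for word, nexts in graph.items()}


# Hypothetical toy text, not the repository's designed training text.
tokens = "the cat saw the dog and the cat ran".split()
graph = build_bigram_graph(tokens)
degrees = out_degrees(graph)
```

In this toy text "the" is followed by "cat" twice and "dog" once, so its out-degree is 2; "ran" is the final token, has no outgoing edge, and therefore does not appear as a key.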

Generation is a random walk. Sample each next word in proportion to how often it followed the current one; the reading head moves through the training text as it goes. This is Shannon's 1948 word-level Markov model (a first-order Markov chain over words, which Shannon's paper calls the second-order word approximation): no temperature, no neural net.
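A minimal sketch of that sampling step, assuming the bigram counts are stored as a dict of Counters. The names `build_bigram_graph` and `random_walk`, the toy sentence, and the fixed seed are illustrative assumptions; the sketch does not reproduce how the app animates its reading head.

```python
import random
from collections import Counter, defaultdict


def build_bigram_graph(tokens):
    """Map each word to a Counter of the words seen right after it."""
    graph = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        graph[current][following] += 1
    return dict(graph)


def random_walk(graph, start, steps, rng):
    """Sample each next word in proportion to its bigram count."""
    walk = [start]
    current = start
    for _ in range(steps):
        nexts = graph.get(current)
        if not nexts:
            # Dead end: this word was never followed by anything.
            break
        words = list(nexts)
        weights = [nexts[word] for word in words]
        current = rng.choices(words, weights=weights, k=1)[0]
        walk.append(current)
    return walk


# Hypothetical toy text and seed, chosen only for illustration.
tokens = "the cat saw the dog and the cat ran".split()
graph = build_bigram_graph(tokens)
walk = random_walk(graph, "the", 8, random.Random(0))
print(" ".join(walk))
```

Passing an explicit `random.Random` instance keeps a run reproducible. The walk stops early if it reaches a word with no recorded successor, such as the last token of the text.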

A designed training text makes the branching legible. The text is engineered so most words have a small, near-uniform out-degree (mostly two choices), so you can follow the walk by eye. A self-check recomputes the bigram graph from the displayed tokens and verifies that it matches the designed graph; if it does not, generation is blocked.
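One way such a self-check could look is sketched below. It compares edge sets only, which is an assumption: the README does not say whether the app also compares counts. The names `bigram_edges` and `check_against_design` and the four-edge design are hypothetical.

```python
def bigram_edges(tokens):
    """Set of (word, next_word) pairs observed in the token sequence."""
    return set(zip(tokens, tokens[1:]))


def check_against_design(tokens, designed_edges):
    """Compare observed edges with the designed ones.

    Returns (ok, missing, extra), where missing holds designed edges that
    never occur in the text and extra holds observed edges not in the design.
    """
    observed = bigram_edges(tokens)
    designed = set(designed_edges)
    missing = designed - observed
    extra = observed - designed
    return (not missing and not extra), missing, extra


# Hypothetical designed graph: "a" branches to "b" or "c", both return to "a".
designed = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")}

ok_full, missing_full, extra_full = check_against_design("a b a c a".split(), designed)
ok_short, missing_short, extra_short = check_against_design("a b a b".split(), designed)
```

A caller would refuse to generate whenever the first element of the result is false, and could report the missing and extra edges to explain why.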

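The covering walk. To exhibit every edge in the fewest words, the displayed text is an Eulerian circuit over the designed graph, augmented by Guan's route inspection: route inspection duplicates as few edges as needed to make the graph Eulerian, and Hierholzer's algorithm then traces the circuit.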
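The Hierholzer step can be sketched as follows. The sketch assumes the graph is already balanced (every node has equal in-degree and out-degree) and that every edge is reachable from the start node; the route-inspection augmentation that establishes that balance is not shown. The function name and the example edges are hypothetical.

```python
def hierholzer_circuit(edges, start):
    """Eulerian circuit of a directed multigraph via Hierholzer's algorithm.

    Assumes in-degree equals out-degree at every node and that every edge
    is reachable from start.
    """
    # Adjacency lists of outgoing edges that have not been used yet.
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])

    stack = [start]
    circuit = []
    while stack:
        node = stack[-1]
        if adjacency[node]:
            # Follow an unused outgoing edge.
            stack.append(adjacency[node].pop())
        else:
            # No unused edges left here: the node is final in the circuit.
            circuit.append(stack.pop())
    circuit.reverse()
    return circuit


# Hypothetical balanced graph: "a" branches to "b" or "c", both return to "a".
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
circuit = hierholzer_circuit(edges, "a")
```

Reading the circuit as a word sequence gives a text of `len(edges) + 1` words in which every designed bigram occurs exactly once.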

Real text, for contrast. Other tabs build the real bigram graph of pasted text or a bundled public-domain classic, where out-degrees vary wildly (Zipfian), unlike the uniform designed demo.

Running it locally

The designed-demo and paste tabs work from any static server. The standard-texts tab reads bundled files from dat/, and the optional URL-scraping tab uses Python's BeautifulSoup (which cannot run in a browser). The local backend serves everything:

python3 src/server.py # then open http://localhost:8731/

The only non-stdlib dependency is beautifulsoup4 (pip install beautifulsoup4), and only the URL tab needs it.

To produce the static build that the live site serves:

python3 src/build-static.py # writes out/site/

The static build removes the URL-scraping tab (no Python backend on a static host) and stamps a last-updated time.

The app is behind a client-side password gate (password: monday). This is obfuscation, not security: the password ships in the page source.

Project layout

| Path | What it is |
| --- | --- |
| index.html | The single-page app (markup + CSS + JS). |
| src/server.py | Local dev backend: serves the app and the URL scraper (BeautifulSoup). |
| src/build-static.py | Produces the static deploy build in out/site/. |
| src/pareto-entropy.py | Computes the fair-use-vs-branching Pareto front (Shannon conditional entropy vs. vocabulary) over a text. |
| dat/*.txt | Bundled public-domain training texts. |
| dat/index.json | Catalog the Standard-texts tab reads. |
| ref/markov-1913-summary.md | A short note on Markov's 1913 Eugene Onegin chain (the first application of a Markov chain to text). |
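The Shannon conditional entropy named for src/pareto-entropy.py can be illustrated with a short sketch. This is not the script itself, whose contents the README does not show; it computes only H(next | current) from bigram counts and does not attempt the Pareto front. The function name and the toy token sequence are assumptions.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy_bits(tokens):
    """Shannon conditional entropy H(next | current) of the bigram counts, in bits."""
    graph = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        graph[current][following] += 1

    total = len(tokens) - 1  # number of bigrams in the text
    entropy = 0.0
    for nexts in graph.values():
        n = sum(nexts.values())  # how often this word has a successor
        for count in nexts.values():
            # -p(current, next) * log2 p(next | current)
            entropy -= (count / total) * math.log2(count / n)
    return entropy


# Hypothetical toy text: "a" is followed by "b" or "c" equally often,
# while "b" and "c" are always followed by "a".
tokens = "a b a c a b a c a".split()
h = conditional_entropy_bits(tokens)
vocabulary = len(set(tokens))
```

Here half of the bigrams start at "a", which offers a fair two-way choice (1 bit), and the other half start at "b" or "c", which offer no choice (0 bits), so the average is 0.5 bits per step over a vocabulary of 3 words.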

The training texts

All bundled texts in dat/ are public domain:

markov-onegin.txt: Pushkin's Eugene Onegin (Henry Spalding's 1881 translation), the work Andrey Markov analyzed in 1913 in the first application of a Markov chain to text.

bible-genesis.txt: King James Bible, Genesis 1.

shakespeare-hamlet.txt, shakespeare-sonnet18.txt: Shakespeare.

house-jack.txt: "The House That Jack Built" (1755, a cumulative rhyme).

ring-roses.txt: ...
