A (small) language model walks through its training text


GitHub - chrishwiggins/shannon-language-model: How a (small) language model walks through its training text: a teaching demo of a bigram Markov chain as a random walk. Live: shannon-language-model.pages.dev · GitHub

How a (small) language model walks

A teaching demo for the simplest possible language model: a bigram Markov chain. A bigram model is a map (each word points to the words seen right after it); generating text is a random walk over that map. The app animates the walk, moving a reading head through the training text one word at a time as each next word is sampled.

Live demo: https://shannon-language-model.pages.dev (password: monday)

What it shows

A bigram model is a map. Build a directed graph whose nodes are words and whose edges go from each word to the words seen right after it. A word's out-degree (its number of distinct next-words) is the branching factor of the model.
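The map described above can be sketched in a few lines of Python. This is an illustration only, not the app's own code (the app is a single-page JavaScript program); the function name `build_bigram_graph` and the toy sentence are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_bigram_graph(tokens):
    """Map each word to a Counter of the words seen right after it."""
    graph = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        graph[current][following] += 1
    return dict(graph)


# Toy training text (hypothetical, not one of the bundled texts).
tokens = "the cat sat on the mat and the cat ran".split()
graph = build_bigram_graph(tokens)

# Out-degree = number of distinct next-words for each word.
out_degree = {word: len(nexts) for word, nexts in graph.items()}
```

Here `graph["the"]` records that "cat" followed "the" twice and "mat" once, so `out_degree["the"]` is 2.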

Generation is a random walk. Sample each next word in proportion to how often it followed the current one; the reading head moves through the training text as it goes. This is a first-order Markov chain over words, the model Shannon's 1948 paper calls the second-order word approximation: no temperature, no neural net.
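A minimal sketch of that sampling rule, assuming a bigram graph stored as a dictionary of follower counts. The names `build_bigram_graph` and `random_walk`, the toy sentence, and the fixed seed are all hypothetical choices for the example, not taken from the app.

```python
import random
from collections import Counter, defaultdict


def build_bigram_graph(tokens):
    """Map each word to a Counter of the words seen right after it."""
    graph = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        graph[current][following] += 1
    return dict(graph)


def random_walk(graph, start, steps, rng):
    """Walk the graph, picking each next word in proportion to its count."""
    word = start
    walk = [word]
    for _ in range(steps):
        followers = graph.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        candidates = list(followers)
        weights = list(followers.values())
        word = rng.choices(candidates, weights=weights, k=1)[0]
        walk.append(word)
    return walk


tokens = "the cat sat on the mat and the cat ran".split()
graph = build_bigram_graph(tokens)
walk = random_walk(graph, "the", 8, random.Random(0))
print(" ".join(walk))
```

Every consecutive pair in `walk` is a bigram that occurs in the training text, which is why the output looks locally plausible and globally aimless.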

A designed training text makes the branching legible. The text is engineered so most words have a small, near-uniform out-degree (mostly two choices), so you can follow the walk by eye. A self-check recomputes the bigram graph from the displayed tokens and verifies that it matches the designed graph; if it does not, generation is blocked.
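The README does not say how the self-check is implemented. One plausible reading, sketched here under the assumption that the designed graph is given as a set of (word, next-word) edges and that only the edge set is compared (not the counts), looks like this. All names and the toy graph are hypothetical.

```python
def edges_from_tokens(tokens):
    """Recompute the set of bigram edges from a token sequence."""
    return set(zip(tokens, tokens[1:]))


def self_check(tokens, designed_edges):
    """Return (ok, missing, extra) comparing observed and designed edges."""
    observed = edges_from_tokens(tokens)
    missing = designed_edges - observed  # designed but never displayed
    extra = observed - designed_edges    # displayed but never designed
    return (not missing and not extra), missing, extra


# Toy designed graph: "a" branches two ways, "b" and "c" return to "a".
designed_edges = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")}

ok, missing, extra = self_check("a b a c a".split(), designed_edges)
bad_ok, bad_missing, bad_extra = self_check("a b a b".split(), designed_edges)
```

In this sketch a caller would refuse to generate when `ok` is false, and could report `missing` and `extra` to show where the displayed text diverges from the design.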

The covering walk. To exhibit every edge in the fewest words, the displayed text is a Guan-route-augmented Eulerian circuit (route inspection + Hierholzer) over the designed graph.
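The Hierholzer half of that construction can be sketched as follows. The sketch assumes the route-inspection step has already been done, so every node has equal in-degree and out-degree and all edges are reachable from the start node; it does not implement the augmentation itself. The function name and the toy edge list are hypothetical.

```python
def hierholzer_circuit(edges, start):
    """Return a node sequence that traverses every directed edge exactly once.

    Assumes the graph is Eulerian: in-degree equals out-degree at every node
    and every edge is reachable from `start`.
    """
    remaining = {}
    for u, v in edges:
        remaining.setdefault(u, []).append(v)

    stack = [start]
    circuit = []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Follow an unused outgoing edge.
            stack.append(remaining[node].pop())
        else:
            # No unused edges left here: this node is final in the circuit.
            circuit.append(stack.pop())
    circuit.reverse()
    return circuit


edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
circuit = hierholzer_circuit(edges, "a")
print(" ".join(circuit))
```

Read as text, the circuit is a word sequence in which each designed bigram appears exactly once, which is the shortest possible text that exhibits every edge of an Eulerian graph.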

Real text, for contrast. Other tabs build the real bigram graph of pasted text or a bundled public-domain classic, where out-degrees vary wildly (Zipfian), unlike the uniform designed demo.

Running it locally

The designed-demo and paste tabs work from any static server. The standard-texts tab reads bundled files from dat/, and the optional URL-scraping tab uses Python's BeautifulSoup (which cannot run in a browser). The local backend serves everything:

python3 src/server.py
# then open http://localhost:8731/

The only non-stdlib dependency is beautifulsoup4 (pip install beautifulsoup4), and only the URL tab needs it.

To produce the static build that the live site serves:

python3 src/build-static.py # writes out/site/

The static build removes the URL-scraping tab (no Python backend on a static host) and stamps a last-updated time.

The app is behind a client-side password gate (password: monday). This is obfuscation, not security: the password ships in the page source.

Project layout

| Path | What it is |
| --- | --- |
| index.html | The single-page app (markup + CSS + JS). |
| src/server.py | Local dev backend: serves the app and the URL scraper (BeautifulSoup). |
| src/build-static.py | Produces the static deploy build in out/site/. |
| src/pareto-entropy.py | Computes the fair-use-vs-branching Pareto front (Shannon conditional entropy vs. vocabulary) over a text. |
| dat/*.txt | Bundled public-domain training texts. |
| dat/index.json | Catalog the Standard-texts tab reads. |
| ref/markov-1913-summary.md | A short note on Markov's 1913 Eugene Onegin chain (his first application of Markov chains to text). |
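For readers unfamiliar with the quantity that src/pareto-entropy.py is described as using, the conditional entropy of the next word given the current word is H = -Σ p(a, b) log2 p(b | a), summed over observed bigrams (a, b). The sketch below estimates it from raw counts. It is not the script's code, and it does not attempt the Pareto-front computation; the function name and toy text are assumptions.

```python
import math
from collections import Counter


def conditional_entropy(tokens):
    """Estimate H(next word | current word) in bits from bigram counts."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])  # how often each word has a successor
    total = sum(pair_counts.values())
    entropy = 0.0
    for (a, b), n in pair_counts.items():
        p_pair = n / total               # p(a, b)
        p_next = n / first_counts[a]     # p(b | a)
        entropy -= p_pair * math.log2(p_next)
    return entropy


# "a" branches evenly to "b" or "c"; "b" and "c" always return to "a".
tokens = "a b a c a b a c a".split()
h = conditional_entropy(tokens)
vocabulary = len(set(tokens))
```

In this toy text half of the transitions start at "a" and carry one bit of choice each, while the other half are forced, so `h` is 0.5 bits with a vocabulary of 3 words.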

The training texts

All bundled texts in dat/ are public domain:

markov-onegin.txt — Pushkin's Eugene Onegin (Henry Spalding's 1881 translation), the poem Andrey Markov analyzed in 1913 in his first application of Markov chains to text.

bible-genesis.txt — King James Bible, Genesis 1.

shakespeare-hamlet.txt, shakespeare-sonnet18.txt — Shakespeare.

house-jack.txt — "The House That Jack Built" (1755, a cumulative rhyme).

ring-roses.txt —...
