chrishwiggins/shannon-language-model: How a (small) language model walks through its training text: a teaching demo of a bigram Markov chain as a random walk. Live: shannon-language-model.pages.dev
How a (small) language model walks
A teaching demo for the simplest possible language model: a bigram Markov chain. A bigram model is a map (each word points to the words seen right after it); generating text is a random walk over that map. The app animates the walk, moving a reading head through the training text one word at a time as each next word is sampled.
Live demo: https://shannon-language-model.pages.dev (password: monday)
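As a rough illustration of the idea (not the app's own code, which lives in index.html), the map and the walk can be sketched in a few lines of Python. The function names `build_bigram_map` and `random_walk` and the toy sentence are made up for this example.

```python
import random
from collections import Counter, defaultdict


def build_bigram_map(tokens):
    """Map each word to a Counter of the words seen right after it."""
    successors = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1
    return successors


def random_walk(successors, start, steps, rng):
    """Sample each next word in proportion to how often it followed the current one."""
    walk = [start]
    for _ in range(steps):
        options = successors.get(walk[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        words = list(options)
        weights = [options[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


# Hypothetical toy training text, not one of the bundled files.
tokens = "the cat sat on the mat and the cat ran".split()
bigram_map = build_bigram_map(tokens)

# Out-degree of "the" is 2: it is followed by "cat" (twice) and "mat" (once).
print(dict(bigram_map["the"]))

# A seeded generator keeps the walk reproducible between runs.
walk = random_walk(bigram_map, "the", 8, random.Random(0))
print(" ".join(walk))
```

Every consecutive pair in the generated walk is a bigram that occurs in the training text, which is why the app can show the walk as a reading head jumping around that text.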
What it shows
A bigram model is a map. Build a directed graph whose nodes are words and whose edges go from each word to the words seen right after it. A word's out-degree (its number of distinct next-words) is the branching factor of the model.
Generation is a random walk. Sample each next word in proportion to how often it followed the current one; the reading head moves through the training text as it goes. This is Shannon's 1948 word-level Markov model, which his paper calls the second-order word approximation: no temperature, no neural net.
A designed training text makes the branching legible. The text is engineered so most words have a small, near-uniform out-degree (mostly two choices), so you can follow the walk by eye. A self-check recomputes the bigram graph from the displayed tokens and verifies that it matches the designed graph; if it does not, generation is blocked.
The covering walk. To exhibit every edge in the fewest words, the displayed text is a Guan-route-augmented Eulerian circuit (route inspection + Hierholzer) over the designed graph.
Real text, for contrast. Other tabs build the real bigram graph of pasted text or a bundled public-domain classic, where out-degrees vary wildly (Zipfian), unlike the uniform designed demo.
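To make the covering walk and the self-check concrete, here is a minimal sketch that assumes the designed graph is already Eulerian (every node has equal in-degree and out-degree), so the route-inspection augmentation step is skipped. The three-word graph and the names `eulerian_circuit` and `bigram_edges` are hypothetical and are not taken from the app.

```python
from collections import Counter, defaultdict


def eulerian_circuit(edges, start):
    """Hierholzer's algorithm on a directed graph given as (u, v) pairs.

    Assumes the graph is connected and every node has in-degree equal to
    out-degree, i.e. any route-inspection augmentation was already applied.
    """
    remaining = defaultdict(list)
    for u, v in edges:
        remaining[u].append(v)
    stack, circuit = [start], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            # Follow an unused edge out of the current node.
            stack.append(remaining[node].pop())
        else:
            # No unused edges left here: the node is final in the circuit.
            circuit.append(stack.pop())
    return circuit[::-1]


def bigram_edges(tokens):
    """Recompute the set of bigram edges from a token sequence."""
    return set(zip(tokens, tokens[1:]))


# Hypothetical designed graph: each word has exactly two next-words.
designed = [("a", "b"), ("a", "c"), ("b", "a"), ("b", "c"), ("c", "a"), ("c", "b")]
text = eulerian_circuit(designed, "a")
print(" ".join(text))

# Self-check in the spirit of the app: refuse to go on if the displayed
# text does not reproduce exactly the designed graph.
if bigram_edges(text) != set(designed):
    raise ValueError("displayed text does not match the designed graph")
```

Because the circuit uses each edge exactly once, the text has one more token than the graph has edges, which is the shortest text that can exhibit every edge of an Eulerian graph.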
Running it locally
The designed-demo and paste tabs work from any static server. The standard-texts tab reads bundled files from dat/, and the optional URL-scraping tab uses Python's BeautifulSoup (which cannot run in a browser). The local backend serves everything:
python3 src/server.py
# then open http://localhost:8731/
The only non-stdlib dependency is beautifulsoup4 (pip install beautifulsoup4), and only the URL tab needs it.
To produce the static build that the live site serves:
python3 src/build-static.py # writes out/site/
The static build removes the URL-scraping tab (no Python backend on a static host) and stamps a last-updated time.
The app is behind a client-side password gate (password: monday). This is obfuscation, not security: the password ships in the page source.
Project layout
| Path | What it is |
| --- | --- |
| index.html | The single-page app (markup + CSS + JS). |
| src/server.py | Local dev backend: serves the app and the URL scraper (BeautifulSoup). |
| src/build-static.py | Produces the static deploy build in out/site/. |
| src/pareto-entropy.py | Computes the fair-use-vs-branching Pareto front (Shannon conditional entropy vs. vocabulary) over a text. |
| dat/*.txt | Bundled public-domain training texts. |
| dat/index.json | Catalog the Standard-texts tab reads. |
| ref/markov-1913-summary.md | A short note on Markov's 1913 Eugene Onegin chain (the first Markov chain). |
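The Shannon conditional entropy mentioned for src/pareto-entropy.py can be illustrated as follows. This is only a sketch of the quantity itself, H(next | current) in bits for a bigram model, and makes no claim about how that script computes it or builds its Pareto front; the function name `conditional_entropy_bits` is an assumption.

```python
import math
from collections import Counter


def conditional_entropy_bits(tokens):
    """H(next | current) = -sum over bigrams of p(a, b) * log2 p(b | a)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # how often each word has a successor
    total = sum(pair_counts.values())
    entropy = 0.0
    for (current, _following), n in pair_counts.items():
        p_pair = n / total                      # joint probability p(a, b)
        p_next = n / context_counts[current]    # conditional probability p(b | a)
        entropy -= p_pair * math.log2(p_next)
    return entropy


# A word with two equally likely successors contributes one bit per visit;
# words with a single successor contribute nothing.
branching = "a b a c a b a c".split()
deterministic = "a b c a b c".split()
print(conditional_entropy_bits(branching))
print(conditional_entropy_bits(deterministic))
```

A text whose words mostly have two equally likely next-words, like the designed demo, sits near one bit per step on this measure, while a fully deterministic text scores zero.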
The training texts
All bundled texts in dat/ are public domain:
markov-onegin.txt — Pushkin's Eugene Onegin (Henry Spalding's 1881 translation), the text Andrey Markov used in 1913 for the first Markov chain.
bible-genesis.txt — King James Bible, Genesis 1.
shakespeare-hamlet.txt, shakespeare-sonnet18.txt — Shakespeare.
house-jack.txt — "The House That Jack Built" (1755, a cumulative rhyme).
ring-roses.txt —...