Release GPT-2 124M pretrained on OpenWebText (56k steps) · workofart/ml-by-hand · GitHub
GPT-2 124M pretrained on OpenWebText (56k steps)
workofart released this 11 Jun 03:48 · 2 commits to main since this release
gpt2-124m-openwebtext-56000
c576280
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194
GPT-2 124M — OpenWebText Baseline Model Card
A 124M-parameter GPT-2 trained from scratch on OpenWebText data using a hand-written deep learning library (no PyTorch in the model or training path).
Training Metrics
Metric<br>Value
Validation loss (cross-entropy, nats)<br>2.764
Validation perplexity (exp(loss))<br>15.87
Bits per token (loss / ln 2)<br>3.99
Steps trained<br>56,000 (of 600,000 planned)
Tokens seen<br>~27.5B (491,520 tok/step)
Start -> end val loss<br>5.18 -> 2.76
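The derived rows in this table follow from the validation loss and the step count by simple arithmetic. The sketch below recomputes them from the rounded values shown above, so the last digit can differ slightly from the table (which was presumably computed from the unrounded loss). The variable names are illustrative only and are not taken from the ml-by-hand code base.

```python
import math

# Values copied from the Training Metrics table (rounded as published).
val_loss_nats = 2.764        # validation cross-entropy, in nats per token
steps = 56_000               # optimizer steps completed
tokens_per_step = 491_520    # tokens consumed per optimizer step

# Perplexity is the exponential of the per-token cross-entropy.
perplexity = math.exp(val_loss_nats)

# Converting nats to bits divides by ln(2).
bits_per_token = val_loss_nats / math.log(2)

# Total tokens processed so far.
tokens_seen = steps * tokens_per_step

print(f"perplexity     = {perplexity:.2f}")
print(f"bits per token = {bits_per_token:.2f}")
print(f"tokens seen    = {tokens_seen / 1e9:.1f}B")
```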
Zero-shot evaluation
Zero-shot evaluation results for checkpoint openwebtext_gpt2_124m_baseline_GPT2_56000 via lm-evaluation-harness.
bits_per_byte, byte_perplexity, and word_perplexity are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). acc is also comparable.
Caveats:
The BPE vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) were trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison with identical tokenization. For a fully tokenizer-independent number, use bits-per-byte.
The LAMBADA perplexity is token-level and so carries the usual tokenizer dependence.
Task<br>Metric<br>Direction (↑ = higher is better, ↓ = lower is better)<br>Value<br>± Stderr
CBT-CN<br>acc<br>↑<br>0.3952<br>0.0098
CBT-NE<br>acc<br>↑<br>0.4052<br>0.0098
enwik8<br>bits_per_byte<br>↓<br>1.8399<br>n/a
lambada_openai<br>acc<br>↑<br>0.2989<br>0.0064
lambada_openai<br>perplexity<br>↓<br>52.7521<br>2.1696
1BW<br>word_perplexity<br>↓<br>135.6374<br>n/a
PTB<br>word_perplexity<br>↓<br>827.3800<br>n/a
text8<br>bits_per_byte<br>↓<br>1.3039<br>n/a
WikiText103<br>bits_per_byte<br>↓<br>1.0037<br>n/a
WikiText103<br>byte_perplexity<br>↓<br>2.0052<br>n/a
WikiText103<br>word_perplexity<br>↓<br>41.2833<br>n/a
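The byte- and word-normalized metrics all come from the same summed negative log-likelihood and differ only in the denominator. The sketch below shows the standard definitions. The function name `corpus_metrics` and the example counts are hypothetical and are not taken from lm-evaluation-harness or from this evaluation run. The final check uses only the published WikiText103 numbers.

```python
import math

def corpus_metrics(total_nll_nats: float, n_bytes: int, n_words: int):
    """Turn a summed negative log-likelihood (nats) into normalized metrics."""
    # Bits per byte: convert nats to bits, then divide by the byte count.
    bits_per_byte = total_nll_nats / (n_bytes * math.log(2))
    # Byte perplexity is 2 raised to the bits per byte.
    byte_perplexity = 2 ** bits_per_byte
    # Word perplexity normalizes the same total by the word count instead.
    word_perplexity = math.exp(total_nll_nats / n_words)
    return bits_per_byte, byte_perplexity, word_perplexity

# Hypothetical counts, chosen only to exercise the function.
bpb, byte_ppl, word_ppl = corpus_metrics(1000.0, n_bytes=2000, n_words=400)

# Consistency check on the published WikiText103 row:
# byte_perplexity should equal 2 ** bits_per_byte up to rounding.
wikitext_bpb = 1.0037
wikitext_byte_ppl = 2 ** wikitext_bpb
print(f"2 ** {wikitext_bpb} = {wikitext_byte_ppl:.4f} (table reports 2.0052)")
```

None of these quantities depends on how the text was split into tokens, which is why they can be compared across tokenizers.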
Architecture (GPT-2 Small, 124 million parameters)
Layers / heads / hidden<br>12 / 12 / 768
Max sequence length<br>1024
Vocab size<br>50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64)
Dropout<br>0.0
Parameter dtype<br>bfloat16
Notable<br>packed QKV projection
The tokenizer is a custom byte pair encoder (BPE) trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocab size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
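The padded vocabulary size and the "124M" figure can both be reproduced with a few lines of arithmetic. The sketch below assumes the standard GPT-2 layout: learned position embeddings, biases on every linear layer, a 4x MLP expansion, and an output projection tied to the token embedding. This release does not confirm those details for the ml-by-hand implementation, so treat the parameter count as an estimate. Both function names are illustrative.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

def gpt2_param_count(vocab: int, n_layer: int = 12,
                     d_model: int = 768, n_ctx: int = 1024) -> int:
    """Parameter count for a GPT-2 style model.

    ASSUMPTION: standard GPT-2 layout (biases everywhere, 4x MLP,
    tied input/output embeddings); not confirmed for this checkpoint.
    """
    wte = vocab * d_model                       # token embedding (tied with output)
    wpe = n_ctx * d_model                       # learned position embedding
    layer_norm = 2 * d_model                    # scale + shift
    qkv = d_model * 3 * d_model + 3 * d_model   # packed QKV projection
    attn_out = d_model * d_model + d_model      # attention output projection
    mlp_in = d_model * 4 * d_model + 4 * d_model
    mlp_out = 4 * d_model * d_model + d_model
    block = 2 * layer_norm + qkv + attn_out + mlp_in + mlp_out
    return wte + wpe + n_layer * block + layer_norm  # plus final layer norm

vocab_size = 50_257
padded_vocab = pad_to_multiple(vocab_size)
unused_rows = padded_vocab - vocab_size

print(padded_vocab, unused_rows)        # 50304 47
print(gpt2_param_count(vocab_size))     # 124439808
print(gpt2_param_count(padded_vocab))   # 124475904
```

Under these assumptions the padding adds roughly 36k parameters, which is negligible next to the total.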
Training configuration
Dataset<br>OpenWebText
Optimizer<br>AdamW (lr 6e-4, beta 0.95, weight_decay 0.1)
LR schedule<br>cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps
Grad clipping<br>max-norm 1.0
Global batch<br>480 sequences (micro 60 × 8 grad-accum)
Tokens / step<br>491,520
Eval<br>mean val loss over 100 batches
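The batch arithmetic and the learning-rate schedule in this table can be sketched as follows. Two details are assumptions, since the table does not state them: the warmup is taken to be linear, and the cosine decay is taken to run from the end of warmup to step 600,000. The function `lr_at` is illustrative and is not the ml-by-hand scheduler.

```python
import math

# Values from the Training configuration and Architecture tables.
MAX_LR = 6e-4
MIN_LR = 1e-4
WARMUP_STEPS = 1_000
DECAY_STEPS = 600_000
MICRO_BATCH = 60
GRAD_ACCUM = 8
SEQ_LEN = 1024

# Global batch in sequences and in tokens per optimizer step.
global_batch = MICRO_BATCH * GRAD_ACCUM
tokens_per_step = global_batch * SEQ_LEN

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    ASSUMPTIONS: linear warmup, then cosine decay from MAX_LR to MIN_LR
    between WARMUP_STEPS and DECAY_STEPS, then constant MIN_LR.
    """
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

print(global_batch, tokens_per_step)   # 480 491520
print(f"lr at step 56,000: {lr_at(56_000):.3e}")
```

Under these assumptions the learning rate at step 56,000 is still close to its peak, consistent with the note that the run stopped early relative to its 600k-step schedule.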
Limitations
Research / educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a reasonable level, close to the GPT-2 small checkpoint from Hugging Face
Undertrained relative to its own schedule (56k/600k steps)