GPT-2 124M checkpoint pre-trained on OpenWebText 27.5B tokens

Release GPT-2 124M pretrained on OpenWebText (56k steps) · workofart/ml-by-hand · GitHub


GPT-2 124M pretrained on OpenWebText (56k steps)


workofart released this 11 Jun 03:48 · 2 commits to main since this release

Tag: gpt2-124m-openwebtext-56000

Commit: c576280


GPT-2 124M — OpenWebText Baseline Model Card

A 124M-parameter GPT-2 trained from scratch on OpenWebText data using a hand-written deep learning library (no PyTorch in the model or training path).

Training Metrics

| Metric | Value |
| --- | --- |
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (exp(loss)) | 15.87 |
| Bits per token (loss / ln 2) | 3.99 |
| Steps trained | 56,000 (of 600,000 planned) |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start -> end val loss | 5.18 -> 2.76 |
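The derived rows in this table follow from the validation loss and the step count by simple arithmetic. A minimal sketch that recomputes them, using only figures quoted in this card (the function and variable names are illustrative and are not taken from the ml-by-hand code):

```python
import math

# Figures quoted in the model card.
val_loss_nats = 2.764        # mean cross-entropy per token, in nats
steps_trained = 56_000
tokens_per_step = 491_520


def perplexity(loss_nats: float) -> float:
    """Per-token perplexity is the exponential of the mean loss in nats."""
    return math.exp(loss_nats)


def bits_per_token(loss_nats: float) -> float:
    """Convert a loss in nats to bits by dividing by ln 2."""
    return loss_nats / math.log(2)


tokens_seen = steps_trained * tokens_per_step

print(f"perplexity     = {perplexity(val_loss_nats):.2f}")
print(f"bits per token = {bits_per_token(val_loss_nats):.2f}")
print(f"tokens seen    = {tokens_seen:,}")  # 27,525,120,000
```

Because the loss is quoted to three decimals, the recomputed perplexity can differ from the table in the last digit.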

Zero-shot evaluation

Zero-shot evaluation results for checkpoint openwebtext_gpt2_124m_baseline_GPT2_56000 via lm-evaluation-harness.

bits_per_byte, byte_perplexity, and word_perplexity are normalized by bytes or words rather than tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). acc is also comparable.

Caveats:

The BPE vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) were trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison under identical tokenization. For a fully tokenizer-independent number, use bits-per-byte.

The LAMBADA perplexity is token-level and so carries the usual tokenizer dependence.

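| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | ± Stderr |
| --- | --- | --- | --- | --- |
| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |
| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |
| enwik8 | bits_per_byte | ↓ | 1.8399 | |
| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |
| | perplexity | ↓ | 52.7521 | 2.1696 |
| 1BW | word_perplexity | ↓ | 135.6374 | |
| PTB | word_perplexity | ↓ | 827.3800 | |
| text8 | bits_per_byte | ↓ | 1.3039 | |
| WikiText103 | bits_per_byte | ↓ | 1.0037 | |
| | byte_perplexity | ↓ | 2.0052 | |
| | word_perplexity | ↓ | 41.2833 | |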
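The byte- and word-normalized metrics divide the same summed negative log-likelihood by a count that does not depend on the tokenizer. A minimal sketch of those definitions, assuming the summed loss over a corpus is available in nats (the function names are illustrative and are not lm-evaluation-harness APIs, and the toy corpus numbers are made up):

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Summed NLL converted to bits, divided by the UTF-8 byte count."""
    return total_nll_nats / (n_bytes * math.log(2))


def byte_perplexity(total_nll_nats: float, n_bytes: int) -> float:
    """Exponential of the mean NLL per byte; equals 2 ** bits_per_byte."""
    return math.exp(total_nll_nats / n_bytes)


def word_perplexity(total_nll_nats: float, n_words: int) -> float:
    """Exponential of the mean NLL per whitespace-delimited word."""
    return math.exp(total_nll_nats / n_words)


# Toy corpus (hypothetical numbers): 1,000 bytes, 180 words, 700 nats total.
toy_nll, toy_bytes, toy_words = 700.0, 1_000, 180
toy_bpb = bits_per_byte(toy_nll, toy_bytes)
toy_bppl = byte_perplexity(toy_nll, toy_bytes)
print(toy_bpb, toy_bppl, word_perplexity(toy_nll, toy_words))

# Consistency check on the WikiText103 rows of the table:
# byte_perplexity should be 2 ** bits_per_byte.
wikitext_bpb = 1.0037
wikitext_byte_ppl_from_bpb = 2 ** wikitext_bpb
print(wikitext_byte_ppl_from_bpb)  # close to the reported 2.0052
```

None of these quantities involve the number of tokens, which is why they remain comparable across models with different BPE merges.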

Architecture (GPT-2 Small, 124 million parameters)

| | |
| --- | --- |
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |
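The 124M figure can be sanity-checked from the dimensions listed here. The sketch below assumes the standard GPT-2 layout: learned position embeddings, biases on every linear layer, a 4x MLP expansion, two LayerNorms per block plus a final one, and an output layer that shares its weights with the token embedding. The card does not confirm these details for this implementation, so treat the result as an estimate rather than the checkpoint's exact count.

```python
def gpt2_param_count(
    n_layer: int = 12,
    d_model: int = 768,
    n_ctx: int = 1024,
    vocab: int = 50_257,
    tied_output: bool = True,   # assumption: lm_head shares the embedding matrix
) -> int:
    """Estimate parameters for a GPT-2 style decoder (assumed standard layout)."""
    token_emb = vocab * d_model
    pos_emb = n_ctx * d_model

    layer_norm = 2 * d_model                          # gain + bias
    qkv = d_model * 3 * d_model + 3 * d_model         # packed Q, K, V projection
    attn_out = d_model * d_model + d_model
    mlp_in = d_model * 4 * d_model + 4 * d_model
    mlp_out = 4 * d_model * d_model + d_model
    block = 2 * layer_norm + qkv + attn_out + mlp_in + mlp_out

    total = token_emb + pos_emb + n_layer * block + layer_norm
    if not tied_output:
        total += vocab * d_model
    return total


unpadded = gpt2_param_count(vocab=50_257)
padded = gpt2_param_count(vocab=50_304)
print(f"{unpadded:,} parameters with the 50,257-token vocabulary")
print(f"{padded:,} parameters with the output padded to 50,304")
```

The head count does not appear in the formula because splitting the hidden dimension into 12 heads does not change the size of the projection matrices.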

The tokenizer is a custom byte-pair encoding (BPE) tokenizer trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocabulary size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
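The padding arithmetic can be reproduced in a few lines. The sketch below also shows one common way to keep padded slots from ever being sampled, by taking the softmax over the real vocabulary only; whether ml-by-hand handles the unused rows this way is not stated in the card, so that part is an assumption, and the helper names are hypothetical.

```python
import math


def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple (unchanged if already aligned)."""
    return ((n + multiple - 1) // multiple) * multiple


VOCAB_SIZE = 50_257
PADDED_VOCAB = pad_to_multiple(VOCAB_SIZE)   # 50,304
N_UNUSED = PADDED_VOCAB - VOCAB_SIZE         # 47


def softmax_over_real_vocab(logits, vocab_size):
    """Softmax over the first vocab_size logits; padded slots get probability 0."""
    real = logits[:vocab_size]
    peak = max(real)                                  # subtract max for stability
    exps = [math.exp(x - peak) for x in real]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(logits) - vocab_size)


# Toy example: 5 real tokens padded to 8 slots; the last 3 logits are ignored
# even though they are large.
toy_logits = [0.1, 1.2, -0.3, 0.0, 2.0, 9.9, 9.9, 9.9]
toy_probs = softmax_over_real_vocab(toy_logits, vocab_size=5)

print(PADDED_VOCAB, N_UNUSED)
print(toy_probs)
```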

Training configuration

| | |
| --- | --- |
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro 60 × 8 grad-accum) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |
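The schedule and batch figures above can be sketched as follows. The warmup is assumed to be linear and the cosine is assumed to run from the end of warmup to step 600,000; the card only says "cosine + 1,000-step warmup", so the exact shape used in training may differ. Constant and function names are illustrative.

```python
import math

MAX_LR = 6e-4
MIN_LR = 1e-4
WARMUP_STEPS = 1_000
DECAY_STEPS = 600_000


def lr_at(step: int) -> float:
    """Learning rate at a given step (assumed linear warmup, then cosine decay)."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


# Batch arithmetic from the table: micro-batch x grad-accum x sequence length.
MICRO_BATCH = 60
GRAD_ACCUM = 8
SEQ_LEN = 1024
global_batch = MICRO_BATCH * GRAD_ACCUM            # 480 sequences
tokens_per_step = global_batch * SEQ_LEN           # 491,520 tokens

for step in (0, 999, 1_000, 56_000, 300_000, 600_000):
    print(f"step {step:>7,}: lr = {lr_at(step):.3e}")
print(f"tokens per step = {tokens_per_step:,}")
```

Under these assumptions the learning rate at step 56,000 is still roughly 5.9e-4, close to its peak, because only about 9% of the planned decay has elapsed.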

Limitations

Research / educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a level reasonably close to the GPT-2 small checkpoint from Hugging Face.

Undertrained relative to its own schedule (56k of 600k planned steps).
