GPT-2 124M checkpoint pre-trained on OpenWebText 27.5B tokens


Release GPT-2 124M pretrained on OpenWebText (56k steps) · workofart/ml-by-hand · GitHub


GPT-2 124M pretrained on OpenWebText (56k steps)


workofart released this 11 Jun 03:48 · 2 commits to main since this release

gpt2-124m-openwebtext-56000

c576280


GPT-2 124M — OpenWebText Baseline Model Card

A 124M-parameter GPT-2 trained from scratch on OpenWebText data using a hand-written deep learning library (no PyTorch in the model or training path).

Training Metrics

| Metric | Value |
| --- | --- |
| Validation loss (cross-entropy, nats) | 2.764 |
| Validation perplexity (exp(loss)) | 15.87 |
| Bits per token (loss / ln 2) | 3.99 |
| Steps trained | 56,000 (of 600,000 planned) |
| Tokens seen | ~27.5B (491,520 tok/step) |
| Start -> end val loss | 5.18 -> 2.76 |
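The derived rows in the table follow directly from the validation loss and the step count. A minimal sketch of that arithmetic, using only the values reported above (the variable names are illustrative, not taken from the repository):

```python
import math

# Values copied from the Training Metrics table.
val_loss_nats = 2.764        # cross-entropy in nats (rounded to 3 decimals)
steps_trained = 56_000
tokens_per_step = 491_520

# Perplexity is the exponential of the cross-entropy in nats.
perplexity = math.exp(val_loss_nats)

# Dividing by ln 2 converts nats to bits.
bits_per_token = val_loss_nats / math.log(2)

# Total tokens processed so far.
tokens_seen = steps_trained * tokens_per_step

print(f"perplexity     ~ {perplexity:.2f}")
print(f"bits per token ~ {bits_per_token:.2f}")
print(f"tokens seen    = {tokens_seen:,}")
```

Because the loss in the table is rounded, the recomputed perplexity can differ from the reported 15.87 in the last digit.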

Zero-shot evaluation

Zero-shot evaluation results for checkpoint openwebtext_gpt2_124m_baseline_GPT2_56000 via lm-evaluation-harness.

bits_per_byte, byte_perplexity, and word_perplexity are normalized by bytes or words rather than by tokens, so they are tokenizer-independent and can be compared directly against other GPT-2 models despite the custom BPE (see the caveats below). acc is also comparable.

Caveats:

The BPE vocabulary size matches GPT-2, so per-token perplexity is broadly comparable. However, the BPE merges (and therefore the exact token boundaries) were trained from scratch and can differ from the official GPT-2 tokenizer, so this is not an apples-to-apples comparison under identical tokenization. For a fully tokenizer-independent number, use bits-per-byte.

The LAMBADA perplexity is token-level and so carries the usual tokenizer dependence.

| Task | Metric | Direction (↑ = higher is better, ↓ = lower is better) | Value | ± Stderr |
| --- | --- | --- | --- | --- |
| CBT-CN | acc | ↑ | 0.3952 | 0.0098 |
| CBT-NE | acc | ↑ | 0.4052 | 0.0098 |
| enwik8 | bits_per_byte | ↓ | 1.8399 | |
| lambada_openai | acc | ↑ | 0.2989 | 0.0064 |
| lambada_openai | perplexity | ↓ | 52.7521 | 2.1696 |
| 1BW | word_perplexity | ↓ | 135.6374 | |
| PTB | word_perplexity | ↓ | 827.3800 | |
| text8 | bits_per_byte | ↓ | 1.3039 | |
| WikiText103 | bits_per_byte | ↓ | 1.0037 | |
| WikiText103 | byte_perplexity | ↓ | 2.0052 | |
| WikiText103 | word_perplexity | ↓ | 41.2833 | |
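The byte- and word-normalized metrics are all rescalings of the same summed negative log-likelihood. The sketch below uses the standard definitions (total NLL in nats divided by a byte or word count); the function names are hypothetical and this is not the evaluation harness's own code.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Average NLL per byte, converted from nats to bits."""
    return total_nll_nats / (n_bytes * math.log(2))

def byte_perplexity(total_nll_nats: float, n_bytes: int) -> float:
    """Exponential of the average NLL per byte."""
    return math.exp(total_nll_nats / n_bytes)

def word_perplexity(total_nll_nats: float, n_words: int) -> float:
    """Exponential of the average NLL per word."""
    return math.exp(total_nll_nats / n_words)

# Consistency check on the reported WikiText103 row:
# byte_perplexity should equal 2 ** bits_per_byte.
reported_bpb = 1.0037
implied_byte_ppl = 2 ** reported_bpb
print(f"implied byte perplexity ~ {implied_byte_ppl:.4f}")
```

None of these quantities depend on how the text was split into tokens, which is why they can be compared across tokenizers.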

Architecture (GPT-2 Small, 124 million parameters)

| Setting | Value |
| --- | --- |
| Layers / heads / hidden | 12 / 12 / 768 |
| Max sequence length | 1024 |
| Vocab size | 50,257 (custom byte-level BPE; logits padded to 50,304 for efficient training on multiples of 64) |
| Dropout | 0.0 |
| Parameter dtype | bfloat16 |
| Notable | packed QKV projection |
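As a sanity check on the "124M" figure, the parameter count can be estimated from the table. The sketch below assumes the standard GPT-2 layout, which the release notes do not spell out: learned positional embeddings, input and output embeddings tied, biases on all linear layers, a 4x MLP expansion, and two LayerNorms per block plus a final one. `gpt2_param_count` is a hypothetical helper, and the count uses the unpadded vocabulary.

```python
def gpt2_param_count(n_layer=12, d_model=768, vocab=50_257, ctx=1024):
    """Parameter count under the standard GPT-2 layout (an assumption)."""
    wte = vocab * d_model                       # token embeddings (tied with output head)
    wpe = ctx * d_model                         # learned positional embeddings
    layer_norm = 2 * d_model                    # scale + bias
    qkv = d_model * 3 * d_model + 3 * d_model   # packed Q, K, V projection
    attn_out = d_model * d_model + d_model      # attention output projection
    mlp_in = d_model * 4 * d_model + 4 * d_model
    mlp_out = 4 * d_model * d_model + d_model
    block = 2 * layer_norm + qkv + attn_out + mlp_in + mlp_out
    return wte + wpe + n_layer * block + layer_norm  # plus final LayerNorm

total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

If the library stores the padded 50,304-row embedding or omits biases, the real count will differ slightly from this estimate.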

The tokenizer is a custom byte-pair encoding (BPE) tokenizer trained from scratch on OpenWebText (49,990 merges, 50,257 tokens, the same vocab size as OpenAI's GPT-2 BPE). The model's output layer is padded to 50,304 (the next multiple of 64) for efficiency; the extra 47 logit rows are unused.
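The padding arithmetic is a round-up to the next multiple of 64. A minimal sketch, where `pad_to_multiple` is an illustrative helper rather than a function from the library:

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50_257
padded_vocab = pad_to_multiple(vocab_size)
unused_rows = padded_vocab - vocab_size

print(padded_vocab, unused_rows)
```

The unused rows never correspond to a real token, so they only matter for the shape of the output matrix.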

Training configuration

| Setting | Value |
| --- | --- |
| Dataset | OpenWebText |
| Optimizer | AdamW (lr 6e-4, beta 0.95, weight_decay 0.1) |
| LR schedule | cosine + 1,000-step warmup, min_lr 1e-4, decay over 600k steps |
| Grad clipping | max-norm 1.0 |
| Global batch | 480 sequences (micro 60 × 8 grad-accum) |
| Tokens / step | 491,520 |
| Eval | mean val loss over 100 batches |
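The schedule and batch settings can be sketched as follows. The exact shape is not given in the release notes, so this assumes a linear warmup to the peak rate followed by a cosine decay that runs from the end of warmup to step 600,000; `lr_at` is a hypothetical helper, not the library's API.

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=1e-4, warmup=1_000, decay_steps=600_000):
    """Learning rate at a given step: linear warmup, then cosine decay (assumed shape)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup          # assumed linear warmup
    if step >= decay_steps:
        return min_lr                                # floor after the decay window
    progress = (step - warmup) / (decay_steps - warmup)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

# Tokens per optimizer step: micro-batch x grad-accum steps x sequence length.
micro_batch, grad_accum, seq_len = 60, 8, 1024
tokens_per_step = micro_batch * grad_accum * seq_len

print(tokens_per_step)
print(f"lr at step 56,000 ~ {lr_at(56_000):.2e}")
```

Under these assumptions the learning rate at step 56,000 is still close to its peak, which is consistent with the checkpoint being early in its planned schedule.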

Limitations

This is a research / educational artifact demonstrating that a deep learning library written from scratch can train GPT-2 to a reasonable level, close to the GPT-2 small checkpoint from Hugging Face.

It is undertrained relative to its own schedule (56k of 600k planned steps).
