Tiny hackable CUDA language model implementation

GitHub - markusheimerl/gpt: A generative pretrained transformer implementation · GitHub

gpt

A generative pretrained transformer implementation

This project implements an autoregressive sequence model using a transformer architecture. The model processes sequences of bytes (8-bit tokens), learning to predict the next byte given previous context. While this implementation trains on text data, the architecture is agnostic to the content. It can model any byte stream, including, but not limited to, DNA/RNA sequences, compressed data, images, audio, video, or executable binaries.
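As a rough illustration of next-byte prediction, the sketch below builds input/target pairs from a raw byte buffer: the target at each position is simply the byte that follows the input. The function name `make_next_byte_pairs` and its signature are hypothetical and not taken from the repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the repository): fills
   inputs[i] = data[i] and targets[i] = data[i + 1] for i in [0, seq_len).
   Returns 0 on success, -1 if data is too short to supply seq_len targets. */
int make_next_byte_pairs(const uint8_t *data, size_t data_len, size_t seq_len,
                         uint8_t *inputs, uint8_t *targets) {
    assert(data && inputs && targets);
    if (data_len < seq_len + 1) return -1;
    for (size_t i = 0; i < seq_len; i++) {
        inputs[i] = data[i];       /* context byte at position i */
        targets[i] = data[i + 1];  /* byte the model should predict */
    }
    return 0;
}
```

Because the pairs are plain bytes, the same routine would apply unchanged to text, audio, or any other byte stream.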

The architecture begins with a token embedding layer that converts each byte into a continuous vector representation.
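An embedding layer of this kind can be thought of as a table lookup: each of the 256 byte values selects one row of a learned matrix. The following is a minimal sketch under that assumption; `embed_bytes`, `VOCAB_SIZE`, and the row-major layout are illustrative choices, not the repository's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VOCAB_SIZE 256  /* one row per possible byte value */

/* Hypothetical helper: copies the embedding row of each byte into out.
   table is VOCAB_SIZE x d_model, out is seq_len x d_model, both row-major. */
void embed_bytes(const float *table, size_t d_model,
                 const uint8_t *tokens, size_t seq_len, float *out) {
    assert(table && tokens && out);
    for (size_t t = 0; t < seq_len; t++) {
        memcpy(out + t * d_model,
               table + (size_t)tokens[t] * d_model,
               d_model * sizeof(float));
    }
}
```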

The core of the model is a multi-layer transformer that processes the embedded sequences. Each transformer layer consists of two main components: a causal self-attention mechanism and a feed-forward network, both wrapped with residual connections. The causal attention ensures that the prediction at each position depends only on that position and the ones before it, which is essential for autoregressive generation. The attention mechanism computes query, key, and value projections, applies rotary positional encoding (RoPE) to the queries and keys to encode relative positions, computes scaled dot-product attention with a causal mask, and projects the result back. The feed-forward network applies two linear transformations with a swish activation in between. Swish is a smooth, non-monotonic function that multiplies its input by its sigmoid.

After processing through all transformer layers, a linear projection maps the final hidden states to logits over the vocabulary (all 256 possible byte values). These logits are converted to probabilities using the softmax function, and the model is trained to maximize the probability of the correct next byte by minimizing the cross-entropy loss.

The training process uses the AdamW optimizer, which modifies the standard Adam optimizer by decoupling weight decay from the gradient-based update. AdamW maintains exponential moving averages of both gradients and squared gradients, using these to adapt the learning rate for each parameter individually. The decoupled weight decay shrinks each weight directly toward zero at every step. It plays a role similar to L2 regularization (although the two are not equivalent under adaptive optimizers such as Adam), encouraging the model to use smaller weights and improving generalization.

The implementation uses BLAS (Basic Linear Algebra Subprograms) for efficient matrix operations, allowing the model to train effectively on modern hardware.

How to run

Ubuntu

    sudo apt update
    sudo apt install -y clang make time libopenblas-dev nvidia-cuda-toolkit git curl
    git clone https://github.com/markusheimerl/gpt && cd gpt/
    make data
    make run -j 6
    make infer

Sample outputs

Prompted with "Once upon a time, there was a":

    Once upon a time, there was a little girl named Mia. Mia loved to study with her toys. She had a big box full of toys in her room. One day, Mia found a new toy. The toy was a small doll. The doll had a pretty dress and smiled a little.
    Mia took the doll outside to play. She studied hard and felt the dress on her f

    markus@thinkpad:~/gpt$ make infer
    Loaded: d_model=512 hidden=1024 layers=16 vocab=256 seq_len=1024
    Generating 995 tokens (T=0.70, seed=1779612665)
    Once upon a time, there was a little boy named Tim. Tim had a big tree in his yard. He loved to run and play in the tree. One day, he saw a perfect bird in his yard. The bird was sad because it could not find its mom.
    Tim wanted to help the bird. He kneeled down and looked all around. He saw a little girl named Sue. Sue was playing with a ball. Tim asked her, "How can I be like your bird?" Sue...
