GitHub - zakirullin/gpt-go: Tiny GPT implemented from scratch in pure Go. Trained on Jules Verne books. Explained. · GitHub
/" data-turbo-transient="true" />
Skip to content
Search or jump to...
Search code, repositories, users, issues, pull requests...
-->
Search
Clear
Search syntax tips
Provide feedback
--><br>We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Cancel
Submit feedback
Saved searches
Use saved searches to filter your results more quickly
-->
Name
Query
To see all available qualifiers, see our documentation.
Cancel
Create saved search
Sign in
/;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up
Appearance settings
Resetting focus
You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.
Dismiss alert
{{ message }}
zakirullin
gpt-go
Public
Notifications<br>You must be signed in to change notification settings
Fork<br>45
Star<br>639
main
BranchesTags
Go to file
CodeOpen more actions menu
Folders and files<br>NameNameLast commit message<br>Last commit date<br>Latest commit
History<br>391 Commits<br>391 Commits
data
data
pkg
pkg
.gitignore
.gitignore
LICENSE
LICENSE
README.md
README.md
block.go
block.go
go.mod
go.mod
go.sum
go.sum
head.go
head.go
layer.go
layer.go
main.go
main.go
main_test.go
main_test.go
View all files
Repository files navigation
gpt-go
Simple GPT implementation in pure Go. Trained on favourite Jules Verne books.
What kind of response you can expect from the model:
Mysterious Island.<br>Well.<br>My days must follow
Or this:
Captain Nemo, in two hundred thousand feet weary in<br>the existence of the world.
How to run
$ go run .
It takes about 40 minutes to train on MacBook Air M3. The trained weights will be saved to model-1.234M file. If you rerun the model, it will pick up the saved weights and continue training. The loss should decrease each time, indicating that the model is learning something useful.
You can train on your own dataset by pointing the data.dataset variable to your text corpus.
To run in chat-only mode once the training is done:
$ go run . -chat
How to understand
You can use this repository as a companion to the Neural Networks: Zero to Hero course. Use git checkout to see how the model has evolved over time: naive, bigram, multihead, block, residual, full.
In main_test.go you will find explanations starting from basic neuron example:
// Our neuron has 2 inputs and 1 output (number of columns in weight matrix).<br>// Its goal is to predict next number in the sequence.<br>input := V{1, 2} // {x1, x2}<br>weight := M{<br>{2}, // how much x1 contributes to the output<br>{3}, // how much x2 contributes to the output
All the way to self-attention mechanism:
// To calculate the sum of all previous tokens, we can multiply by this triangular matrix:<br>tril := M{<br>{1, 0, 0, 0}, // first token attends only at itself ("cat"), it can't look into the future<br>{1, 1, 0, 0}, // second token attends at itself and the previous token ( "cat" + ", ")<br>{1, 1, 1, 0}, // third token attends at itself and the two previous tokens ("cat" + ", " + "dog")<br>{1, 1, 1, 1}, // fourth token attends at itself and all the previous tokens ("cat" + ", " + "dog" + " and")<br>}.Var()<br>// So, at this point each embedding is enriched with the information from all the previous tokens.<br>// That's the crux of self-attention.<br>enrichedEmbeds := MatMul(tril, inputEmbeds)
Design choices
No batches.
I've given up the complexity of the batch dimension for the sake of better understanding. It's far easier to build intuition with 2D matrices, rather than with 3D tensors. Besides, batches aren't inherent to the transformer architecture. For better gradient smoothing gradient accumulation was tried. The effect was negligible, so it was removed as well.
Removed gonum.
The gonum.matmul gave us ~30% performance boost, but it brought additional dependency. We're not striving for maximum efficiency here, rather for radical simplicity. Current matmul implementation is quite effective, and it's only 40 lines of plain readable code.
Papers
You don't need to read them to understand the code :)
Attention Is All You Need
Deep Residual Learning
DeepMind WaveNet
Batch Normalization
Deep NN + huge data = breakthrough performance
OpenAI GPT-3 paper
Analyzing the Structure of Attention
Credits
Many thanks to Andrej Karpathy for his brilliant Neural Networks: Zero to Hero course.
Thanks to @itsubaki for his elegant autograd package.
About
Tiny GPT implemented from scratch in pure Go. Trained on Jules Verne books. Explained.
Resources
Readme
License
MIT license
Uh oh!
There was an error while loading. Please reload this page.
Activity
Stars
639<br>stars
Watchers
watching
Forks
45<br>forks
Report repository
Releases
tags
Packages
Uh oh!
There was...