Making a vintage LLM from scratch - Cr;Lf;
Making a vintage LLM from scratch
2026 May 25, Mon
50 min
In this blog post, I will share the adventures I had creating my own LLM from (almost) scratch, trained only on old texts.<br>I made my own base-training and fine-tuning scripts, data processing pipelines and custom datasets.<br>("Almost from scratch" means I did use existing programming languages and libraries; I didn't write it in Assembly, and neither does anyone else who builds an AI "from scratch"...)
The model can be found on HuggingFace: https://huggingface.co/croqaz/vintage-LLM-340m-v1-base<br>All the code is open source at: https://github.com/croqaz/vintage-LLM<br>If you want to check out bigger vintage models, see my previous post: Vintage LLM models.
The idea
Three months ago, at the end of February, I discovered a few Reddit posts by Hayk Grigorian, in which he described creating his temporal gated language model. I was absolutely fascinated.
Training an LLM only on 1800s London texts, 90GB dataset:<br>https://reddit.com/r/LocalLLaMA/comments/1pkpsee/training_an_llm_only_on_1800s_london_texts_90gb
LLM trained from scratch on only 1800s London texts brings up a real protest from 1834:<br>https://reddit.com/r/LocalLLaMA/comments/1mvnmjo/my_llm_trained_from_scratch_on_only_1800s_london
Obviously, I had read posts from other people who made their own LLMs, but maybe I wasn't ready to do it myself, or the models they were working on weren't that interesting. Anyway, the thought of having my own Victorian chat bot... fuckin' epic!!
Since then, I worked on my own "Vintage LLM" every single day. Without exceptions. Even when I was sick.
In the meantime, many more historical LLMs have been released, such as Violet-1B4-Chat, Mr. Chatterbox, GPT-1900, Talkie and TypewriterLM-base.
What, why, where and how?
What?<br>This is a time-locked (historical) LLM, English only, with a knowledge cutoff of the year 1900.<br>(Limiting it to a specific year is error-prone, but I made my best effort.)<br>It is based on the Llama architecture and has 340M (0.3B) params.
Why?<br>Because I can only learn if I do it myself and it's a super fun project.
Where and how?<br>I made my own dataset and my own processing and training code.<br>The code is semi-vibe-coded with whatever LLM I had available in VS-Code and PI (OpenRouter models).<br>I checked and validated every single function, and I deeply understand what every single code file is doing.<br>The dataset processing took the most time. I tried all sorts of things that didn't work, and I wasted a ton of time. Complicated solutions are the worst...
I processed all the data on my own PC, and I also trained smaller versions of the LLM there (CachyOS Linux, AMD Ryzen 7 9700X CPU, 64GB RAM, Radeon RX 9070 with 16GB VRAM).<br>As for the larger 340M model, I trained it on RunPod, ThunderCompute and Vast.ai. It would have taken forever on my PC.
The total cost of this project was ~$80, GPU costs only.<br>That's because I have a decent PC to process the data. If I had more RAM, I could have processed some of the data much faster, especially when it comes to de-duplicating texts in memory.
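To picture why in-memory de-duplication is hungry for RAM, here is a minimal sketch of exact de-duplication by hashing. This is not the code from my repository: the `normalize` and `dedup_exact` helpers and the sample texts are made up for illustration, and a real pipeline would likely also need near-duplicate detection, which this sketch does not do.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace,
    # so trivially different copies of a text hash to the same value.
    return " ".join(text.lower().split())


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    # One 20-byte digest is kept in memory for every unique document,
    # so the set grows with the size of the corpus.
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc  # first occurrence wins, later copies are dropped


docs = ["Hello  World", "hello world", "Another text"]
unique = list(dedup_exact(docs))
```

Storing digests instead of full texts keeps the set small, but with many millions of documents it still adds up, and anything that doesn't fit in RAM has to be processed in slower batches.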
Disclaimer: This is a toy/hobby LLM (but I treat it very seriously).<br>It will hallucinate and generate semi-accurate historical content which was considered normal at the time, but by today's standards is considered toxic, offensive and unsafe. This is expected, because I didn't do any alignment. Aligning (or censoring) the model requires significant effort and would ruin the historical accuracy.<br>Also, I can't guarantee that my model is strictly limited to the year 1900 (even though I did my best), for example strictly enough to pass the "Albert Einstein test".
The plan
I use AI every day at work and I understand how it works, but I had never built an LLM myself. I have run specific AI training and fine-tuning pipelines at work and built tiny neural networks in C and Python in the past, but when I started this project I didn't know how people usually build LLMs.
I searched for a week and I chatted with multiple bots to get different points of view (like I always do when I research a topic).
In short, to build an LLM you need 4 things:
the data -- an LLM has no discernment or understanding. It will learn from anything you feed it, good or bad. Preparing the data is the longest step.
tokenization -- the tokenizer is a little program that converts words or letters into numbers (tokens). LLMs don't understand words; they only understand numbers.
pre-training -- a confusing expression that means "base-training", where the LLM learns to autocomplete text. If you're going for 300M+ params, this is the most expensive step.
fine-tuning -- where the LLM learns how to chat in turns, question & answer.
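To make the tokenization and pre-training steps a bit more concrete, here is a toy sketch. It uses a character-level tokenizer for simplicity; real tokenizers usually work on sub-word pieces, and this is not the tokenizer used for my model. All the function names and the sample sentence are made up for illustration.

```python
def build_vocab(text: str) -> dict[str, int]:
    # Give every distinct character its own integer id, in sorted order.
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    # Text -> list of token ids (the only thing the model ever sees).
    return [vocab[ch] for ch in text]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    # Token ids -> text, by inverting the vocabulary.
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


def next_token_pairs(ids: list[int], context: int) -> list[tuple[list[int], int]]:
    # "Autocomplete" training examples: `context` tokens as input,
    # and the single token that follows them as the target.
    return [(ids[i:i + context], ids[i + context])
            for i in range(len(ids) - context)]


corpus = "the quick brown fox"
vocab = build_vocab(corpus)
ids = encode(corpus, vocab)
pairs = next_token_pairs(ids, context=4)

print(ids)       # the sentence as numbers
print(pairs[0])  # first training example: 4 input ids and the next id
```

During base-training, the model sees huge numbers of such (input, target) pairs and adjusts its weights to predict the target. Fine-tuning uses the same mechanism, only on text formatted as conversation turns.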
Well, there's a bit more to it than these simple steps, but I won't go super deep in this article.
Now, let's look at each step in more detail.
Initial experiments
It's worth mentioning that I made lots of mistakes and I experimented with some datasets and model architectures before I settled...