Making a Vintage LLM from Scratch

Making a vintage LLM from scratch - Cr;Lf;

Making a vintage LLM from scratch

2026 May 25, Mon

50 min

In this blog post, I will share the adventures I had creating my own LLM, from (almost) scratch, trained only on old texts. I made my own base-training and fine-tuning scripts, data processing pipelines and custom datasets. ("almost from scratch" means I did use existing programming languages and libraries, I didn't write in Assembly, just like anyone else who builds an AI "from scratch"...)

The model can be found on HuggingFace: https://huggingface.co/croqaz/vintage-LLM-340m-v1-base ; All the code is open source at: https://github.com/croqaz/vintage-LLM ; If you want to check bigger Vintage models, see my previous post: Vintage LLM models.

The idea

Three months ago at the end of February I discovered a few Reddit posts by Hayk Grigorian, where he described creating his temporal gated language model. I was absolutely fascinated.

Training an LLM only on 1800s London texts, 90GB dataset: https://reddit.com/r/LocalLLaMA/comments/1pkpsee/training_an_llm_only_on_1800s_london_texts_90gb

LLM trained from scratch on only 1800s London texts brings up a real protest from 1834: https://reddit.com/r/LocalLLaMA/comments/1mvnmjo/my_llm_trained_from_scratch_on_only_1800s_london

Obviously I read other posts from other people that made their own LLMs, but maybe I wasn't ready to do it myself, or the model they were working on wasn't that interesting. Anyway, the thought of having my own Victorian chat bot... fuckin' epic !!

Since then, I worked on my own "Vintage LLM" every single day. Without exceptions. Even when I was sick.

In the meantime, a lot more historic LLMs have been released like: Violet-1B4-Chat, Mr. Chatterbox, GPT-1900, Talkie and TypewriterLM-base.

What, why, where and how?

What? This is a time-locked LLM/ historical LLM, English only, and its knowledge cutoff is year 1900. (Limiting to a specific year is error prone, but I did my best effort). It is based on Llama architecture and has 340M (0.3B) params.

Why? Because I can only learn if I do it myself and it's a super fun project.

Where and how? I made my own dataset, my own processing and training code. The code is semi-vibe-coded with whatever LLM I had with VS-Code and PI (OpenRouter models). I checked and validated every single function and I deeply understand what every single code file is doing. The dataset processing took the most and I tried all sorts of things that didn't work, and I wasted a ton of time. Complicated solutions are the worst...

I processed all the data on my own PC and I trained smaller versions of the LLM on my PC (Cachy OS Linux, AMD Ryzen 7 9700X CPU, 64GB RAM, Radeon RX 9070 16GB VRAM). As for the larger 340M model, I trained it on RunPod, ThunderCompute and Vast.ai. It would have taken forever on my PC.

The total cost of this project was: ~$80, GPU costs only. That's because I have a decent PC to process the data. If I had more RAM, I could have processed some of the data much faster, especially when it comes to de-duplicating texts in memory.

Disclaimer : This is a toy/ hobby LLM (but I treat it very seriously). It will hallucinate and generate historic semi-accurate content which, at the time was considered normal but by today's standards is considered: toxic, offensive and unsafe. This is expected, because I didn't do any alignment. Aligning (or censoring) the model requires significant effort and it would ruin the historic accuracy. Also, I can't guarantee that my model is strictly limited to the year 1900 (even if I did my best) eg: as to perform the "Albert Einstein test".

The plan

I use AI everyday at work and I understand how it works, but I never built an LLM myself. I ran specific AI training and fine-tuning pipelines at work, I built tiny neural networks in C and Python in the past, but when I started this project I didn't know how people are usually building LLMs.

I searched for a week and I chatted with multiple bots to get different points of view (like I always do when I research a topic).

In short, to build an LLM you need 4 things:

the data -- an LLM has no discernment or understanding. It will learn from anything you tell it to, good or bad. This is the longest process.

tokenization -- the Tokenizer is a little program that converts words or letters into numbers (tokens). LLMs don't understand words, they only understand numbers.

pre-training -- it's a confusing expression and it means "base-training", where the LLM learns to autocomplete text. If you're going for a 300m+ params, this is the most expensive process.

fine-tuning -- where the LLM learns how to chat in turns, question & answer.

Well, there's a bit more to it than these simple steps, but I won't go super deep in this article.

Now, let's look at each step in more detail.

Initial experiments

It's worth mentioning that I made lots of mistakes and I experimented with some datasets and model architectures before I settled...

Making a Vintage LLM from Scratch

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs