Which flavor of software are LLMs exactly?


Silvestre Perret

15 May, 2026

Preamble: Long time no see! It’s been work, work, work lately. I even went to Google Cloud Next 2026, which was a blast (expect a future blog post…).

In my previous blog post, I introduced Software 2.0, a concept coined by Andrej Karpathy and more commonly known as Machine Learning (ML): a new software paradigm where human-written instructions are replaced by weights automatically tuned to get the desired behavior¹.

Let’s build upon that and discuss how Large Language Models (LLMs) fit this picture. LLMs are the foundation of what most people have called Artificial Intelligence (AI) for the last 4 years (yes Dad, even ChatGPT). They are technically a special kind of Neural Network (NN) and should fit nicely in the Software 2.0 category… but they don’t.

¹ Note that the process used to train these weights still relies on good ol’ Software 1.0 instructions.

1. The nice and cozy pre-LLM era.

Artificial Intelligence and Neural Networks have been around pretty much since the invention of computers². Neither is particularly new. Initially limited by hardware capabilities and by their tendency to misbehave during training, Neural Networks were not the sharpest tools in the shed of Machine Learning techniques for years. But around 2012, three distinct trends met. First, the hardware DID finally catch up. Second, researchers discovered a few small tweaks to known techniques that greatly improved the training process of Neural Networks, making it faster and more stable. Third, as the internet grew, bigger datasets than ever before were put together. In the following years, Neural Networks quickly gained a lot of interest and became the state-of-the-art (a.k.a. the best) method to classify images, detect objects or people in images, and much more.

But during that time, Neural Networks still respected most of the numerous “rules” of Machine Learning that practitioners either stole from statisticians (them again!) or painfully learned through experimentation. First, Machine Learning models were trained specifically for the tasks they would be used for. If you wanted a Neural Network to detect squirrels eating your precious flowers, you needed to train³ a Neural Network to detect squirrels, and you needed a squirrel dataset to do so. Second, as with other kinds of Machine Learning models, you had to limit the complexity of the model to match the amount of available training data. Indeed, Neural Networks are very good at memorizing data, and if you give them too much power (too many weights) for the amount of data they are trained on, they will just memorize the training data and not learn to generalize to new data. For our example, not generalizing would mean having a model only able to detect squirrels with exactly the same background, lighting and flowers as in the training data. Not super useful… Finding the right balance between model complexity and training data size was still important.
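To make the memorization problem concrete, here is a minimal toy sketch in plain Python. It is a caricature, not a real Neural Network: the "too powerful" model is a lookup table with one stored answer per training example, and the "limited" model has a single weight (a threshold). The data, the function names and the rule "label is 1 when x > 0.5" are all made up for this illustration.

```python
# Toy illustration only: names, data and the labeling rule are hypothetical.

def true_label(x):
    """Ground truth: 1 ("squirrel") when the feature x exceeds 0.5."""
    return 1 if x > 0.5 else 0

def make_data():
    """Return a small training set and a larger set of unseen points."""
    train_x = [i / 20 for i in range(20)]            # 20 training points
    test_x = [(j + 0.5) / 200 for j in range(200)]   # 200 points never seen in training
    train = [(x, true_label(x)) for x in train_x]
    test = [(x, true_label(x)) for x in test_x]
    return train, test

def fit_memorizer(train):
    """'Too much capacity': store one answer per training example."""
    table = dict(train)
    return lambda x: table.get(x, 0)  # unseen input -> default guess 0

def fit_threshold(train):
    """'Limited capacity': a single weight, the decision threshold."""
    xs = sorted(x for x, _ in train)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        # Number of training examples misclassified by threshold t.
        return sum((1 if x > t else 0) != y for x, y in train)

    best = min(candidates, key=errors)
    return lambda x: 1 if x > best else 0

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

if __name__ == "__main__":
    train, test = make_data()
    for name, fit in [("memorizer", fit_memorizer), ("threshold", fit_threshold)]:
        model = fit(train)
        print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

Both models are perfect on the training data, but only the one-weight model keeps working on inputs it has never seen. The lookup table falls back to a blind guess, which is the squirrel detector that only works with the exact same background and lighting.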

Until then, everything was rosy. The Neural Networks, while powerful, still behaved like the rest of the Machine Learning models. While there was a lot to learn and to adapt, ML practitioners (at least me) still felt more or less at home.

² According to Wikipedia, the first “artificial neurons” were designed in 1943, before computers were even a thing.

³ or fine-tune one. Fine-tuning here means taking an existing Neural Network trained to do a similar (often harder) task, like recognizing up to 21,841 different kinds of animals and objects, and then re-tweaking its already-learned weights on your specific squirrel dataset. In order not to lose the initial performance of the trained model, the “re-tweaking” procedure is slightly different from a full training procedure, but it requires way less data and compute.
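The compute side of fine-tuning can be sketched with a toy model. The snippet below is only an illustration under strong assumptions: a 2-weight linear model stands in for a Neural Network, the "generic" and "squirrel" tasks are two made-up lines that differ slightly, and fine-tuning is reduced to "start from the pretrained weights and run only a few gradient steps". It does not model the precautions real fine-tuning takes to preserve the initial performance.

```python
# Toy sketch: a 2-weight linear model stands in for a Neural Network.
# All names, tasks and numbers here are hypothetical.

XS = [-1.0, -0.5, 0.0, 0.5, 1.0]

def make_task(slope, intercept):
    """Build a tiny dataset of (x, y) pairs for y = slope * x + intercept."""
    return [(x, slope * x + intercept) for x in XS]

def loss(weights, data):
    """Mean squared error of the model y = w * x + b."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(weights, data, steps, lr=0.1):
    """Plain gradient descent on the mean squared error."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

generic_task = make_task(2.0, 1.0)    # stand-in for the big, similar task
squirrel_task = make_task(2.0, 1.5)   # stand-in for the small task we care about

pretrained = train((0.0, 0.0), generic_task, steps=500)     # expensive, done once
fine_tuned = train(pretrained, squirrel_task, steps=10)     # cheap: only a few steps
from_scratch = train((0.0, 0.0), squirrel_task, steps=10)   # same small budget, no head start

print("fine-tuned loss:  ", loss(fine_tuned, squirrel_task))
print("from-scratch loss:", loss(from_scratch, squirrel_task))
```

With the same small training budget, the model that starts from the pretrained weights ends up much closer to the target than the one that starts from zero, because most of the work was already done on the similar task.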

2. LLMs did not kindly knock at the door

Then, between 2017 and 2022, multiple breakthroughs occurred in succession and opened the new era we are in.

First and most famously, in 2017, a Google R&D team gave birth to a new Neural Network architecture called the Transformer (in what is now probably the most cited paper ever in the ML field). It’s hard to overstate the improvements this architecture brought: way easier to parallelize than the alternatives (important if you want to train on a lot of data), more stable to train than RNNs or LSTMs (the previous state-of-the-art for sequence modeling), and extremely expressive (meaning it can learn all sorts of patterns in the data) while being more parameter-efficient⁴. The initial paper applied this new architecture to translation tasks and featured 2 Neural Networks that were relatively large for the time: ~65 million and ~213 million parameters, taking respectively 12 hours and 3.5 days to train on a small GPU...
