-->
Lean Software Scaling Laws · Gwern.net
Skip to main content
Warning: JavaScript Disabled!
For support of key website features (link annotation popups/popovers & transclusions, collapsible sections, backlinks, tablesorting, image zooming, sidenotes etc.), you must enable JavaScript.
Research proposal for measuring how coding LLM perplexity scales with codebase context size, using Lean as a test case for whether formal languages have better predictability exponents and could lead to safer, more secure software worldwide.
2025-09-01–2026-06-23<br>finished<br>certainty: possible<br>importance: 7
Language Priors
Scale Failure
Lean Bottleneck
Predictability Proxy
Measurement Design
Crossover Forecast
Bias Controls
Context Learning
Failure Modes
Research idea: empirically measure the scaling of coding LLM perplexity over codebase size to estimate the scaling laws of ‘predictability’ by programming language or other factors. This should translate into overall security and safety.
We can measure this in contemporary LLMs expensively, by training from scratch and finetuning, or cheaply, by measuring perplexity over increasingly large context windows of source code.
Codebases, and programming languages, which have better exponents in their scaling laws will eventually become easier for LLMs to understand, fix, and write.
In particular, the Lean programming language likely has, with 2026-era LLMs, a worse baseline constant and total loss on existing codebases, but better scaling exponents. This would imply that implementations in Lean can eventually win and deliver large benefits in program correctness at global scale—and thus could help justify large-scale investments in rewriting existing codebases in Lean or paying for new Lean code, thereby improving global cybersecurity.
Coding LLMs are currently on track to produce most software in the near-future, despite being generally mediocre quality or outright insecure (with vibecoded software being especially bad). Future rewrites with coding LLMs may help, but are not guaranteed to happen or to plug as many holes as we need to be secure against pervasive cybersecurity LLM offensives. How can we avoid this? LLMs could potentially write all software in provably secure, safe ways like formally-verifiable systems, but progress in that lags behind.
How far behind?
Language Priors
Neural scaling law methodology remains under-applied in deep learning for validating existing approaches and forecasting future applications. An example is in coding agents: it’s commonly observed that LLMs are better at more common languages due to more available data, and Luo et al 2025 argues that programming is especially data-hungry, and thus there might be long-term ‘lock-in’ and upgrading to better technologies like Haskell or Rust or Lean will be impossible.
But this does not follow: being a popular language with a lot of training data only means that LLMs start off by default performing well. (Because it’s hard to disentangle a programming language from its ecosystem as a whole, here I will just refer to ‘languages’.)
Any corpus can be seen as a ‘prior’, with a certain exchange rate between languages where better Python skills will partially transfer to, say, Haskell (see Yang et al 2025). So it’s possible that a ‘rising tide will lift all boats’, and in fact, better coding LLMs might lead to a renaissance for obscurer languages when the LLMs get ‘good enough’, and programmers need no longer spend years relearning everything.
Scale Failure
It also doesn’t mean that they will keep performing well as codebases expand. Many things work well at small scales, but increasingly fail at scale. The ‘BDSM’ discipline of a language like Haskell can be infuriating and painful if you’re writing a quick throwaway program, but may become indispensable when you have a million lines of complex code. So, it may be easy for an LLM to write a little Python… but what about a lot? A short script is easy to write and has few gotchas, especially since an LLM can easily look at the whole thing in its context window. But what happens as the codebase gets larger and more complex and older and no longer fits into context windows? Can it keep the dynamic types and exceptions straight? What about that monkey-patching over in that dark corner? Or that dependency which changed runtime behavior during an upgrade? Because if it can’t, even a single error can be fatal and force a human programmer to spend weeks debugging subtle errors.
Perhaps the most extreme programming language which is plausible for implementing large software programs, in terms of runtime speed and programming language power, is Lean, which has been enjoying a sudden burst of popularity as people and LLMs discover its use beyond theorem-proving, for implementing real programs. Even things like a zlib rewrite in Lean!
Lean Bottleneck
We can now imagine rewriting software, and writing new software, in Lean, to eliminate...