A survey of inlining heuristics | Max Bernstein
June 3, 2026
Compilers, especially method just-in-time compilers, operate on one function at a time. It is a natural code unit size, especially for a dynamic language JIT: at a given point in time, what more information can you gather about other parts of a running, changing system?
I don’t have any data to back this up (maybe I should go gather some), but on average, methods are small. Especially in languages such as Ruby that use method dispatch for everything, even instance variable (attribute, field, …) lookups, they are small. And everywhere.
This makes the compiler sad. If we are to continue to anthropomorphize them, compilers like having more context so they can optimize better. Consider the following silly-looking example that is actually representative of a surprising amount of real-world code:
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  def distance(other)
    Math.sqrt((@x - other.x)**2 + (@y - other.y)**2)
  end
end

def distance_from_origin(x, y)
  Point.new(x, y).distance(Point.new(0, 0))
end
Right now, in the distance_from_origin method, I count 8 different method calls:
Point.new
Point#initialize
Point.new
Point#initialize
Point#distance
Float#**
Float#**
Math.sqrt
(Technically more, but the ivar lookups (including attr_reader!), addition, and subtraction are generally specialized and don’t push a frame, even in the interpreter.)
Furthermore, there are at least two heap allocations: one for each Point instance.
Last, there is a bunch of memory traffic to and from Point instances.
This all is a huge bummer! What should be a simple math operation is now overwhelmed with a bunch of other stuff. Point is certainly not a zero-cost abstraction.
Even if we had a bunch of other optimizations such as load-store elimination or escape analysis, they would not be able to do much: pretty much everything escapes and is effectful. That is, unless we inline. Inlining is the lever that enables a bunch of other optimization passes to kick in.
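To make that concrete, here is roughly what a compiler could be left with once every call in distance_from_origin has been inlined and the follow-up passes have run. This is a hand-written sketch, not the output of any real JIT. The name distance_from_origin_inlined is made up for illustration, and the sketch assumes nobody has redefined Point's methods.

```ruby
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  def distance(other)
    Math.sqrt((@x - other.x)**2 + (@y - other.y)**2)
  end
end

# The original: two allocations and a pile of method calls.
def distance_from_origin(x, y)
  Point.new(x, y).distance(Point.new(0, 0))
end

# Hypothetical result after inlining Point.new, Point#initialize, the
# attr_readers, and Point#distance. Once the bodies are visible, neither
# Point outlives the method, so escape analysis can drop both allocations
# and the ivar stores and loads collapse into plain locals.
def distance_from_origin_inlined(x, y)
  Math.sqrt((x - 0)**2 + (y - 0)**2)
end
```

None of the cleanup in the second method is inlining itself. Inlining only exposes the callee bodies; the allocation removal and load forwarding are other passes that finally have something to work with. A real JIT would also need guards for the assumptions above, such as Point#distance not having been redefined.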
Inlining: the “easy” part
I wrote about the design and implementation of Cinder’s inliner (FB link, personal blog link) a couple of years ago. I wrote about arguably the simplest part, which is copying the callee body into the caller. It took me at least a week to get working. Probably closer to months if you consider all the plumbing through the rest of the JIT. In February during a small hackathon, I watched my colleague k0kubun prototype that bit of the inliner inside ZJIT in about 30 minutes.
There is more to do when pretty much every part of the VM is observable from the guest language: both Python and Ruby allow inspecting the state of the locals, the call stack, etc. from user code. Sampling profilers also expect some amount of breadcrumbs to work with to inspect the stack. So there’s some more machinery still required to pretend like the callee function was not inlined. I talk about this a little bit in the Cinder blog post.
Even so, all of that can probably be designed and wired together in a couple of months. Then you will find yourself tuning the inliner for the next 10 years. This is much harder.
When: the harder part
The thing that makes inlining difficult, especially in a method JIT, is that you are trying to make an entire (dynamic!) system faster but you are only looking through a microscope and only capable of local reasoning[1]. Whereas other optimizations such as strength reduction, inline caches, and value numbering are an unalloyed good for the generated code, inlining can have negative effects. It is also perhaps the first optimization people add that has non-local impact.
If you inline wrong, your code size might blow up. This might thrash your CPU’s caches. Bummer, but happens to the best of us.
But also, if you inline wrong, you might get in the way of other helpful optimizations: if you hit some size limit after inlining method A, you might never get to inline B, which is the key to unlocking the performance of the method you are trying to optimize.
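The A-versus-B problem is easy to reproduce with a toy model. The sketch below assumes a greedy inliner with a single per-caller size budget. CallSite, inline_greedily, and every number in it are invented for illustration and do not come from any real compiler.

```ruby
# A hypothetical call site inside one caller. Sizes are in made-up
# "instruction" units.
CallSite = Struct.new(:name, :size)

# Greedily inline call sites in the order given, skipping any that would push
# the caller over the budget. Returns the names of the sites that got inlined.
def inline_greedily(caller_size, call_sites, budget)
  inlined = []
  call_sites.each do |site|
    next if caller_size + site.size > budget
    caller_size += site.size
    inlined << site.name
  end
  inlined
end

a = CallSite.new(:a, 80) # big, and not very helpful
b = CallSite.new(:b, 40) # the one that would unlock other optimizations

inline_greedily(50, [a, b], 150) # => [:a]
inline_greedily(50, [b, a], 150) # => [:b]
```

Same caller, same callees, same budget: the only difference is visit order, and in the first case the budget is spent before the inliner ever gets to B.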
Last, inlining might hurt compile time. In situations where latency is paramount (think: interactive client JavaScript), adding tons more code into the fray might add noticeable hiccups, even if the long-term throughput improves. As always, in-band compilation is a trade-off because any time you spend compiling, you are not executing code.
You have to write your compiler to reason about all of this stuff. So you have heuristics. For example, here is Michael Pollan’s inliner heuristic:
Inline methods. Mostly small. Not too many.
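Taken literally, that heuristic is a two-threshold predicate. Here is a minimal sketch, assuming the compiler can measure callee size in some unit and counts how many inlines it has already done in the current caller. The constant names and values are placeholders I made up, not numbers from any compiler.

```ruby
# Made-up thresholds for illustration only.
MAX_CALLEE_SIZE = 30        # "Mostly small."
MAX_INLINES_PER_CALLER = 8  # "Not too many."

# Decide whether to inline one callee into the current caller.
def should_inline?(callee_size, inlines_so_far)
  callee_size <= MAX_CALLEE_SIZE && inlines_so_far < MAX_INLINES_PER_CALLER
end

should_inline?(10, 0)  # => true
should_inline?(100, 0) # => false, callee too big
should_inline?(10, 8)  # => false, caller already has enough inlines
```

Nothing in this predicate knows whether inlining a given callee actually helps, which is why tuning it takes so much longer than writing it.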
I did a survey of a bunch of compilers, mostly JIT compilers, to see what their inlining heuristics look like. I also read (skimmed) some papers to see what those folks had to say. I wonder if they agree.
This post was a long time coming. I started working on it about five years ago but...