The Scheduler | Internals for Internsπ<br>Understanding the Linux Kernel (4 of 4)<br>βΌ1.<br>The Linux Kernel Startup<br>2.<br>System Calls<br>3.<br>Memory Manager<br>4.<br>The Scheduler<br>You are here
In the previous article<br>we looked at how the kernel gives every process its own private view of memory. But memory is only half of what a process needs to actually run. The other half is the CPU itself β and there are only so many CPUs in a machine, while there are usually hundreds or thousands of things that want to run on them.<br>So somebody has to decide, constantly, who gets a CPU and for how long. That somebody is the scheduler . Every few milliseconds, on every core, the kernel asks itself the same question β of everything that wants to run right now, who runs next? β and the answer has to be fast, fair, and good enough that your text editor stays responsive even while a compile is pegging every core.<br>Let’s take it step by step, because the scheduler has a lot of moving parts. We’ll start with what the scheduler is even scheduling β what a process and a thread really are under the hood. Then we’ll see that Linux doesn’t have just one scheduler but several, stacked as scheduling classes. From there we’ll look at when a running task stops running (it’s more interesting than it sounds), and look at what it costs to switch from one task to another. And finally we’ll get to the heart of it: how the kernel actually decides who’s next, using an algorithm called EEVDF .<br>π A note on scope<br>Everything here is against Linux 7.1 (the scheduler core lives in kernel/sched/, mostly in fair.c and core.c). And it’s a deliberate simplification: the real implementation has far more going on β load balancing across CPUs, group scheduling through cgroups, CPU bandwidth control, NUMA awareness, and countless edge cases β that I’m skipping over to keep the core ideas clear.
So let’s begin where the scheduler itself begins: with the thing it’s actually shuffling on and off the CPU.<br>What the Scheduler Actually Schedules<br>Here’s the first surprise, and it clears up a lot of confusion: the kernel doesn’t schedule “processes” or “threads.” Those are words we use up in user space. Down in the kernel there’s just one kind of schedulable thing β a task_struct (include/linux/sched.h:820), the kernel’s record of one flow of execution, one thing that can sit on a CPU and run instructions.<br>What you call a process and what you call a thread are both, underneath, the exact same kind of object. The only difference is what they share . When you fork() a new process, you get a fresh task_struct that shares nothing with its parent β its own memory, its own file descriptors, its own everything. When you create a thread (with clone() and the right flags) you also get a fresh task_struct, except this one shares the address space, the open files, the signal handlers, and so on with the task that spawned it. So a “multi-threaded process” is really just a bunch of task_structs that happen to point at the same memory.
From the scheduler’s chair none of that matters anyway β it doesn’t know or care who shares what. It just sees a pile of task_structs, some runnable and some not, and picks among the runnable ones. Everything runnable on a CPU is a task_struct, full stop.<br>Now, a task_struct is huge β it holds basically everything the kernel knows about a task β but the scheduler only cares about a sliver of it. Tucked inside every task is a small embedded bundle of scheduling state (called the sched_entity, include/linux/sched.h:575), and that little bundle β not the giant struct around it β is what the scheduler actually reasons about.<br>It’s where the interesting numbers live, things with names like vruntime, vlag, deadline, and slice. Don’t worry about what any of those mean yet β unpacking them is basically the rest of this article. For now just hold onto the shape of it: each runnable task carries a small wad of accounting state, and that’s the part the scheduler reads when it decides who runs next.<br>But “the scheduler” is a bit of a white lie, because there isn’t just one β so before we go further, let’s see who actually gets handed the decision.<br>Scheduling Classes: Who Even Gets Asked First<br>Before we get to the algorithm, there’s a twist: Linux doesn’t actually have one scheduler. It has several, and they’re stacked in a strict pecking order. These stacked schedulers are called scheduling classes , and each one is a self-contained policy for a different kind of workload.<br>The way they cooperate is dead simple. When a CPU needs something to run, the kernel goes down the stack from the top and asks each class in turn, “got anything runnable?” The first one to say yes wins, and everything below it doesn’t even get a vote. So a class only ever runs a task when...