Understanding the Linux Kernel: The Scheduler

valyala1 pts0 comments

The Scheduler | Internals for InternsπŸ“š<br>Understanding the Linux Kernel (4 of 4)<br>β–Ό1.<br>The Linux Kernel Startup<br>2.<br>System Calls<br>3.<br>Memory Manager<br>4.<br>The Scheduler<br>You are here

In the previous article<br>we looked at how the kernel gives every process its own private view of memory. But memory is only half of what a process needs to actually run. The other half is the CPU itself β€” and there are only so many CPUs in a machine, while there are usually hundreds or thousands of things that want to run on them.<br>So somebody has to decide, constantly, who gets a CPU and for how long. That somebody is the scheduler . Every few milliseconds, on every core, the kernel asks itself the same question β€” of everything that wants to run right now, who runs next? β€” and the answer has to be fast, fair, and good enough that your text editor stays responsive even while a compile is pegging every core.<br>Let&rsquo;s take it step by step, because the scheduler has a lot of moving parts. We&rsquo;ll start with what the scheduler is even scheduling β€” what a process and a thread really are under the hood. Then we&rsquo;ll see that Linux doesn&rsquo;t have just one scheduler but several, stacked as scheduling classes. From there we&rsquo;ll look at when a running task stops running (it&rsquo;s more interesting than it sounds), and look at what it costs to switch from one task to another. And finally we&rsquo;ll get to the heart of it: how the kernel actually decides who&rsquo;s next, using an algorithm called EEVDF .<br>πŸ“Œ A note on scope<br>Everything here is against Linux 7.1 (the scheduler core lives in kernel/sched/, mostly in fair.c and core.c). And it&rsquo;s a deliberate simplification: the real implementation has far more going on β€” load balancing across CPUs, group scheduling through cgroups, CPU bandwidth control, NUMA awareness, and countless edge cases β€” that I&rsquo;m skipping over to keep the core ideas clear.

So let&rsquo;s begin where the scheduler itself begins: with the thing it&rsquo;s actually shuffling on and off the CPU.<br>What the Scheduler Actually Schedules<br>Here&rsquo;s the first surprise, and it clears up a lot of confusion: the kernel doesn&rsquo;t schedule &ldquo;processes&rdquo; or &ldquo;threads.&rdquo; Those are words we use up in user space. Down in the kernel there&rsquo;s just one kind of schedulable thing β€” a task_struct (include/linux/sched.h:820), the kernel&rsquo;s record of one flow of execution, one thing that can sit on a CPU and run instructions.<br>What you call a process and what you call a thread are both, underneath, the exact same kind of object. The only difference is what they share . When you fork() a new process, you get a fresh task_struct that shares nothing with its parent β€” its own memory, its own file descriptors, its own everything. When you create a thread (with clone() and the right flags) you also get a fresh task_struct, except this one shares the address space, the open files, the signal handlers, and so on with the task that spawned it. So a &ldquo;multi-threaded process&rdquo; is really just a bunch of task_structs that happen to point at the same memory.

From the scheduler&rsquo;s chair none of that matters anyway β€” it doesn&rsquo;t know or care who shares what. It just sees a pile of task_structs, some runnable and some not, and picks among the runnable ones. Everything runnable on a CPU is a task_struct, full stop.<br>Now, a task_struct is huge β€” it holds basically everything the kernel knows about a task β€” but the scheduler only cares about a sliver of it. Tucked inside every task is a small embedded bundle of scheduling state (called the sched_entity, include/linux/sched.h:575), and that little bundle β€” not the giant struct around it β€” is what the scheduler actually reasons about.<br>It&rsquo;s where the interesting numbers live, things with names like vruntime, vlag, deadline, and slice. Don&rsquo;t worry about what any of those mean yet β€” unpacking them is basically the rest of this article. For now just hold onto the shape of it: each runnable task carries a small wad of accounting state, and that&rsquo;s the part the scheduler reads when it decides who runs next.<br>But &ldquo;the scheduler&rdquo; is a bit of a white lie, because there isn&rsquo;t just one β€” so before we go further, let&rsquo;s see who actually gets handed the decision.<br>Scheduling Classes: Who Even Gets Asked First<br>Before we get to the algorithm, there&rsquo;s a twist: Linux doesn&rsquo;t actually have one scheduler. It has several, and they&rsquo;re stacked in a strict pecking order. These stacked schedulers are called scheduling classes , and each one is a self-contained policy for a different kind of workload.<br>The way they cooperate is dead simple. When a CPU needs something to run, the kernel goes down the stack from the top and asks each class in turn, &ldquo;got anything runnable?&rdquo; The first one to say yes wins, and everything below it doesn&rsquo;t even get a vote. So a class only ever runs a task when...

rsquo scheduler kernel linux actually task

Related Articles