Unix GC Remastered

mananaysiempre1 pts0 comments

Unix GC Remastered – Moe's VR blog<br>Introduction<br>The AF_UNIX garbage collector is an interesting piece of the kernel. It exists because sockets can be sent with SCM_RIGHTS but they can become unreachable from user-space while still being kept alive by the kernel, which is not memory efficient; in this situation, the garbage collector intervenes to free them. Not long ago, the subsystem was rewritten from scratch on top of a graph/Strongly-Connected-Components model; but it is still bug prone.<br>This post walks the rewrite end-to-end, and discusses a Use-After-Free bug.<br>AF_UNIX Garbage Collector — Background<br>A per-subsystem garbage collector is responsible for reclaiming kernel objects that can no longer be reached through user-space handles. For AF_UNIX, the entry point is unix_gc():<br>static DECLARE_WORK(unix_gc_work, __unix_gc);

void unix_gc(void)<br>WRITE_ONCE(gc_in_progress, true);<br>queue_work(system_dfl_wq, &unix_gc_work);

Its real body is __unix_gc():<br>static void __unix_gc(struct work_struct *work)<br>struct sk_buff_head hitlist;<br>struct sk_buff *skb;

spin_lock(&unix_gc_lock);

if (!unix_graph_maybe_cyclic) {<br>spin_unlock(&unix_gc_lock);<br>goto skip_gc;

__skb_queue_head_init(&hitlist);

if (unix_graph_grouped)<br>unix_walk_scc_fast(&hitlist);<br>else<br>unix_walk_scc(&hitlist);

spin_unlock(&unix_gc_lock);

skb_queue_walk(&hitlist, skb) {<br>if (UNIXCB(skb).fp)<br>UNIXCB(skb).fp->dead = true;

__skb_queue_purge_reason(&hitlist, SKB_DROP_REASON_SOCKET_CLOSE);<br>skip_gc:<br>WRITE_ONCE(gc_in_progress, false);

The unix_sock structure<br>struct unix_sock {<br>/* WARNING: sk has to be the first member */<br>struct sock sk; /* inheritance */<br>struct unix_address *addr; /* bound name */<br>struct path path; /* filesystem path if bound */<br>struct mutex iolock, bindlock;<br>struct sock *peer; /* connected peer */<br>struct list_head link;<br>atomic_long_t inflight; /* [1] SCM_RIGHTS fd count */<br>/* ... */<br>struct sk_buff *oob_skb;<br>};

The critical field for GC is inflight ([1] ). A socket is &ldquo;in flight&rdquo; when its struct file * is riding as SCM_RIGHTS payload — sent by process A, not yet accepted by process B. Each time it is sent, inflight is incremented; each time it is received, inflight is decremented. The GC is looking for sockets for which file_count == inflight : the only remaining references are the ones trapped in other sockets&rsquo; receive queues, i.e. no user-space handle can ever reach them again.<br>The LWN &ldquo;AF_UNIX GC rework&rdquo; article puts it more concisely:<br>Let&rsquo;s say we send a fd of AF_UNIX socket A to B and vice versa and close() both sockets. When created, each socket&rsquo;s struct file initially has one reference. After the fd exchange, both refcounts are bumped up to 2. Then, close() decreases both to 1. From this point on, no one can touch the file/socket. However, the struct file has one refcount and thus never calls the release() function of the AF_UNIX socket. That&rsquo;s why we need to track all inflight AF_UNIX sockets and run garbage collection.

The kernel maintains a global unix_tot_inflight counter, incremented on every inflight transition and decremented on every accept.<br>When GC runs<br>There are two triggers:<br>Too many inflight sockets:<br>if (READ_ONCE(unix_tot_inflight) > UNIX_INFLIGHT_TRIGGER_GC &&<br>!READ_ONCE(gc_in_progress))<br>unix_gc();

(UNIX_INFLIGHT_TRIGGER_GC == 16000.)

A socket close, if anything is inflight:<br>static const struct proto_ops unix_stream_ops = {<br>.family = PF_UNIX,<br>.owner = THIS_MODULE,<br>.release = unix_release,<br>/* ... */<br>};

static void unix_release_sock(struct sock *sk, int embrion)<br>/* ... */<br>if (READ_ONCE(unix_tot_inflight))<br>unix_gc();

The Old GC<br>The pre-2024 collector is well described in the Google P0 post &ldquo;The quantum state of Linux kernel garbage collection&rdquo;, which covers both the algorithm and a 2021 Android in-the-wild exploit. That post is the recommended companion read; here is just the one-line summary: the old GC walked the inflight graph, marked cycles, and checked inflight != refcount to decide whether each cycle was collectable.<br>Here&rsquo;s a nice mermaid diagram:

The New GC<br>From the GC Rework announcement:<br>[It] replaces the current GC implementation that locks each inflight socket&rsquo;s receive queue and requires trickiness in other places. The new GC does not lock each socket&rsquo;s queue to minimise its effect and tries to be lightweight if there is no cyclic reference or no update in the shape of the inflight fd graph.

Graph representation<br>Each inflight socket becomes a vertex ; each backing struct file * carried in an SCM_RIGHTS cmsg becomes a directed edge (predecessor → successor).<br>Example — send A to C, C to D, B to D. Three inflight sockets (A, B, C — not D), giving the graph:

Tarjan&rsquo;s algorithm then partitions this graph into strongly connected components. Why SCCs? For any directed graph, any SCC of more than one vertex necessarily contains at least one cycle:

A cycle is a necessary but not sufficient condition for a vertex to be collectable:...

struct inflight socket rsquo af_unix sockets

Related Articles