Swap tables, flash-friendly swap, swap_ops, and more [LWN.net]
By Jonathan Corbet
May 18, 2026
LSFMM+BPF
The kernel's swap subsystem is charged with managing anonymous pages in secondary storage when those pages are (hopefully) not being used and the memory they occupy is needed elsewhere. This long-unloved subsystem has seen a resurgence of developer interest in recent times, so it is not surprising that it was the topic of three separate sessions in the memory-management track at the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit. Two of those sessions were concerned with improving the performance and maintainability of the swap code, while one (shared with the storage track) was about how swapping could be friendlier to solid-state storage devices.
Status and roadmap
The first session was a breakneck-paced presentation from Kairui Song on recent changes in the swap subsystem and what is coming next. Song began by describing his work introducing the swap table and removing a lot of swap-subsystem complexity; see this article and its successor for details on this work. Before his changes were merged for 7.0, the swap subsystem incurred an overhead of between three and 11 bytes per page; that overhead is now reduced to between two and ten bytes. That news was greeted by applause in the room.
Song is not done, though; he intends to cut the static overhead to zero bytes, albeit still with a maximum of ten. His goal to cap that overhead at eight bytes will not be realized in the short term because refault tracking for the memory resource controller requires more data. In the long term, he still hopes to cut the maximum overhead to three bytes per page.
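To put those per-page figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes 4KB pages and that the overhead is charged once for each page-sized slot of configured swap, neither of which was stated in the session, and the function name is made up for illustration.

```python
# Rough estimate of swap metadata cost; all names here are illustrative.
PAGE_SIZE = 4096          # assumption: 4 KiB pages
MIB = 1 << 20
GIB = 1 << 30

def swap_metadata_bytes(swap_bytes, bytes_per_page, page_size=PAGE_SIZE):
    """Metadata needed to track swap_bytes of swap at a given per-page cost."""
    slots = swap_bytes // page_size   # one slot per page-sized unit of swap
    return slots * bytes_per_page

if __name__ == "__main__":
    swap = 64 * GIB
    for label, per_page in [
        ("maximum before 7.0", 11),
        ("maximum now", 10),
        ("minimum before 7.0", 3),
        ("minimum now", 2),
    ]:
        cost = swap_metadata_bytes(swap, per_page)
        print(f"{label:>20}: {cost / MIB:.0f} MiB")
```

Under those assumptions, a 64GB swap area has about 16.8 million slots, so each byte of per-page overhead costs 16MB of memory; the worst case drops from 176MB to 160MB, and the best case from 48MB to 32MB.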
The need for some operations to bypass the swap cache has been removed, and most of the swap-oriented helpers are now folio based. Most operations only need the folio lock now; there are opportunities, he said, to optimize further by applying some lockless algorithms. Work to unify folio allocation with the swap cache is still in progress. Currently, anonymous and shared-memory folios come with their own allocation logic that may bypass readahead; he described this code as a long, complex, and racy fallback loop. He is working to replace it with a single allocation helper.
Other work is aimed at letting the system make better use of the swap cache; better readahead support is an important step in that direction. The zram subsystem can take advantage of it now but, he said, whether that is beneficial is not entirely clear. It may be that zram is fast enough already.
Swapping I/O is asynchronous and takes time; that means that there can be a long delay between the onset of memory pressure and the completion of the I/O that allows that pressure to be relieved. By the time that happens, it may turn out that the system has overshot and swapped out more pages than really needed. This could be helped by immediately dropping pages from the swap cache once writeout has completed. He is not sure why that is not always done now; more research is needed there.
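The overshoot effect can be illustrated with a deliberately simplified toy model. This is not how the kernel's reclaim code works; it assumes a hypothetical reclaimer that submits a fixed batch of pages each tick and only counts memory as freed once the corresponding I/O completes a fixed number of ticks later. All names and numbers are invented for the example.

```python
from collections import deque

def simulate_reclaim(deficit, batch, latency):
    """Toy model: return how many pages get submitted for writeout
    before `deficit` pages have actually been freed."""
    in_flight = deque()   # entries are (completion_tick, pages)
    freed = 0
    submitted = 0
    tick = 0
    while freed < deficit:
        # Retire any writes whose I/O has finished by this tick.
        while in_flight and in_flight[0][0] <= tick:
            freed += in_flight.popleft()[1]
        if freed >= deficit:
            break
        # The reclaimer sees only completed writes, so pressure still
        # looks unrelieved and it submits another batch.
        in_flight.append((tick + latency, batch))
        submitted += batch
        tick += 1
    return submitted

if __name__ == "__main__":
    for latency in (1, 5):
        total = simulate_reclaim(deficit=100, batch=10, latency=latency)
        print(f"latency={latency}: submitted {total}, overshoot {total - 100}")
```

In this model, a one-tick I/O latency results in exactly the 100 needed pages being written, while a five-tick latency results in 140 pages being submitted; the extra 40 are the pages that were queued while earlier writes were still in flight.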
There are a number of other problems yet to be solved. Swapping of PMD-level huge pages is not as efficient as it could be. Readahead can end up bringing in pages used for hibernation, which is wasteful but not a huge problem, though the workaround is ugly. He is contemplating adding a special bit to mark pages reserved for hibernation. There are users who would like to be able to resize swap areas on the fly; that should be practical to implement now.
Another problem arises when both anonymous and shared-memory (shmfs) folios are swapped to the same device. If shmfs-backed transparent huge pages (THPs) are being swapped, they can end up overlapping an anonymous page's slot; when that happens now, the offending folio is simply dropped. The problem will worsen, though, if readahead gains support for THPs. He is contemplating creating a new swap-table type to address this problem. Matthew Wilcox said the problem may come down to a confusion of logical (within the owning process's address space) and physical readahead; we are doing something wrong somewhere, he suggested.
Song is looking into compaction of the swap table. The system manages...