Free-threading vs. the GIL in mod_WSGI 6.0.0 – Graham Dumpleton

rbanffy1 pts0 comments

Free-threading vs the GIL in mod_wsgi 6.0.0 - Graham Dumpleton

The previous post in this series walked through tuning WSGISwitchInterval to claw back throughput on a multi-threaded mod_wsgi daemon group whose workload is CPU-bound Python. Tightening the switch interval recovered most of the throughput a two-process, five-thread shape had lost compared with the ten-process, one-thread baseline. What it did not change was per-process CPU usage. Each process stayed pinned at about one core regardless of how the switch interval was tuned, because that ceiling is the GIL itself.

This post is about what happens when that ceiling is removed. PEP 703 free-threading provides a CPython build with no GIL at all, and mod_wsgi 6.0.0 exposes the opt-in for it through WSGIFreeThreading, the second directive covered in this series. I have rerun the same benchmark workload on a free-threaded Python with that directive on. The interesting metric is now CPU usage per process.

Why CPU usage is the new focus

Throughput and response time were the headline metrics for tracking the effect of switch interval changes. Both are still relevant here. But the comparison turns on something different now. With the GIL, threads in a process serialise on the lock, and the process consumes at most one core regardless of how many threads you give it. With free-threading there is no GIL, and the process can use as many cores as its threads can actually fill. If the workload is CPU-bound Python, CPU usage per process is what tells you whether the runtime is making real use of the hardware.

What disappears from the toolkit

GIL wait time was the central diagnostic in the switch-interval post. Under free-threading there is no GIL, so there is no GIL wait to measure. The histogram that showed the convoy bumps at multiples of the switch interval in the previous post goes flat by definition. What replaces it as positive evidence that the workload is genuinely parallel is the CPU usage number itself. Multiple cores per process is the new signal.

A reminder of what free-threading asks of you

The free-threading post earlier in this series went into detail on what free-threading actually requires. Briefly: it is a separate CPython build (typically named python3.14t on systems that distribute it), C extensions must declare Py_mod_gil = Py_MOD_GIL_NOT_USED or the runtime quietly re-enables the GIL for the whole process, and application code must handle concurrent execution correctly rather than being incidentally safe under GIL atomicity. None of that has changed since I wrote that post. The metrics below assume those prerequisites are satisfied and show what the upside looks like when they are.

The benchmark setup

The workload is the same as in the switch-interval post. Each request spends approximately 3 ms running Python code on the CPU, plus a 1 ms simulated wait, and returns a 1 KB response body. Concurrency 10, same host, same Apache, same WSGI handler. The only changes are that Python is the free-threaded build, mod_wsgi has WSGIFreeThreading On configured on the daemon group, and two configurations are exercised: two processes with five threads each (matching the comparison from the previous post), and one process with ten threads (a configuration that has no point under the GIL but lights up under free-threading).

Comparison: two processes, five threads each

WSGIDaemonProcess my-app processes=2 threads=5<br>WSGIFreeThreading On

Config<br>rpm<br>response<br>CPU/proc<br>CPU total<br>GIL p95

10 × 1 GIL (baseline)<br>134k<br>4 ms<br>0.66 cores<br>6.6 cores<br>none

2 × 5 GIL, default 5 ms<br>37k<br>16 ms<br>0.95 cores<br>1.9 cores<br>13 ms

2 × 5 GIL, 0.1 ms tuned<br>121k<br>5 ms<br>0.90 cores<br>1.8 cores<br>1 ms

2 × 5 free-threading<br>131k<br>4 ms<br>3.3 cores<br>6.6 cores<br>n/a

Throughput jumps to 131k rpm, almost matching the ten-process baseline of 134k, and well past what 0.1 ms switch interval tuning could achieve. Response time is back to 4 ms.

The row that has changed in character is CPU per process. Under the GIL each process was pinned at about one core no matter what we did with the switch interval. Free-threading lifts that ceiling, and each process is now consuming about 3.3 cores. Total CPU is 6.6 cores, the same as the ten-process baseline, but with one fifth the processes.

The GIL p95 column has no value to report any more. The histogram that showed contention bumps for missed switch-interval cycles is now flat. There is no GIL to schedule and no wait to measure.

Comparison: one process, ten threads

WSGIDaemonProcess my-app processes=1 threads=10<br>WSGIFreeThreading On

Under the GIL this configuration would not really make sense. The threads would all queue for the one GIL, the process would cap at about one core, and throughput would likely be cut by half or more compared with the ten-process baseline. The exact figure depends on the workload and how the switch interval is set, but the shape is clear: one process with ten threads on a CPU-bound workload is not a configuration the GIL...

process free cores threading threads switch

Related Articles