How jemalloc cut our production memory by 47%
Refine's proofreading API spent most of April 2026 pinned against its 1 GiB<br>memory cap. The cause was not a leak in our code. It was glibc's allocator<br>retaining freed memory under a bursty asynchronous workload, and the fix was to<br>swap the allocator to jemalloc. This is the investigation that found it and the<br>production result.
47.5%<br>reduction in average production RSS<br>After swapping the container allocator from glibc to jemalloc. Sustained across the first overnight cycle.
Executive summary
Refine's API is a FastAPI service on a single async uvicorn worker,<br>deployed to Azure Container Apps with replicas at 1 GiB each.<br>For most of April 2026 its memory chart sat at the cap: peaks pinned at 1 GiB,<br>average in the 850–1000 MiB band, the floor never dropping between bursts.
The investigation had two layers. An initial fix on April 21 — replacing a<br>four-worker Gunicorn process with a single async uvicorn worker — cut average<br>RSS by 620 MiB in an hour and resolved the immediate incident. It was the<br>correct architecture, but it only divided a deeper allocator problem by four.<br>The fragmentation returned at smaller scale within sixteen days.
The underlying cause was glibc's malloc holding freed pages after bursts of<br>concurrent work rather than returning them to the kernel. No malloc-tuning<br>environment variable closed the gap on this workload. Swapping the allocator to<br>jemalloc via LD_PRELOAD did: a controlled benchmark on the production image<br>showed a 5.7× difference in steady-state RSS between glibc and jemalloc under<br>identical load. In production, average RSS fell 47.5%, from 873 MB to 458 MB,<br>sustained across the first overnight cycle. The shipped change is a 14-line<br>Dockerfile diff.
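When an allocator is swapped with LD_PRELOAD, it helps to confirm that the running process actually loaded it. The sketch below is one way to check this on Linux, by looking for a libjemalloc shared object among the process's mapped files. The function name `preloaded_allocator` and the sample paths are illustrative assumptions, not part of the shipped change.

```python
import os
from pathlib import Path


def preloaded_allocator(maps_text: str) -> str:
    """Return 'jemalloc' if a libjemalloc shared object is mapped, else 'default'.

    maps_text is the content of /proc/<pid>/maps. File-backed mappings carry
    the mapped file's path as the sixth whitespace-separated field.
    """
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) >= 6 and "libjemalloc" in os.path.basename(fields[5]):
            return "jemalloc"
    return "default"


if __name__ == "__main__":
    maps_path = Path("/proc/self/maps")  # Linux only
    if maps_path.exists():
        print("LD_PRELOAD =", os.environ.get("LD_PRELOAD", "<unset>"))
        print("allocator  =", preloaded_allocator(maps_path.read_text()))
```

Run inside the container, this reports whether the preload took effect for the Python process itself. A preload that silently fails, for example because of a wrong library path, would show up as `default`.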
The workload and the symptom
The API is a mostly-I/O workload: request validation, database and<br>Redis reads, blob uploads, and a fan-out of<br>long-running Server-Sent Events streams that push processing progress back to<br>the browser.
For most of April 2026 the production memory chart showed the same shape in any<br>30-day window: peaks pinned at the cap, average in the 850–1000 MiB band, and a<br>floor that never dropped between bursts. Each deploy reset RSS briefly; within a<br>day or two it was back at 95% of the cap.
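To make "95% of the cap" concrete, here is a minimal sketch that reads a process's resident set size from `/proc/<pid>/status` and expresses it as a fraction of the 1 GiB cap. It assumes Linux and measures one process's RSS, which is not necessarily the same figure as the platform's container memory metric. The helper names `rss_mib` and `cap_fraction` are hypothetical.

```python
from pathlib import Path

CAP_MIB = 1024  # the 1 GiB replica cap described above


def rss_mib(status_text: str) -> float:
    """Extract VmRSS from /proc/<pid>/status text and convert it to MiB."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            kib = int(line.split()[1])  # the kernel labels this "kB"; it is KiB
            return kib / 1024
    raise ValueError("VmRSS not found")


def cap_fraction(rss: float, cap: float = CAP_MIB) -> float:
    """Fraction of the memory cap currently occupied by resident pages."""
    return rss / cap


if __name__ == "__main__":
    status_path = Path("/proc/self/status")  # Linux only
    if status_path.exists():
        rss = rss_mib(status_path.read_text())
        print(f"RSS {rss:.0f} MiB = {cap_fraction(rss):.0%} of cap")
```

Sampling this on a timer and plotting the minimum between bursts is one way to see whether the floor ever drops.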
The 30-day daily aggregate, in MiB:
| Date range (2026) | Peak/day | Avg/day | Notes |
| --- | --- | --- | --- |
| 04/06 → 04/21 16:00 UTC | 1085–1100 | 957–1042 | 15 straight days at the cap |
| 04/21 17:00 UTC ↓ | 457 | 409 | −620 MiB step in one hour |
| 04/22 → 05/04 | 580–760 | 490–676 | Stable after a contributing fix |
| 05/05 → 05/06 | 957–977 | 718–844 | Climbing back to the cap |
Left unaddressed, this pattern ends in an OOMKill on the next larger-than-usual<br>burst. Addressed incorrectly, it is masked for a few weeks and then returns.<br>Both happened.
The first fix reduced the symptom, not the cause
The first acute incident was on 2026-04-21 at 06:28 UTC, when Front Door's<br>origin-health alert fired:
OriginHealthPercentage = 84.44% on prod-origins (threshold:
The diagnostic number from that morning: a /health endpoint that does nothing<br>but return a small Pydantic object took 36 seconds to respond. That is not<br>endpoint slowness. When a no-op endpoint takes that long, the kernel's CFS<br>scheduler is throttling the entire cgroup.
The container at the time ran gunicorn --workers 4 on a 1 GiB<br>budget: four full Python/FastAPI processes per replica. Each worker reached ~250 MiB after init, four of them filled the<br>GiB exactly, and the container lived permanently at the OOM edge. When any<br>single worker saturated the vCPU, CFS throttled the whole cgroup; the idle<br>workers could not run Python bytecode either, so /health waited 36 seconds for<br>its scheduling slice. Probe failures cascaded into container kills, kills<br>cascaded into cold-starts of all four workers competing for the same<br>vCPU, and CPU ramped from 4% to 97% in 60 seconds — faster than KEDA's<br>40–60 s reactive-scaling cycle could absorb.
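CFS throttling of this kind can usually be confirmed from inside the container. The sketch below assumes cgroup v2, where `cpu.stat` exposes `nr_periods` and `nr_throttled`, and computes the share of scheduling periods in which the cgroup was throttled. The function name `throttle_ratio` and the file location in the usage section are assumptions for illustration. The article's own diagnosis rested on the /health latency, not on this file.

```python
from pathlib import Path


def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    cpu_stat = Path("/sys/fs/cgroup/cpu.stat")  # typical cgroup v2 location
    if cpu_stat.exists():
        print(f"throttled in {throttle_ratio(cpu_stat.read_text()):.1%} of periods")
```

A ratio near zero means the cgroup rarely hits its CPU quota. A ratio that climbs while a trivial endpoint slows down is consistent with the whole cgroup being starved rather than with slow application code.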
The same-day fix was deliberately small: drop Gunicorn and run plain uvicorn<br>with one async worker.
```dockerfile
CMD ["uvicorn", "main:app", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--timeout-keep-alive", "5", \
     "--timeout-graceful-shutdown", "580", \
     "--access-log"]
```

The reasoning was straightforward. Container Apps already supervises the<br>process, so Gunicorn's worker manager is redundant with replica restart. Async<br>uvicorn handles concurrency through the event loop; multi-worker only helps with<br>CPU-bound sync code or crash isolation, neither of which matched a<br>lightweight-reads-and-streams workload.
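The claim that one async worker is enough for I/O-bound traffic can be illustrated with plain asyncio. The following is a toy sketch, not the service's code: `fake_io_request` stands in for a database read or blob upload, and the request count and delay are arbitrary.

```python
import asyncio
import time


async def fake_io_request(delay_s: float) -> str:
    # Awaiting hands control back to the event loop, so other requests
    # make progress while this one waits on (simulated) I/O.
    await asyncio.sleep(delay_s)
    return "ok"


async def serve_concurrently(n_requests: int, delay_s: float) -> tuple[int, float]:
    """Run n_requests simulated I/O calls on one event loop and time them."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_io_request(delay_s) for _ in range(n_requests))
    )
    return len(results), time.perf_counter() - start


if __name__ == "__main__":
    handled, elapsed = asyncio.run(serve_concurrently(100, 0.05))
    print(f"{handled} requests in {elapsed:.2f}s on one event loop")
```

Handled one after another, 100 waits of 50 ms would take about 5 seconds. On a single event loop they overlap, so the total stays close to one wait. The same does not hold for CPU-bound handlers, which is the case where extra workers help.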
The deploy went out at 17:00 UTC. Average RSS dropped from 1030 MiB at 16:00 to<br>410 MiB at 17:00, a −620 MiB step in one hour. The 67 ContainerTerminated<br>events with reason='ProbeFailure' over the previous 10 days fell to single<br>digits within 48 hours. The Front Door alert never fired again.
This was the correct architectural change, but it was not the fix to the<br>underlying bug. It removed a 4× multiplier on a deeper allocator problem....