Jemalloc cut our production memory by 47%

Refine's proofreading API spent most of April 2026 pinned against its 1 GiB memory cap. The cause was not a leak in our code. It was glibc's allocator retaining freed memory under a bursty asynchronous workload, and the fix was to swap the allocator to jemalloc. This is the investigation that found it and the production result.

47.5% reduction in average production RSS after swapping the container allocator from glibc to jemalloc, sustained across the first overnight cycle.

Executive summary

Refine's API is a FastAPI service on a single async uvicorn worker, deployed to Azure Container Apps with replicas at 1 GiB each. For most of April 2026 its memory chart sat at the cap: peaks pinned at 1 GiB, average in the 850–1000 MiB band, the floor never dropping between bursts.

The investigation had two layers. An initial fix on April 21 — replacing a four-worker Gunicorn process with a single async uvicorn worker — cut average RSS by 620 MiB in an hour and resolved the immediate incident. It was the correct architecture, but it only divided a deeper allocator problem by four. The fragmentation returned at smaller scale within sixteen days.

The underlying cause was glibc's malloc holding freed pages after bursts of concurrent work rather than returning them to the kernel. No malloc-tuning environment variable closed the gap on this workload. Swapping the allocator to jemalloc via LD_PRELOAD did: a controlled benchmark on the production image showed a 5.7× difference in steady-state RSS between glibc and jemalloc under identical load. In production, average RSS fell 47.5%, from 873 MB to 458 MB, sustained across the first overnight cycle. The shipped change is a 14-line Dockerfile diff.
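One way to confirm that an LD_PRELOAD swap actually took effect is to look for a jemalloc shared object in the process's memory map. The following is a minimal sketch, assuming a Linux container with `/proc` available; the function names `loaded_allocator` and `rss_reduction_pct` are hypothetical helpers, not part of the shipped change, and the second one only re-derives the reduction from the two production averages quoted above.

```python
import os


def loaded_allocator(maps_text: str) -> str:
    """Classify the allocator from the text of /proc/<pid>/maps.

    Returns "jemalloc" if any mapped file name contains "jemalloc",
    otherwise "default" (on a glibc image that means glibc malloc).
    """
    for line in maps_text.splitlines():
        fields = line.split()
        # The pathname is the sixth column; anonymous mappings have only five.
        if len(fields) >= 6 and "jemalloc" in os.path.basename(fields[5]):
            return "jemalloc"
    return "default"


def rss_reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as f:
            print("allocator:", loaded_allocator(f.read()))
    except FileNotFoundError:
        print("allocator: unknown (no /proc on this platform)")
    # Averages quoted in the text: 873 before, 458 after.
    print(f"reduction: {rss_reduction_pct(873, 458):.1f}%")
```

Run inside the container, the first line should report `jemalloc` only when the preload is active for that process; a plain Python interpreter started without the preload reports `default`.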

The workload and the symptom

The API is a mostly-I/O workload: request validation, database and Redis reads, blob uploads, and a fan-out of long-running Server-Sent Events streams that push processing progress back to the browser.

For most of April 2026 the production memory chart showed the same shape in any 30-day window: peaks pinned at the cap, average in the 850–1000 MiB band, and a floor that never dropped between bursts. Each deploy reset RSS briefly; within a day or two it was back at 95% of the cap.
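The "percentage of the cap" figure can be approximated from inside the process by reading `VmRSS` from `/proc/self/status`. This is a sketch under assumptions: the helper names are hypothetical, the 1024 MiB default mirrors the 1 GiB replica cap described above, and the platform's own memory chart may count memory differently (for example, including page cache), so the numbers will not match it exactly.

```python
def rss_mib(status_text: str) -> float:
    """Extract VmRSS from the text of /proc/<pid>/status, in MiB."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The kernel reports this field in kB (units of 1024 bytes).
            return int(line.split()[1]) / 1024.0
    raise ValueError("VmRSS field not found")


def cap_fraction(rss: float, cap_mib: float = 1024.0) -> float:
    """Fraction of the memory cap currently used by resident memory."""
    return rss / cap_mib


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            current = rss_mib(f.read())
        print(f"RSS {current:.0f} MiB = {cap_fraction(current):.0%} of cap")
    except FileNotFoundError:
        print("no /proc on this platform")
```

Sampling this value on a timer and logging it alongside request counts is one simple way to see whether the floor drops between bursts.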

The 30-day daily aggregate, in MiB:

| Date range (2026) | Peak/day | Avg/day | Notes |
|---|---|---|---|
| 04/06 → 04/21 16:00 UTC | 1085–1100 | 957–1042 | 15 straight days at the cap |
| 04/21 17:00 UTC ↓ | 457 | 409 | −620 MiB step in one hour |
| 04/22 → 05/04 | 580–760 | 490–676 | Stable after a contributing fix |
| 05/05 → 05/06 | 957–977 | 718–844 | Climbing back to the cap |

Left unaddressed, this pattern ends in an OOMKill on the next larger-than-usual burst. Addressed incorrectly, it is masked for a few weeks and then returns. Both happened.

The first fix reduced the symptom, not the cause

The first acute incident was on 2026-04-21 at 06:28 UTC, when Front Door's origin-health alert fired:

OriginHealthPercentage = 84.44% on prod-origins (threshold: …)

The diagnostic number from that morning: a /health endpoint that does nothing but return a small Pydantic object took 36 seconds to respond. That is not endpoint slowness. When a no-op endpoint takes that long, the kernel's CFS scheduler is throttling the entire cgroup.

The container at the time ran gunicorn --workers 4 on a 1 GiB budget: four full Python/FastAPI processes per replica. Each worker reached ~250 MiB after init, four of them filled the GiB exactly, and the container lived permanently at the OOM edge. When any single worker saturated the vCPU, CFS throttled the whole cgroup; the idle workers could not run Python bytecode either, so /health waited 36 seconds for its scheduling slice. Probe failures cascaded into container kills, kills cascaded into cold-starts of all four workers competing for the same vCPU, and CPU ramped from 4% to 97% in 60 seconds — faster than KEDA's 40–60 s reactive-scaling cycle could absorb.
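Both halves of this diagnosis can be checked with a few lines of arithmetic. The sketch below assumes a cgroup v2 host, where throttling counters are exposed in `cpu.stat` as `nr_periods`, `nr_throttled`, and `throttled_usec`; the helper names and the sample counter values in the demo are illustrative, not measurements from the incident.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the "key value" lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: dict) -> float:
    """Share of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


def worker_headroom_mib(workers: int, per_worker_mib: int, cap_mib: int = 1024) -> int:
    """Memory left under the cap once every worker reaches its post-init size."""
    return cap_mib - workers * per_worker_mib


if __name__ == "__main__":
    # Illustrative counters only; read the real ones from
    # /sys/fs/cgroup/cpu.stat inside the container.
    sample = "usage_usec 9000000\nnr_periods 1000\nnr_throttled 850\nthrottled_usec 42000000\n"
    print(f"throttled in {throttled_ratio(parse_cpu_stat(sample)):.0%} of periods")
    # Four workers at ~250 MiB versus one worker, both under a 1 GiB cap.
    print("headroom, 4 workers:", worker_headroom_mib(4, 250), "MiB")
    print("headroom, 1 worker:", worker_headroom_mib(1, 250), "MiB")
```

With the figures from the text, four workers at roughly 250 MiB leave about 24 MiB under a 1024 MiB cap, which is why any burst pushed the container to the OOM edge.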

The same-day fix was deliberately small: drop Gunicorn and run plain uvicorn with one async worker.

    CMD ["uvicorn", "main:app", \
         "--host", "0.0.0.0", \
         "--port", "8000", \
         "--timeout-keep-alive", "5", \
         "--timeout-graceful-shutdown", "580", \
         "--access-log"]

The reasoning was straightforward. Container Apps already supervises the process, so Gunicorn's worker manager is redundant with replica restart. Async uvicorn handles concurrency through the event loop; multi-worker only helps with CPU-bound sync code or crash isolation, neither of which matched a lightweight-reads-and-streams workload.
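The claim that one event loop is enough for an I/O-bound workload can be illustrated with the standard library alone. This is a toy sketch, not the service's code: `fake_io_request` is a hypothetical stand-in for a database read or an upload, with `asyncio.sleep` simulating the time spent waiting on the network.

```python
import asyncio
import time


async def fake_io_request(delay_s: float) -> float:
    """Simulate a request that spends its time waiting on I/O."""
    await asyncio.sleep(delay_s)  # yields to the event loop while waiting
    return delay_s


async def serve_concurrently(n: int, delay_s: float) -> float:
    """Run n simulated requests on one event loop; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_io_request(delay_s) for _ in range(n)))
    return time.perf_counter() - start


if __name__ == "__main__":
    n, delay_s = 50, 0.1
    elapsed = asyncio.run(serve_concurrently(n, delay_s))
    print(f"{n} requests of {delay_s}s each took {elapsed:.2f}s on one loop")
    print(f"handled one at a time they would take about {n * delay_s:.1f}s")
```

Because the waits overlap, total time stays close to a single request's latency rather than the sum. The same sketch with CPU-bound work in place of `asyncio.sleep` would not overlap, which is the case where extra workers do help.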

The deploy went out at 17:00 UTC. Average RSS dropped from 1030 MiB at 16:00 to 410 MiB at 17:00, a −620 MiB step in one hour. The 67 ContainerTerminated events with reason='ProbeFailure' over the previous 10 days fell to single digits within 48 hours. The Front Door alert never fired again.

This was the correct architectural change, but it was not the fix to the underlying bug. It removed a 4× multiplier on a deeper allocator problem....
