The Windows DLL loader lock: how a Rust thread can hang your JVM



Introduction

Several weeks ago, we encountered a silent, sporadic hang in our Windows CI pipeline. After a deep investigation, we uncovered a deadlock that left processes completely frozen with no ability to extract a Java stack trace.

This blog post walks through our debugging journey and includes low-level details about the Java Virtual Machine's garbage collection, Rust's thread-local storage, the JNI (Java Native Interface) attachment protocol, and a core Windows kernel primitive known as the Loader Lock.

TL;DR:

On Windows, the OS holds the process-wide Loader Lock during thread termination (specifically during Rust's TLS destruction).

TLS destruction triggers jni-rs, which tries to detach the thread from the JVM. This step transitions the thread from "Native" to "VM" state, and because the GC is running, this transition is blocked at the Safepoint Barrier. The Rust thread waits for the GC to unpark it.

Simultaneously, the GC is waiting for a newly spawning Java thread to report in. However, this new thread cannot reach the safepoint; it is blocked in the OS initialization phase, waiting for the Loader Lock (held by the Rust thread).
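The three waits above form a cycle, which is why nothing can make progress. As a rough illustration only, the sketch below models them as a "waits for" graph and walks it to find the loop. The node labels and the function names find_wait_cycle and tldr_wait_graph are made up for this example; nothing here inspects a real process.

```rust
use std::collections::HashMap;

/// Follows "waits for" edges from `start` and returns the participants on the
/// cycle if the walk comes back to `start`. Each participant waits on at most
/// one other participant in this simplified model.
fn find_wait_cycle<'a>(
    waits_for: &HashMap<&'a str, &'a str>,
    start: &'a str,
) -> Option<Vec<&'a str>> {
    let mut path = vec![start];
    let mut current = start;
    // A cycle through `start` can contain at most `waits_for.len()` edges.
    for _ in 0..waits_for.len() {
        // A participant with no outgoing edge is not blocked: no cycle.
        current = *waits_for.get(current)?;
        if current == start {
            return Some(path);
        }
        path.push(current);
    }
    None
}

/// The three wait edges from the TL;DR (labels are illustrative).
fn tldr_wait_graph() -> HashMap<&'static str, &'static str> {
    let mut graph = HashMap::new();
    // The Rust thread holds the Loader Lock and waits at the safepoint barrier.
    graph.insert("rust-worker", "gc");
    // The GC waits for the newly spawning Java thread to report in.
    graph.insert("gc", "new-java-thread");
    // The new Java thread waits for the Loader Lock held by the Rust thread.
    graph.insert("new-java-thread", "rust-worker");
    graph
}
```

Calling find_wait_cycle(&tldr_wait_graph(), "rust-worker") walks all three edges and arrives back at the Rust thread. Removing any single edge breaks the cycle and the function returns None.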

The First Clues: A Local Reproducer and Thread Dumps

Our CI pipeline runs a suite of tests on Linux, macOS and Windows using Azure Pipelines. On Windows, we noticed that some test suites would occasionally hang until the job timed out.

My first reflex was to replicate the issue locally in order to gather more details. After a few attempts, the hang occurred, and I was able to capture a process dump.

With this process dump, I was able to extract native stacks using WinDbg and Java stacks using jhsdb. We found three clues:

The main thread was stuck in GC:

"Time-limited test" #4053 daemon prio=5 tid=0x0000019183c06c30 nid=0x2270 waiting on condition [0x00000019f0afe000]
   java.lang.Thread.State: RUNNABLE
   JavaThread state: _thread_blocked
 - java.lang.Runtime.gc() @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.System.gc() @bci=3, line=1907 (Compiled frame)
 - io.questdb.ServerMain.start(boolean) @bci=66, line=251 (Interpreted frame)
        - locked (a com.questdb.AbstractEntBootstrapTest$EntGriffinServerMain)
 - io.questdb.ServerMain.start() @bci=2, line=239 (Interpreted frame)

It was waiting for all threads to reach a "safepoint", a JVM mechanism for safely pausing threads during VM-level operations, including garbage collection.

Several Rust Tokio worker threads were in their on_thread_stop hook, making a JNI call.

  53  Id: 4994.575c Suspend: 0 Teb: 00000019`f07a8000 Unfrozen "tokio-runtime-worker"
 # Call Site
00 ntdll!NtWaitForSingleObject+0x14
01 KERNELBASE!WaitForSingleObjectEx+0x8e
02 jvm!XXX+0x1cb607
03 jvm!XXX+0x5e9b4
04 jvm!XXX+0x624b9
05 jvm!XXX+0x11a06f
06 qdb_ent14818614347342639976!jni::wrapper::jnienv::JNIEnv::call_method_unchecked,jni::wrapper::objects::jmethodid::JMethodID>+0xa681 [C:\w\.cargo\registry\src\index.crates.io-1949cf8c6b5b557f\jni-0.21.1\src\wrapper\macros.rs @ 86]
07 qdb_ent14818614347342639976!qdb_ent::call_method,qdb_ent::call_void_method::closure_env$0>+0xf0 [C:\w\questdb-ent\rust\qdb-ent\src\lib.rs @ 56]
08 qdb_ent14818614347342639976!qdb_ent::call_void_method+0x49 [C:\w\questdb-ent\rust\qdb-ent\src\lib.rs @ 86]
09 qdb_ent14818614347342639976!qdb_ent::tokio::ThreadLifetimeListener::on_thread_stop+0x35 [C:\w\questdb-ent\rust\qdb-ent\src\tokio.rs @ 46]
0a qdb_ent14818614347342639976!qdb_ent::tokio::Java_com_questdb_tokio_TokioRuntime_create::closure$4+0xe [C:\w\questdb-ent\rust\qdb-ent\src\tokio.rs @ 101]

Note: the jvm frames show XXX in place of symbol names because debug symbols for the JVM were not available.

There were several unnamed threads lying around.
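The clues above involve code that runs while a worker thread is shutting down, and the TL;DR points at Rust's TLS destruction in particular. As a minimal, std-only sketch of that mechanism (not the QuestDB or jni-rs code), the example below stores a value with a Drop implementation in thread-local storage and counts how often its destructor runs. ExitGuard and run_worker_and_count_exits are hypothetical names introduced for illustration.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a per-thread resource (for example, a VM
/// attachment) that has to be released when its owning thread exits.
struct ExitGuard {
    exits: Arc<AtomicUsize>,
}

impl Drop for ExitGuard {
    fn drop(&mut self) {
        // Runs on the exiting thread, as part of thread-local destruction.
        self.exits.fetch_add(1, Ordering::SeqCst);
    }
}

thread_local! {
    // Each thread gets its own slot; it starts out empty.
    static GUARD: RefCell<Option<ExitGuard>> = RefCell::new(None);
}

/// Spawns one worker that parks a guard in its TLS slot, joins it, and
/// returns how many guard destructors have run by then.
fn run_worker_and_count_exits() -> usize {
    let exits = Arc::new(AtomicUsize::new(0));
    let for_worker = Arc::clone(&exits);
    let worker = thread::spawn(move || {
        GUARD.with(|slot| {
            *slot.borrow_mut() = Some(ExitGuard { exits: for_worker });
        });
        // The closure returns here while the guard is still alive in TLS.
    });
    worker.join().expect("worker thread panicked");
    exits.load(Ordering::SeqCst)
}
```

The worker never drops the guard explicitly; the runtime does it during thread teardown. Any blocking call made from such a destructor therefore runs at a point where, on Windows, the TL;DR says the Loader Lock is held.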

Aside: What is a Safepoint?

A safepoint is a point in execution where a thread's state is fully describable to the JVM: all object references reside in known locations (registers, stack slots, or heap), and no heap mutation is in flight. The JVM can only perform certain global operations - most notably GC - when all mutator threads are stopped at a safepoint simultaneously. (Since JDK 10, thread-local handshakes allow some operations on individual threads, but GC still requires a global stop.)
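To make the "all mutator threads stopped simultaneously" requirement concrete, here is a toy model of a cooperative stop-the-world in Rust. It is a sketch under simplifying assumptions: mutators check an atomic flag instead of whatever polling mechanism the JVM really uses, and the names Safepoint, poll, stop_the_world and world_is_stopped_during_op are invented for this example.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

struct Safepoint {
    armed: AtomicBool,    // set while a global operation is requested
    parked: Mutex<usize>, // number of mutators currently stopped
    changed: Condvar,     // signalled when `parked` or `armed` changes
}

impl Safepoint {
    /// Mutator side: a cheap check at a poll site.
    fn poll(&self) {
        if !self.armed.load(Ordering::Acquire) {
            return;
        }
        let mut parked = self.parked.lock().unwrap();
        *parked += 1;
        self.changed.notify_all();
        // Stay parked until the coordinator disarms the safepoint.
        while self.armed.load(Ordering::Acquire) {
            parked = self.changed.wait(parked).unwrap();
        }
        *parked -= 1;
    }

    /// Coordinator side: wait for every mutator to park, run `op`, release.
    fn stop_the_world<R>(&self, mutators: usize, op: impl FnOnce() -> R) -> R {
        self.armed.store(true, Ordering::Release);
        let mut parked = self.parked.lock().unwrap();
        while *parked < mutators {
            parked = self.changed.wait(parked).unwrap();
        }
        let result = op();
        self.armed.store(false, Ordering::Release);
        self.changed.notify_all();
        result
    }
}

/// Runs `mutators` threads that bump a counter and poll on every loop
/// iteration, then checks that the counter does not move during the stop.
fn world_is_stopped_during_op(mutators: usize) -> bool {
    let sp = Arc::new(Safepoint {
        armed: AtomicBool::new(false),
        parked: Mutex::new(0),
        changed: Condvar::new(),
    });
    let counter = Arc::new(AtomicUsize::new(0));
    let done = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..mutators)
        .map(|_| {
            let (sp, counter, done) =
                (Arc::clone(&sp), Arc::clone(&counter), Arc::clone(&done));
            thread::spawn(move || {
                while !done.load(Ordering::Acquire) {
                    counter.fetch_add(1, Ordering::Relaxed); // "mutate the heap"
                    sp.poll(); // poll site on the loop back-edge
                }
            })
        })
        .collect();
    let stable = sp.stop_the_world(mutators, || {
        let before = counter.load(Ordering::SeqCst);
        thread::sleep(Duration::from_millis(20));
        before == counter.load(Ordering::SeqCst)
    });
    done.store(true, Ordering::Release);
    for handle in handles {
        handle.join().unwrap();
    }
    stable
}
```

In this model, a mutator that never reaches poll keeps stop_the_world waiting forever. That is the general shape of the stall in the TL;DR, where the GC waits for a thread that cannot report in.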

The mechanism: the JVM pre-allocates two contiguous memory pages - a "bad" page (no access) and a "good" page (readable). The JIT compiler emits polling instructions at method returns and loop back-edges. To arm a safepoint, the VM switches threads' poll addresses from the good page to the bad page. Reading the bad page triggers a SIGSEGV (or access violation on...
