The Windows DLL loader lock: how a Rust thread can hang your JVM
Introduction
Several weeks ago, we encountered a silent, sporadic hang in our Windows CI pipeline. After a deep investigation, we uncovered a deadlock that left processes completely frozen, with no way to extract a Java stack trace.
This blog post walks through our debugging journey and includes low-level details about the Java Virtual Machine's garbage collection, Rust's thread-local storage, the JNI (Java Native Interface) attachment protocol, and a core Windows primitive known as the Loader Lock.
TL;DR:
On Windows, the OS holds the process-wide Loader Lock during thread termination (specifically while Rust's thread-local storage, or TLS, is being destroyed).
TLS destruction triggers jni-rs, which tries to detach the thread from the JVM. This step transitions the thread from "Native" to "VM" state, and because the GC is running, this transition is blocked at the Safepoint Barrier. The Rust thread waits for the GC to unpark it.
Simultaneously, the GC is waiting for a newly spawning Java thread to report in. However, this new thread cannot reach the safepoint; it is blocked in the OS initialization phase, waiting for the Loader Lock (held by the Rust thread).
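To make the cycle in the three points above concrete, here is a minimal Rust model of it, using only the standard library. It is a sketch under loud assumptions: a plain `Mutex` stands in for the Loader Lock, channels stand in for the JVM's safepoint signalling, and every name (`simulate_cycle`, `loader_lock`, `gc_patience`, and so on) is hypothetical rather than taken from QuestDB, jni-rs or HotSpot. Unlike the real hang, the model gives up after a timeout so that it can terminate.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Models the three-way wait. Returns true if the "GC" had to give up,
/// meaning the cycle did not resolve on its own within `gc_patience`.
/// All names here are hypothetical; this is not JVM or Windows code.
fn simulate_cycle(gc_patience: Duration) -> bool {
    // Stand-in for the process-wide Loader Lock.
    let loader_lock = Arc::new(Mutex::new(()));
    // exiting thread -> GC: "I now hold the loader lock"
    let (holding_tx, holding_rx) = mpsc::channel::<()>();
    // GC -> exiting thread: "the safepoint is over, carry on"
    let (gc_done_tx, gc_done_rx) = mpsc::channel::<()>();
    // starting thread -> GC: "I have reported in"
    let (reported_tx, reported_rx) = mpsc::channel::<()>();

    // 1. The exiting thread holds the lock and blocks until the GC finishes.
    let exiting = {
        let loader_lock = Arc::clone(&loader_lock);
        thread::spawn(move || {
            let _guard = loader_lock.lock().unwrap();
            holding_tx.send(()).unwrap();
            let _ = gc_done_rx.recv(); // parked at the "safepoint barrier"
        })
    };
    holding_rx.recv().unwrap();

    // 2. The starting thread needs the same lock before it can report in.
    let starting = {
        let loader_lock = Arc::clone(&loader_lock);
        thread::spawn(move || {
            let _guard = loader_lock.lock().unwrap(); // blocked behind step 1
            let _ = reported_tx.send(());
        })
    };

    // 3. The GC (this thread) waits for the starting thread to report in.
    let gc_gave_up = reported_rx.recv_timeout(gc_patience).is_err();

    // Break the cycle artificially so the demo can exit cleanly.
    gc_done_tx.send(()).unwrap();
    exiting.join().unwrap();
    starting.join().unwrap();
    gc_gave_up
}

fn main() {
    let stuck = simulate_cycle(Duration::from_millis(200));
    println!("cycle stayed stuck until the timeout: {}", stuck);
}
```

In the real process nothing plays the role of the timeout, so each party keeps waiting on the next one indefinitely.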
The First Clues: A Local Reproducer and Thread Dumps
Our CI pipeline runs a suite of tests on Linux, macOS and Windows using Azure Pipelines. On Windows, we noticed that some test suites would occasionally hang until the job timed out.
My first reflex was to reproduce the issue locally to gather more details. After a few attempts, the hang occurred, and I was able to capture a process dump.
With this process dump, I was able to extract native stacks using WinDbg and Java stacks using jhsdb. We found three clues:
The main thread was stuck in GC:
    "Time-limited test" #4053 daemon prio=5 tid=0x0000019183c06c30 nid=0x2270 waiting on condition [0x00000019f0afe000]
    java.lang.Thread.State: RUNNABLE
    JavaThread state: _thread_blocked
    - java.lang.Runtime.gc() @bci=0 (Compiled frame; information may be imprecise)
    - java.lang.System.gc() @bci=3, line=1907 (Compiled frame)
    - io.questdb.ServerMain.start(boolean) @bci=66, line=251 (Interpreted frame)
    - locked (a com.questdb.AbstractEntBootstrapTest$EntGriffinServerMain)
    - io.questdb.ServerMain.start() @bci=2, line=239 (Interpreted frame)
It was waiting for all threads to reach a "safepoint", a JVM mechanism for safely pausing threads during VM-level operations, including garbage collection.
Several Rust Tokio worker threads were in their on_thread_stop hook, making a JNI call.
    53 Id: 4994.575c Suspend: 0 Teb: 00000019`f07a8000 Unfrozen "tokio-runtime-worker"
    # Call Site
    00 ntdll!NtWaitForSingleObject+0x14
    01 KERNELBASE!WaitForSingleObjectEx+0x8e
    02 jvm!XXX+0x1cb607
    03 jvm!XXX+0x5e9b4
    04 jvm!XXX+0x624b9
    05 jvm!XXX+0x11a06f
    06 qdb_ent14818614347342639976!jni::wrapper::jnienv::JNIEnv::call_method_unchecked,jni::wrapper::objects::jmethodid::JMethodID>+0xa681 [C:\w\.cargo\registry\src\index.crates.io-1949cf8c6b5b557f\jni-0.21.1\src\wrapper\macros.rs @ 86]
    07 qdb_ent14818614347342639976!qdb_ent::call_method,qdb_ent::call_void_method::closure_env$0>+0xf0 [C:\w\questdb-ent\rust\qdb-ent\src\lib.rs @ 56]
    08 qdb_ent14818614347342639976!qdb_ent::call_void_method+0x49 [C:\w\questdb-ent\rust\qdb-ent\src\lib.rs @ 86]
    09 qdb_ent14818614347342639976!qdb_ent::tokio::ThreadLifetimeListener::on_thread_stop+0x35 [C:\w\questdb-ent\rust\qdb-ent\src\tokio.rs @ 46]
    0a qdb_ent14818614347342639976!qdb_ent::tokio::Java_com_questdb_tokio_TokioRuntime_create::closure$4+0xe [C:\w\questdb-ent\rust\qdb-ent\src\tokio.rs @ 101]
Note: we have hidden some addresses with XXX because the debug symbols were not available.
There were several unnamed threads lying around.
Aside: What is a Safepoint?
A safepoint is a point in execution where a thread's state is fully describable to the JVM: all object references reside in known locations (registers, stack slots, or heap), and no heap mutation is in flight. The JVM can only perform certain global operations - most notably GC - when all mutator threads are stopped at a safepoint simultaneously. (Since JDK 10, thread-local handshakes allow some operations on individual threads, but GC still requires a global stop.)
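As a rough illustration of such a global stop, the sketch below models a safepoint in Rust with an atomic flag that worker threads poll. This is a toy built on assumptions: the `Safepoint` type, `poll`, `stop_the_world` and `world_stays_stopped` are hypothetical names, and HotSpot's real polling and thread-state tracking are considerably more involved. It only demonstrates the property that matters for this story: the global operation cannot begin until every mutator thread has checked in.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Toy safepoint: a flag that mutators poll, plus a count of parked mutators.
struct Safepoint {
    armed: AtomicBool,
    parked: Mutex<usize>,
    changed: Condvar,
}

impl Safepoint {
    fn new() -> Self {
        Safepoint {
            armed: AtomicBool::new(false),
            parked: Mutex::new(0),
            changed: Condvar::new(),
        }
    }

    /// Called by mutators at "safe" places; parks while a stop is requested.
    fn poll(&self) {
        if !self.armed.load(Ordering::Acquire) {
            return; // fast path: no stop requested
        }
        let mut parked = self.parked.lock().unwrap();
        *parked += 1;
        self.changed.notify_all(); // tell the VM thread we have arrived
        while self.armed.load(Ordering::Acquire) {
            parked = self.changed.wait(parked).unwrap();
        }
        *parked -= 1;
    }

    /// Called by the "VM thread": runs `op` only once all mutators are parked.
    fn stop_the_world<R>(&self, mutators: usize, op: impl FnOnce() -> R) -> R {
        self.armed.store(true, Ordering::Release);
        let mut parked = self.parked.lock().unwrap();
        while *parked < mutators {
            // Waits forever if any mutator never reaches poll().
            parked = self.changed.wait(parked).unwrap();
        }
        let result = op();
        // Disarm while still holding the lock so no wakeup is lost.
        self.armed.store(false, Ordering::Release);
        drop(parked);
        self.changed.notify_all();
        result
    }
}

/// Spawns `mutators` workers, stops them, and checks that nothing changes
/// while they are stopped.
fn world_stays_stopped(mutators: usize) -> bool {
    let sp = Arc::new(Safepoint::new());
    let stop = Arc::new(AtomicBool::new(false));
    let counter = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..mutators)
        .map(|_| {
            let (sp, stop, counter) =
                (Arc::clone(&sp), Arc::clone(&stop), Arc::clone(&counter));
            thread::spawn(move || {
                while !stop.load(Ordering::Acquire) {
                    counter.fetch_add(1, Ordering::Relaxed); // "mutate the heap"
                    sp.poll(); // poll once per loop iteration
                }
            })
        })
        .collect();

    let unchanged = sp.stop_the_world(mutators, || {
        let before = counter.load(Ordering::SeqCst);
        thread::sleep(Duration::from_millis(50));
        before == counter.load(Ordering::SeqCst)
    });

    stop.store(true, Ordering::Release);
    for handle in handles {
        handle.join().unwrap();
    }
    unchanged
}

fn main() {
    println!("world stayed stopped: {}", world_stays_stopped(4));
}
```

The comment inside `stop_the_world` is the important part of the model: if a single expected thread never checks in, the caller waits forever.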
The mechanism: the JVM pre-allocates two contiguous memory pages - a "bad" page (no access) and a "good" page (readable). The JIT compiler emits polling instructions at method returns and loop back-edges. To arm a safepoint, the VM switches threads' poll addresses from the good page to the bad page. Reading the bad page triggers a SIGSEGV (or access violation on...