How thread locals work on macOS

New Babylon

2026-05-30

I spent a significant amount of time figuring out how to implement thread-local inspection support for a debugger on macOS, so I hope this article will be helpful to someone.

If you want to understand thread locals from the point of view of a static linker, there's a good article by Jakub Konka. Also, I'm not currently interested in Intel Macs, so this article covers only Apple Silicon.

Let's start with a simple code example:

```c
#include <stdio.h>

thread_local int first = 42;
thread_local int second = 390;

int main() {
    int* thread_local_addr1 = &first;
    int* thread_local_addr2 = &second;

    // %p expects a void*, hence the casts
    printf("The addr of the 1st thread local: %p\n", (void*)thread_local_addr1);
    printf(" 2nd thread local: %p\n", (void*)thread_local_addr2);

    return 0;
}
```

You can compile it like this:

```
$ clang -std=c23 thread_locals.c -g -o thread_locals
```

(I use C23 to get the thread_local keyword; otherwise you can just use __thread with Clang.)

Let's step into the code using a debugger:

That code stores the address of the __thread_vars section in x0, adds an offset that's a multiple of 24, and calls the function whose address is stored in the first 8 bytes at the resulting location.

We can read the memory at the address:

You may notice that the value 88 83 08 8e 01 appears again 24 bytes after the start. That's the address of the function we branch to.
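The 24-byte spacing follows from the size of one thread-local descriptor. As a minimal sketch, here is that layout checked at compile time, using the field order of dyld's TLV_Thunkv2 structure shown later in this article. The `uint64_t` type for `func` and the helper `descriptor_addr` are assumptions of this sketch, not dyld names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of dyld's TLV_Thunkv2. func is a uint64_t here (an assumption of
   this sketch) so the layout is the same on any host. */
typedef struct {
    uint64_t func;        /* address of __tlv_get_addr */
    uint32_t key;         /* pthread key */
    uint32_t offset;      /* offset inside the per-thread allocation */
    int32_t  init_delta;  /* self-relative offset to the initial content */
    uint32_t init_size;   /* size of the allocation */
} TLV_Thunkv2;

/* 8 + 4 + 4 + 4 + 4 bytes: one descriptor every 24 bytes. */
static_assert(sizeof(TLV_Thunkv2) == 24, "one descriptor per 24 bytes");
/* The function pointer occupies the first 8 bytes. */
static_assert(offsetof(TLV_Thunkv2, func) == 0, "function pointer comes first");

/* Hypothetical helper: address of the index-th descriptor in __thread_vars. */
static uint64_t descriptor_addr(uint64_t thread_vars_start, uint64_t index) {
    return thread_vars_start + index * sizeof(TLV_Thunkv2);
}
```

With a made-up section start of 0x100008000, `descriptor_addr(0x100008000, 1)` returns 0x100008018, which is where the second function pointer shows up in the memory dump.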

If we disassemble that address we get the following code:

That's actually the code of __tlv_get_addr from dyld, and the source code is available on GitHub. Note that all of the dyld code snippets below are simplified to include only the ARM64 parts:

```asm
// Parameters: x0 = descriptor
// Result: x0 = address of TLV
__tlv_get_addr:
    ldr  w16, [x0, #8]           // get key from descriptor (TLV_Thunkv2.key)
    mrs  x17, TPIDRRO_EL0
    and  x17, x17, #-8           // clear low 3 bits
    ldr  x17, [x17, x16, lsl #3] // get thread allocation address for this key
    cbz  x17, LlazyAllocate      // if NULL, lazily allocate
    ldr  w16, [x0, #12]          // get offset from descriptor (TLV_Thunkv2.offset)
    add  x0, x17, x16            // return allocation+offset
    ret  lr
```

x0 points to a structure like this:

```cpp
// runtime structure of 64-bit arch thread-local thunk
struct TLV_Thunkv2
{
    // points to __tlv_get_addr
    void* func;
    // pthread key
    uint32_t key;
    // offset from the start of the memory allocation
    uint32_t offset;
    // if zero, then content is all zeros,
    // otherwise offset from the address of the field
    int32_t initialContentDelta;
    // the size of the allocation block
    uint32_t initialContentSize;
};
```

The only thing LlazyAllocate does is save the registers and call ThreadLocalVariables::instantiateVariable, which allocates and initializes the memory block:

```cpp
void* ThreadLocalVariables::instantiateVariable(const Thunk& thunk)
{
    void* buffer = nullptr;
    dyld_thread_key_t key = 0;

    TLV_Thunkv2* thunkv2 = (TLV_Thunkv2*)&thunk;
    key = thunkv2->key;
    if (thunkv2->initialContentDelta != 0) {
        // initial content of thread-locals is non-zero so copy initial bytes from template
        buffer = malloc(thunkv2->initialContentSize);
        const uint8_t* initialContent = (uint8_t*)(&thunkv2->initialContentDelta) +
                                        thunkv2->initialContentDelta;
        memcpy(buffer, initialContent, thunkv2->initialContentSize);
    }
    else {
        // initial content of thread-locals is all zeros
        buffer = calloc(thunkv2->initialContentSize, 1);
    }

    // set this thread's value for key to be the new buffer.
    dyld_thread_setspecific(key, buffer);

    return buffer;
}
```

A single memory allocation is used for both thread locals here. I don't know how it is decided when thread locals get split into separate sections.
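To make the lookup and the lazy allocation easier to follow, here is a portable model of the two pieces working together. It is only a sketch: `FakeThread`, `FakeThunk` and `fake_tlv_get_addr` are invented names, the slot array stands in for what TPIDRRO_EL0 points to, and the initial content is a plain pointer instead of dyld's self-relative delta.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLOT_COUNT 16   /* arbitrary size chosen for this model */

/* Stand-in for one thread's "thread specific data slots" array. */
typedef struct {
    void* slots[SLOT_COUNT];
} FakeThread;

/* Simplified descriptor (hypothetical, not dyld's layout). */
typedef struct {
    uint32_t    key;              /* index into the slots array */
    uint32_t    offset;           /* offset of the variable in the block */
    const void* initial_content;  /* NULL means "all zeros" */
    uint32_t    initial_size;     /* size of the whole block */
} FakeThunk;

/* Models __tlv_get_addr plus the LlazyAllocate path. */
static void* fake_tlv_get_addr(FakeThread* thread, const FakeThunk* thunk) {
    void* block = thread->slots[thunk->key];
    if (block == NULL) {
        /* First access on this thread: allocate a zeroed block and,
           if there is a template, copy the initial bytes into it. */
        block = calloc(thunk->initial_size, 1);
        if (block == NULL) {
            return NULL;
        }
        if (thunk->initial_content != NULL) {
            memcpy(block, thunk->initial_content, thunk->initial_size);
        }
        thread->slots[thunk->key] = block;
    }
    /* Same as "add x0, x17, x16": allocation + offset. */
    return (char*)block + thunk->offset;
}
```

Two descriptors that share a key but have different offsets resolve into the same per-thread block, which matches the single allocation observed for `first` and `second`. Two `FakeThread` values get independent blocks, each initialized from the same template.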

Now you've seen (most of) the code the program executes to get an address for a thread local. But how does one compute the address in a debugger?

Well, the debug info actually provides the address of the TLV_Thunkv2 descriptor relative to the start of the module:

So the only thing you need to do in a debugger is read the pointer to __tlv_get_addr and call it, like LLDB does.

I didn't want to implement function calling just to support thread locals, so I figured out an alternative implementation:

The hardest thing was getting the value of the tpidrro_el0 system register. The problem is that there is no API to read system register state for a thread, but it so happens that thread_identifier_info->thread_handle contains the same address. It points to the pthread's internal "thread specific data slots" array. I found this out after reading the code of the kernel and pthreads.

You can get the value like this:

```c
// Requires <mach/mach.h>; thread_port is the Mach port of the inspected thread.
struct thread_identifier_info identifier_info;
mach_msg_type_number_t count = THREAD_IDENTIFIER_INFO_COUNT;
kern_return_t kr = thread_info(thread_port,
                               THREAD_IDENTIFIER_INFO,
                               (thread_info_t)&identifier_info,
                               &count);

// Only meaningful if kr == KERN_SUCCESS.
uint64_t tsd_vaddr = identifier_info.thread_handle;
```

The rest is comparatively trivial:

```c
uint64_t thunk_offset = // ...
uint64_t tsd_vaddr = // ...
uint64_t result = 0;

typedef struct {
    uint64_t func;
    uint32_t key;
    uint32_t offset;
    int32_t init_delta;
    uint32_t init_size;
} TLV_Thunkv2;

TLV_Thunkv2 tlv_thunk = {0};
// assuming the memory...
```
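Based on the `__tlv_get_addr` code shown earlier, the remaining steps can be sketched as follows. This is not necessarily how the original implementation continues. The `read_memory_fn` callback, `tlv_address`, `BufferMemory` and `buffer_read` are hypothetical names introduced here, and the sketch assumes that `thunk_offset` is relative to the module's load address and that the debugger runs on a host with the same endianness as the target.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t func;
    uint32_t key;
    uint32_t offset;
    int32_t  init_delta;
    uint32_t init_size;
} TLV_Thunkv2;

/* Hypothetical callback: copy `size` bytes from the debuggee's address
   `addr` into `out`; returns false if the read fails. */
typedef bool (*read_memory_fn)(void* ctx, uint64_t addr, void* out, size_t size);

/* Computes the address of a thread local without calling into the debuggee.
   Returns false if a read fails or the thread has not allocated its block. */
static bool tlv_address(read_memory_fn read_mem, void* ctx,
                        uint64_t module_base, uint64_t thunk_offset,
                        uint64_t tsd_vaddr, uint64_t* result) {
    TLV_Thunkv2 thunk;
    if (!read_mem(ctx, module_base + thunk_offset, &thunk, sizeof thunk)) {
        return false;
    }
    /* Mirrors "and x17, x17, #-8"; a no-op if the value is already aligned. */
    uint64_t tsd = tsd_vaddr & ~(uint64_t)7;
    /* Mirrors "ldr x17, [x17, x16, lsl #3]": slot number `key`, 8 bytes each. */
    uint64_t block = 0;
    if (!read_mem(ctx, tsd + (uint64_t)thunk.key * 8, &block, sizeof block)) {
        return false;
    }
    if (block == 0) {
        return false;  /* the thread has not touched these thread locals yet */
    }
    *result = block + thunk.offset;
    return true;
}

/* Example reader over a local buffer that pretends to start at `base`.
   A real debugger would read the target task's memory instead. */
typedef struct {
    uint64_t       base;
    const uint8_t* bytes;
    size_t         size;
} BufferMemory;

static bool buffer_read(void* ctx, uint64_t addr, void* out, size_t size) {
    const BufferMemory* mem = ctx;
    if (addr < mem->base || addr - mem->base > mem->size) {
        return false;
    }
    size_t start = (size_t)(addr - mem->base);
    if (size > mem->size - start) {
        return false;
    }
    memcpy(out, mem->bytes + start, size);
    return true;
}
```

Treating a NULL slot as "not available" avoids having to reproduce the lazy allocation in the debugger. In that case the variable still holds its initial value, which could in principle be read from the template that `init_delta` points to.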
