Building a Fast Lock-Free Queue in Modern C++ From Scratch
22 May 2026<br>Building a Fast Lock-Free Queue in Modern C++ From Scratch
For those just interested in the code you can find it here and if you want to see it action in an actual project you can check this out!
Now to set the context for those who might be new to this, typically a Queue is one of the most fundamental data structures in computer science, and it is used everywhere in software development. A queue is a simple first-in-first-out (FIFO) structure where you can Push items to the back and Pop items from the front. In single-threaded code, there are already implementations in most standard libraries, for instance in C++ its std::queue, and even implement it from sratch is pretty trivial.
So why all the trouble you ask?
Its simply because when we move over to a multi threaded environemnt(a typical requirement for a whole lot of real world applications), the whole thing becomes a whole lot more complex, now you have multiple threads contending or fighting for the data in your queue, and if you dont manage it properly you get the classical Undefined Behaviour, that is one theads makes some chnages while another tires to read causing all sort of corruptions.
Now before you proceed, its very important to ask yourself,
Do you reallly need a fast lock-free queue?
Then answer is for most applications you typically dont. You are all good with typicall mutex based synchronizatons you can add on a regular queue.
However, once you start dealing with more demanding things like high frequency trading systems, real time game engines, audio synthesis pipelines, massive data ingestion systems, or any situtuation where you got a large number of threads trying to communicate, the performance of the queue suddenly starts mattering a lot. In these systems even a fraction of a millisecond of delay in one queue operation cascades into stuttering audio, dropped frames, missed trades, or just an overall sluggish system. So we really do want our queue to be fast, ideally something that scales linearly with the number of threads instead of falling apart the moment you have more than a couple of producers and consumers.
For me personally, I was working on optimizing my software renderer by re-architecturing thing a bit, and this seemed to be one of the bigged contenders with all the multithreaded CPU work going on.
The Idea? (Motivation & Baselines)
Alright, almost every C++ tutorial about threading tells you to just slap a std::mutex on top of a std::queue and call it a day. And honestly for 90% of all cases that is exactly the right call, mutexes are well tested, easy to reason about, and the standard library implementation is genuinely fast for low contention scenarios. So lets not throw that away, lets start there. We will write the simplest possible thread safe queue, call it SimpleQueue, and use it as our baseline to compare everything against.
The implementation is essentially what you would write as a starting point for any multi-threaded application. We wrap a std::queue with a std::mutex, and every Push and Pop operation just acquires the mutex, does the queue operation, and releases the mutex. We also make it non copyable and non movable because, well, copying a mutex doesnt make any sense. And we add a threadId parameter to all the methods even though we dont use it, because all our queues later will need it, and keeping the API identical makes benchmarking and comparing the queues easier.
template typename T><br>class SimpleQueue {<br>public:<br>SimpleQueue() = default;<br>SimpleQueue(SimpleQueueT>&) = delete;<br>SimpleQueue(SimpleQueueT>&&) = delete;
void Push(T* item, uint32_t threadId) {<br>(void)threadId;<br>std::unique_lock lock(m_Mutex);<br>m_Queue.push(*item);
bool Pop(T* item, uint32_t threadId) {<br>(void)threadId;<br>std::unique_lock lock(m_Mutex);<br>if (m_Queue.empty()) {<br>return false;<br>*item = std::move(m_Queue.front());<br>m_Queue.pop();<br>return true;
// This is an extremely handy function which we can add to the<br>// Queue API so that we can use it to check if a queue is filled<br>// or not and this will be very useful for a lot of situations<br>// where you could technically put a thread to sleep based on<br>// this value.<br>bool IsEmpty(uint32_t threadId) {<br>(void)threadId;<br>std::unique_lock lock(m_Mutex);<br>return m_Queue.empty();
private:<br>std::queueT> m_Queue;<br>mutable std::mutex m_Mutex;<br>};<br>Looks incredibly clean right? It works, its correct, and for 2 or 3 threads gently feeding each other data, its honestly pretty fast. But ask yourself, what actually happens when you have 16 threads all hammering this queue thousands of times per second? What does the mutex really do under the hood when a thread tries to lock something that is already locked by another thread? The thread cant just spin forever wasting CPU, so what does it do?
The OS scheduler steps in, and it does something really expensive called a context switch . When a thread fails to grab a lock, the kernel marks it...