Concurrency & Multithreading
Modern C++ provides powerful built-in threading primitives. This is essential for high-performance systems, trading engines, and any application that needs to utilize multiple CPU cores.
Table of Contents
1 – Threads
2 – Mutexes
3 – Condition Variables
4 – Atomics
5 – Futures & Promises
6 – Thread-Safe Patterns
7 – Parallel Algorithms (C++17)
8 – Common Concurrency Bugs
Glossary – Key Terms at a Glance
| Term | Meaning |
|---|---|
| Thread | Independent execution path sharing the same address space |
| Mutex | Mutual exclusion – only one thread can hold it at a time |
| Data race | Two threads access the same data, at least one writes, no sync → UB |
| Deadlock | Two+ threads each waiting for the other's lock → infinite wait |
| Condition variable | Efficiently wait for a condition without busy-looping |
| Atomic | Operation that completes indivisibly – no partial reads/writes |
| Memory ordering | Controls how memory operations are seen across threads |
| Future/Promise | One-shot channel: promise sets a value, future retrieves it |
| jthread | C++20 auto-joining thread with cooperative cancellation |
| False sharing | Different threads write to the same cache line → performance killer |
1 – Threads
1.1 Creating & Joining Threads
#include <iostream>
#include <thread>
void work(int id) {
std::cout << "Thread " << id << " running\n";
}
int main() {
std::thread t1(work, 1); // launch thread with function + args
std::thread t2(work, 2);
t1.join(); // wait for t1 to finish
t2.join(); // wait for t2 to finish
// MUST join or detach before thread goes out of scope
// Otherwise: std::terminate() is called!
}
Threads with Lambdas:
int result = 0;
std::thread t([&result]() {
result = expensive_computation();
});
t.join();
std::cout << result;
Hardware Concurrency:
unsigned int cores = std::thread::hardware_concurrency();
// Returns number of logical cores (e.g., 8); may return 0 if undeterminable
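Putting the two ideas together, here is a minimal sketch of spawning one thread per logical core to split a workload. The helper name `parallel_sum` and the chunking scheme are assumptions for illustration, not a standard facility:

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: one thread per logical core, each summing its own chunk.
long long parallel_sum(const std::vector<int>& v) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;  // hardware_concurrency() may return 0 if unknown
    n = std::min<unsigned>(n, static_cast<unsigned>(v.empty() ? 1 : v.size()));

    std::vector<long long> partial(n, 0);   // one slot per thread: no sharing
    std::vector<std::thread> threads;
    std::size_t chunk = v.size() / n;
    for (unsigned i = 0; i < n; ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = (i == n - 1) ? v.size() : begin + chunk;
        threads.emplace_back([&v, &partial, i, begin, end] {
            partial[i] = std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();       // every thread joined before return
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each thread writes only its own `partial[i]`, so no mutex is needed; the `join` loop guarantees all writes are visible before the final accumulate.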
⚠️ Detached Threads: t.detach() runs the thread independently with no way to join later. Use rarely – hard to control lifetime. Prefer jthread (C++20).
1.2 jthread (C++20)
#include <thread>
void work(std::stop_token stoken) {
while (!stoken.stop_requested()) {
// do work
}
}
{
std::jthread t(work);
// ... do other stuff ...
} // t automatically joined here + stop requested
💡 Why jthread? It auto-joins in the destructor (no more std::terminate from forgetting to join) and supports cooperative cancellation via stop_token.
2 – Mutexes
2.1 lock_guard, unique_lock, scoped_lock
Definition: A mutex (mutual exclusion) ensures only one thread accesses a critical section at a time. Always use RAII wrappers – never raw .lock()/.unlock().
⚠️ Anti-Pattern – Raw lock/unlock:
std::mutex mtx;
mtx.lock();
++counter; // if exception thrown here → deadlock!
mtx.unlock();
std::lock_guard – RAII Mutex Lock (simplest, prefer this):
void increment(int times) {
for (int i = 0; i < times; ++i) {
std::lock_guard<std::mutex> lock(mtx); // locks on construction
++counter;
} // automatically unlocks when lock goes out of scope – EXCEPTION-SAFE
}
std::unique_lock – Flexible Locking:
void process() {
std::unique_lock<std::mutex> lock(mtx);
// ... do work with lock held ...
lock.unlock(); // can manually unlock
// ... do work without lock ...
lock.lock(); // can re-lock
} // automatically unlocks if still locked
// Can also defer locking:
std::unique_lock<std::mutex> lock(mtx, std::defer_lock);
lock.lock(); // lock when ready
std::scoped_lock (C++17) – Lock Multiple Mutexes (deadlock-free):
std::mutex mtx1, mtx2;
void transfer() {
std::scoped_lock lock(mtx1, mtx2); // locks both, deadlock-free
// ... work with both protected resources ...
}
Mutex Types Summary:
| Type | Description |
|---|---|
| std::mutex | Basic non-recursive mutex |
| std::recursive_mutex | Same thread can lock multiple times |
| std::timed_mutex | Has try_lock_for / try_lock_until |
| std::shared_mutex | Reader-writer lock (C++17) |
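The timed variant pairs naturally with std::unique_lock, which has a constructor that attempts try_lock_for internally. A small sketch (the function name and 50 ms budget are illustrative assumptions):

```cpp
#include <chrono>
#include <mutex>

std::timed_mutex tm;

// Hypothetical: try to do work, but back off instead of blocking forever.
bool try_do_work() {
    // This unique_lock constructor calls tm.try_lock_for(50ms) for us.
    std::unique_lock<std::timed_mutex> lk(tm, std::chrono::milliseconds(50));
    if (!lk.owns_lock()) return false;  // contended: gave up after 50 ms
    // ... critical section; RAII releases the lock on return ...
    return true;
}
```

This keeps the RAII discipline from above while still bounding how long a thread can be stuck waiting.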
2.2 Reader-Writer Lock (C++17)
#include <shared_mutex>
std::shared_mutex rw_mtx;
std::map<std::string, int> cache;
int read(const std::string& key) {
std::shared_lock lock(rw_mtx); // multiple readers OK
return cache.at(key);
}
void write(const std::string& key, int value) {
std::unique_lock lock(rw_mtx); // exclusive access
cache[key] = value;
}
💡 Why Reader-Writer Lock? If reads vastly outnumber writes, shared_mutex allows concurrent readers while still ensuring exclusive write access. Much better throughput than a plain mutex.
3 – Condition Variables
3.1 Producer-Consumer Pattern
Definition: A condition variable lets a thread efficiently wait for a condition to become true, without busy-waiting (spinning).
#include <condition_variable>
#include <queue>
std::mutex mtx;
std::condition_variable cv;
std::queue<int> queue;
bool done = false;
// Producer
void producer() {
for (int i = 0; i < 100; ++i) {
{
std::lock_guard lock(mtx);
queue.push(i);
}
cv.notify_one(); // wake one waiting consumer
}
{
std::lock_guard lock(mtx);
done = true;
}
cv.notify_all(); // wake all – production done
}
// Consumer
void consumer() {
while (true) {
std::unique_lock lock(mtx);
cv.wait(lock, []{ return !queue.empty() || done; });
// Atomically: releases lock, sleeps until notified, re-acquires lock
// Lambda prevents spurious wakeups
if (queue.empty() && done) break;
int item = queue.front();
queue.pop();
lock.unlock();
process(item);
}
}
💡 Why the Lambda in cv.wait? Condition variables can wake up spuriously (without actually being notified). The predicate lambda ensures we only proceed when the condition is truly met.
4 – Atomics
4.1 Atomic Operations & Memory Ordering
Definition: An atomic operation completes indivisibly – no other thread can see a partial state. For simple shared variables, atomics are faster than mutexes.
#include <atomic>
std::atomic<int> counter{0};
void increment(int times) {
for (int i = 0; i < times; ++i) {
counter.fetch_add(1, std::memory_order_relaxed);
// Or simply: ++counter; (uses seq_cst by default)
}
}
// Atomic operations
counter.load(); // read
counter.store(42); // write
counter.exchange(10); // swap, return old value
counter.fetch_add(5); // add, return old value
counter.fetch_sub(3); // subtract, return old value
// Compare-and-swap (CAS) – foundation of lock-free algorithms
int expected = 10;
bool success = counter.compare_exchange_strong(expected, 20);
// If counter == 10: sets to 20, returns true
// If counter != 10: sets expected to current value, returns false
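The classic way to use CAS is a retry loop: because a failed compare_exchange reloads the current value into `expected`, the loop simply re-checks and tries again. A sketch of a lock-free "record the maximum" update (the names `max_seen` and `record` are assumptions for illustration):

```cpp
#include <atomic>

std::atomic<int> max_seen{0};

// Lock-free max update via a CAS retry loop.
void record(int value) {
    int current = max_seen.load(std::memory_order_relaxed);
    while (value > current &&
           !max_seen.compare_exchange_weak(current, value)) {
        // CAS failed: another thread changed max_seen. `current` now holds
        // the fresh value, so the loop condition re-checks before retrying.
    }
}
```

compare_exchange_weak may fail spuriously on some architectures, which is harmless inside a loop like this; use the _strong form when a single non-looping attempt must be definitive.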
| Order | Guarantees | Performance |
|---|---|---|
| memory_order_relaxed | No ordering, just atomicity | Fastest |
| memory_order_acquire | Reads after this see writes from release | Medium |
| memory_order_release | Writes before this are visible to acquire | Medium |
| memory_order_acq_rel | Both acquire and release | Medium |
| memory_order_seq_cst | Total order across all threads | Slowest (default) |
Producer-Consumer with Acquire-Release:
std::atomic<bool> ready{false};
int data = 0;
// Thread 1 (producer)
data = 42;
ready.store(true, std::memory_order_release);
// Everything before store is visible to thread that loads with acquire
// Thread 2 (consumer)
while (!ready.load(std::memory_order_acquire)) { /* spin */ }
assert(data == 42); // guaranteed to see 42
4.2 Spinlock with atomic_flag
class SpinLock {
std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
void lock() {
while (flag.test_and_set(std::memory_order_acquire)) {
// spin – busy wait
}
}
void unlock() {
flag.clear(std::memory_order_release);
}
};
💡 When to use a spinlock: Only when the critical section is extremely short (nanoseconds) and you can't afford the overhead of a kernel mutex. Common in low-latency trading systems.
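Because SpinLock exposes lock()/unlock(), it satisfies the BasicLockable requirement, so the RAII wrappers from section 2 work with it unchanged. A self-contained sketch repeating the class (the demo names `hammer` and `run_hammer_demo` are assumptions):

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

SpinLock spin;
long counter = 0;   // protected by `spin`

// Same RAII pattern as with std::mutex — SpinLock is BasicLockable.
void hammer(int times) {
    for (int i = 0; i < times; ++i) {
        std::lock_guard<SpinLock> lock(spin);
        ++counter;
    }
}

long run_hammer_demo() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(hammer, 10'000);
    for (auto& t : threads) t.join();
    return counter;   // 4 threads x 10'000 increments
}
```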
5 – Futures & Promises
5.1 async, promise, packaged_task
std::async – simplest async execution:
#include <future>
int compute() {
std::this_thread::sleep_for(std::chrono::seconds(2));
return 42;
}
auto future = std::async(std::launch::async, compute);
// Launches in separate thread
// Do other work while compute runs...
int result = future.get(); // blocks until result ready – returns 42
Launch Policies:
auto f1 = std::async(std::launch::async, task); // definitely new thread
auto f2 = std::async(std::launch::deferred, task); // lazy – runs on .get()
auto f3 = std::async(std::launch::async | std::launch::deferred, task); // impl decides
std::promise and std::future – manual channel:
std::promise<int> p;
std::future<int> f = p.get_future();
std::thread t([&p]() {
int result = heavy_computation();
p.set_value(result); // fulfills the promise
});
int result = f.get(); // blocks until value is set
t.join();
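A promise can carry an exception instead of a value: set_exception stores it, and the waiting thread's get() rethrows it. A sketch with a simulated failure (the function name and error message are illustrative assumptions):

```cpp
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical: worker fails, main thread observes the exception via get().
std::string risky_call() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    std::thread t([&p] {
        try {
            throw std::runtime_error("device unavailable");  // simulated failure
        } catch (...) {
            p.set_exception(std::current_exception());       // forward it
        }
    });
    std::string msg;
    try {
        f.get();                              // rethrows the stored exception
    } catch (const std::runtime_error& e) {
        msg = e.what();
    }
    t.join();
    return msg;
}
```

std::async does this automatically: an exception escaping the task is stored in the future and rethrown on get().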
std::packaged_task – wraps a callable with a future:
std::packaged_task<int(int, int)> task([](int a, int b) { return a + b; });
auto future = task.get_future();
std::thread t(std::move(task), 3, 4);
int result = future.get(); // 7
t.join();
6 – Thread-Safe Patterns
6.1 Singleton, Thread-Safe Queue, call_once
Thread-Safe Singleton (C++11 guaranteed – "magic statics"):
class Singleton {
public:
static Singleton& instance() {
static Singleton inst; // thread-safe since C++11
return inst;
}
Singleton(const Singleton&) = delete;
Singleton& operator=(const Singleton&) = delete;
private:
Singleton() = default;
};
Thread-Safe Queue:
template<typename T>
class ThreadSafeQueue {
std::queue<T> queue;
mutable std::mutex mtx;
std::condition_variable cv;
public:
void push(T value) {
{
std::lock_guard lock(mtx);
queue.push(std::move(value));
}
cv.notify_one();
}
T pop() {
std::unique_lock lock(mtx);
cv.wait(lock, [this]{ return !queue.empty(); });
T value = std::move(queue.front());
queue.pop();
return value;
}
bool try_pop(T& value) {
std::lock_guard lock(mtx);
if (queue.empty()) return false;
value = std::move(queue.front());
queue.pop();
return true;
}
};
💡 Why mutable on the mutex? So that const member functions (like a hypothetical empty()) can still lock the mutex. The mutex protects shared state but isn't part of the logical const-ness.
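Here is the queue in action with one producer and one consumer. The block repeats a trimmed copy of the class (push/pop only) so it is self-contained; the sentinel value -1 and the `queue_demo` name are assumptions for illustration:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template<typename T>
class ThreadSafeQueue {
    std::queue<T> queue;
    std::mutex mtx;
    std::condition_variable cv;
public:
    void push(T value) {
        { std::lock_guard<std::mutex> lock(mtx); queue.push(std::move(value)); }
        cv.notify_one();   // notify outside the lock: waiter can run immediately
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this]{ return !queue.empty(); });
        T value = std::move(queue.front());
        queue.pop();
        return value;
    }
};

// One producer, one consumer; -1 is a sentinel meaning "no more items".
long queue_demo() {
    ThreadSafeQueue<int> q;
    std::thread producer([&q] {
        for (int i = 1; i <= 100; ++i) q.push(i);
        q.push(-1);
    });
    long sum = 0;
    for (int item = q.pop(); item != -1; item = q.pop()) sum += item;
    producer.join();
    return sum;   // 1 + 2 + ... + 100
}
```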
std::call_once – execute initialization exactly once:
std::once_flag init_flag;
void initialize() {
std::call_once(init_flag, []() {
// This runs exactly once, thread-safe
global_resource = expensive_init();
});
}
7 – Parallel Algorithms (C++17)
7.1 Execution Policies
#include <algorithm>
#include <execution>
#include <numeric>   // std::reduce, std::transform_reduce
std::vector<int> v(10'000'000);
// Sequential (default)
std::sort(v.begin(), v.end());
// Parallel
std::sort(std::execution::par, v.begin(), v.end());
// Parallel + vectorized
std::sort(std::execution::par_unseq, v.begin(), v.end());
// Other parallel algorithms
std::for_each(std::execution::par, v.begin(), v.end(),
[](int& x) { x *= 2; });
std::reduce(std::execution::par, v.begin(), v.end(), 0); // parallel sum
std::transform_reduce(std::execution::par, v.begin(), v.end(),
0, std::plus{}, [](int x) { return x * x; }); // parallel sum of squares
Tip: Parallel algorithms are only worth it for large data sets. For small containers, the thread creation overhead outweighs the parallel speedup.
8 – Common Concurrency Bugs
8.1 Deadlock, Data Race, False Sharing
⚠️ Deadlock:
// Thread 1: lock A, then lock B
// Thread 2: lock B, then lock A
// → DEADLOCK – each waits for the other
Solution: Always lock in the same order, or use std::scoped_lock:
std::scoped_lock lock(mtx_a, mtx_b); // deadlock-free
⚠️ Data Race:
int x = 0;
// Thread 1: x = 1;
// Thread 2: std::cout << x;
// → DATA RACE (undefined behavior)
Solution: Mutex or std::atomic<int> x{0};
⚠️ False Sharing: Two threads write to different variables that sit on the same cache line:
// BAD: counters[0] and counters[1] share a cache line
std::atomic<int> counters[NUM_THREADS];
// GOOD: pad each counter to its own cache line
struct alignas(64) PaddedCounter {
std::atomic<int> count{0};
};
PaddedCounter counters[NUM_THREADS];
💡 Why this matters: Even though the variables are independent, they share a 64-byte cache line. Every write invalidates the other thread's cached copy, causing constant cache misses (up to 100x slower).
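The padding can be verified at compile time. A small sketch; 64 bytes is a common line size (where supported, std::hardware_destructive_interference_size in <new> reports the target's actual value):

```cpp
#include <atomic>
#include <cstddef>

struct alignas(64) PaddedCounter {
    std::atomic<int> count{0};
};

// alignas(64) forces both the alignment and (via tail padding) the size,
// so consecutive array elements land on distinct cache lines.
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");
static_assert(alignof(PaddedCounter) == 64, "aligned to a line boundary");
```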
Practice Questions
Q1. What happens if you forget to join() or detach() a std::thread before it goes out of scope? How does jthread solve this?
Q2. Explain the difference between lock_guard, unique_lock, and scoped_lock. When would you use each?
Q3. Write a producer-consumer program using condition variables with proper spurious wakeup handling.
Q4. What is the difference between memory_order_relaxed and memory_order_seq_cst? When is relaxed ordering safe to use?
Q5. Implement a simple spinlock using std::atomic_flag. When is it better than a mutex?
Q6. Explain the difference between std::async, std::promise/std::future, and std::packaged_task.
Q7. Write a thread-safe counter class using std::atomic. Compare its performance to a mutex-based version.
Q8. What is false sharing? How do you detect and fix it?
Q9. What guarantees does C++11 provide for initialization of local static variables (magic statics)?
Q10. Design a thread pool that accepts tasks and distributes them across a fixed number of worker threads.
Key Takeaways
- Use std::lock_guard / std::scoped_lock – never raw .lock()/.unlock()
- Use std::atomic for simple shared counters – faster than a mutex
- Use std::async for fire-and-forget parallelism
- Avoid std::recursive_mutex – usually indicates a design problem
- Use std::jthread (C++20) – auto-joins, supports cancellation
- Beware of false sharing – pad atomic variables to cache line boundaries
- Test with ThreadSanitizer: g++ -fsanitize=thread
- Prefer scoped_lock for multiple mutexes – deadlock-free by design