Concurrency & Multithreading
Modern C++ provides powerful built-in threading primitives. This is essential for high-performance systems, trading engines, and any application that needs to utilize multiple CPU cores.
Table of Contents
1 – Threads
2 – Mutexes
3 – Condition Variables
4 – Atomics
5 – Futures & Promises
6 – Thread-Safe Patterns
7 – Parallel Algorithms (C++17)
8 – Common Concurrency Bugs
Glossary – Key Terms at a Glance
| Term | Meaning |
|---|---|
| Thread | Independent execution path sharing the same address space |
| Mutex | Mutual exclusion – only one thread can hold it at a time |
| Data race | Two threads access the same data, at least one writes, no sync → UB |
| Deadlock | Two+ threads each waiting for the other's lock → infinite wait |
| Condition variable | Efficiently wait for a condition without busy-looping |
| Atomic | Operation that completes indivisibly – no partial reads/writes |
| Memory ordering | Controls how memory operations are seen across threads |
| Future/Promise | One-shot channel: promise sets a value, future retrieves it |
| jthread | C++20 auto-joining thread with cooperative cancellation |
| False sharing | Different threads write to the same cache line → performance killer |
1 – Threads
1.1 Creating & Joining Threads
#include <iostream>
#include <thread>
void work(int id) {
std::cout << "Thread " << id << " running\n";
}
int main() {
std::thread t1(work, 1); // launch thread with function + args
std::thread t2(work, 2);
t1.join(); // wait for t1 to finish
t2.join(); // wait for t2 to finish
// MUST join or detach before thread goes out of scope
// Otherwise: std::terminate() is called!
}
Threads with Lambdas:
int result = 0;
std::thread t([&result]() {
result = expensive_computation();
});
t.join();
std::cout << result;
Hardware Concurrency:
unsigned int cores = std::thread::hardware_concurrency();
// Returns number of logical cores (e.g., 8); may return 0 if undeterminable
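Putting the two ideas together, here is a minimal sketch of spawning one thread per logical core to split a workload. The helper name `parallel_sum` and the chunking scheme are assumptions for illustration, not a standard facility:

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: one thread per logical core, each summing its own chunk.
long long parallel_sum(const std::vector<int>& v) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;  // hardware_concurrency() may return 0 if unknown
    n = std::min<unsigned>(n, static_cast<unsigned>(v.empty() ? 1 : v.size()));

    std::vector<long long> partial(n, 0);   // one slot per thread: no sharing
    std::vector<std::thread> threads;
    std::size_t chunk = v.size() / n;
    for (unsigned i = 0; i < n; ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = (i == n - 1) ? v.size() : begin + chunk;
        threads.emplace_back([&v, &partial, i, begin, end] {
            partial[i] = std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();       // every thread joined before return
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each thread writes only its own `partial[i]`, so no mutex is needed; the `join` loop guarantees all writes are visible before the final accumulate.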
⚠️ Detached Threads: t.detach() runs the thread independently with no way to join later. Use rarely – hard to control lifetime. Prefer jthread (C++20).
1.2 jthread (C++20)
#include <thread>
void work(std::stop_token stoken) {
while (!stoken.stop_requested()) {
// do work
}
}
{
std::jthread t(work);
// ... do other stuff ...
} // t automatically joined here + stop requested
💡 Why jthread? It auto-joins in the destructor (no more std::terminate from forgetting to join) and supports cooperative cancellation via stop_token.
2 – Mutexes
2.1 lock_guard, unique_lock, scoped_lock
Definition: A mutex (mutual exclusion) ensures only one thread accesses a critical section at a time. Always use RAII wrappers – never raw .lock()/.unlock().
⚠️ Anti-Pattern – Raw lock/unlock:
std::mutex mtx;
mtx.lock();
++counter; // if exception thrown here → deadlock!
mtx.unlock();
std::lock_guard – RAII Mutex Lock (simplest, prefer this):
void increment(int times) {
for (int i = 0; i < times; ++i) {
std::lock_guard<std::mutex> lock(mtx); // locks on construction
++counter;
} // automatically unlocks when lock goes out of scope – EXCEPTION-SAFE
}
std::unique_lock – Flexible Locking:
void process() {
std::unique_lock<std::mutex> lock(mtx);
// ... do work with lock held ...
lock.unlock(); // can manually unlock
// ... do work without lock ...
lock.lock(); // can re-lock
} // automatically unlocks if still locked
// Can also defer locking:
std::unique_lock<std::mutex> lock(mtx, std::defer_lock);
lock.lock(); // lock when ready
std::scoped_lock (C++17) – Lock Multiple Mutexes (deadlock-free):
std::mutex mtx1, mtx2;
void transfer() {
std::scoped_lock lock(mtx1, mtx2); // locks both, deadlock-free
// ... work with both protected resources ...
}
Mutex Types Summary:
| Type | Description |
|---|---|
| std::mutex | Basic non-recursive mutex |
| std::recursive_mutex | Same thread can lock multiple times |
| std::timed_mutex | Has try_lock_for / try_lock_until |
| std::shared_mutex | Reader-writer lock (C++17) |
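The timed variant pairs naturally with std::unique_lock, which has a constructor that attempts try_lock_for internally. A small sketch (the function name and 50 ms budget are illustrative assumptions):

```cpp
#include <chrono>
#include <mutex>

std::timed_mutex tm;

// Hypothetical: try to do work, but back off instead of blocking forever.
bool try_do_work() {
    // This unique_lock constructor calls tm.try_lock_for(50ms) for us.
    std::unique_lock<std::timed_mutex> lk(tm, std::chrono::milliseconds(50));
    if (!lk.owns_lock()) return false;  // contended: gave up after 50 ms
    // ... critical section; RAII releases the lock on return ...
    return true;
}
```

This keeps the RAII discipline from above while still bounding how long a thread can be stuck waiting.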
2.2 Reader-Writer Lock (C++17)
#include <shared_mutex>
std::shared_mutex rw_mtx;
std::map<std::string, int> cache;
int read(const std::string& key) {
std::shared_lock lock(rw_mtx); // multiple readers OK
return cache.at(key);
}
void write(const std::string& key, int value) {
std::unique_lock lock(rw_mtx); // exclusive access
cache[key] = value;
}
💡 Why Reader-Writer Lock? If reads vastly outnumber writes, shared_mutex allows concurrent readers while still ensuring exclusive write access. Much better throughput than a plain mutex.
3 – Condition Variables
3.1 Producer-Consumer Pattern
Definition: A condition variable lets a thread efficiently wait for a condition to become true, without busy-waiting (spinning).
#include <condition_variable>
#include <queue>
std::mutex mtx;
std::condition_variable cv;
std::queue<int> queue;
bool done = false;
// Producer
void producer() {
for (int i = 0; i < 100; ++i) {
{
std::lock_guard lock(mtx);
queue.push(i);
}
cv.notify_one(); // wake one waiting consumer
}
{
std::lock_guard lock(mtx);
done = true;
}
cv.notify_all(); // wake all – production done
}
// Consumer
void consumer() {
while (true) {
std::unique_lock lock(mtx);
cv.wait(lock, []{ return !queue.empty() || done; });
// Atomically: releases lock, sleeps until notified, re-acquires lock
// Lambda prevents spurious wakeups
if (queue.empty() && done) break;
int item = queue.front();
queue.pop();
lock.unlock();
process(item);
}
}
💡 Why the Lambda in cv.wait? Condition variables can wake up spuriously (without actually being notified). The predicate lambda ensures we only proceed when the condition is truly met.
4 – Atomics
4.1 Atomic Operations & Memory Ordering
Definition: An atomic operation completes indivisibly – no other thread can see a partial state. For simple shared variables, atomics are faster than mutexes.
#include <atomic>
std::atomic<int> counter{0};
void increment(int times) {
for (int i = 0; i < times; ++i) {
counter.fetch_add(1, std::memory_order_relaxed);
// Or simply: ++counter; (uses seq_cst by default)
}
}
// Atomic operations
counter.load(); // read
counter.store(42); // write
counter.exchange(10); // swap, return old value
counter.fetch_add(5); // add, return old value
counter.fetch_sub(3); // subtract, return old value
// Compare-and-swap (CAS) – foundation of lock-free algorithms
int expected = 10;
bool success = counter.compare_exchange_strong(expected, 20);
// If counter == 10: sets to 20, returns true
// If counter != 10: sets expected to current value, returns false
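The classic way to use CAS is a retry loop: because a failed compare_exchange reloads the current value into `expected`, the loop simply re-checks and tries again. A sketch of a lock-free "record the maximum" update (the names `max_seen` and `record` are assumptions for illustration):

```cpp
#include <atomic>

std::atomic<int> max_seen{0};

// Lock-free max update via a CAS retry loop.
void record(int value) {
    int current = max_seen.load(std::memory_order_relaxed);
    while (value > current &&
           !max_seen.compare_exchange_weak(current, value)) {
        // CAS failed: another thread changed max_seen. `current` now holds
        // the fresh value, so the loop condition re-checks before retrying.
    }
}
```

compare_exchange_weak may fail spuriously on some architectures, which is harmless inside a loop like this; use the _strong form when a single non-looping attempt must be definitive.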
| Order | Guarantees | Performance |
|---|---|---|
| memory_order_relaxed | No ordering, just atomicity | Fastest |
| memory_order_acquire | Reads after this see writes from release | Medium |
| memory_order_release | Writes before this are visible to acquire | Medium |
| memory_order_acq_rel | Both acquire and release | Medium |
| memory_order_seq_cst | Total order across all threads | Slowest (default) |
Producer-Consumer with Acquire-Release:
std::atomic<bool> ready{false};
int data = 0;
// Thread 1 (producer)
data = 42;
ready.store(true, std::memory_order_release);
// Everything before store is visible to thread that loads with acquire
// Thread 2 (consumer)
while (!ready.load(std::memory_order_acquire)) { /* spin */ }
assert(data == 42); // guaranteed to see 42
4.2 Spinlock with atomic_flag
class SpinLock {
std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
void lock() {
while (flag.test_and_set(std::memory_order_acquire)) {
// spin – busy wait
}
}
void unlock() {
flag.clear(std::memory_order_release);
}
};
💡 When to use a spinlock: Only when the critical section is extremely short (nanoseconds) and you can't afford the overhead of a kernel mutex. Common in low-latency trading systems.
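Because SpinLock exposes lock()/unlock(), it satisfies the BasicLockable requirement, so the RAII wrappers from section 2 work with it unchanged. A self-contained sketch repeating the class (the demo names `hammer` and `run_hammer_demo` are assumptions):

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

SpinLock spin;
long counter = 0;   // protected by `spin`

// Same RAII pattern as with std::mutex — SpinLock is BasicLockable.
void hammer(int times) {
    for (int i = 0; i < times; ++i) {
        std::lock_guard<SpinLock> lock(spin);
        ++counter;
    }
}

long run_hammer_demo() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(hammer, 10'000);
    for (auto& t : threads) t.join();
    return counter;   // 4 threads x 10'000 increments
}
```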
5 – Futures & Promises
5.1 async, promise, packaged_task
std::async – simplest async execution:
#include <future>
int compute() {
std::this_thread::sleep_for(std::chrono::seconds(2));
return 42;
}
auto future = std::async(std::launch::async, compute);
// Launches in separate thread
// Do other work while compute runs...
int result = future.get(); // blocks until result ready – returns 42
Launch Policies:
auto f1 = std::async(std::launch::async, task); // definitely new thread
auto f2 = std::async(std::launch::deferred, task); // lazy – runs on .get()
auto f3 = std::async(std::launch::async | std::launch::deferred, task); // impl decides
std::promise and std::future – manual channel:
std::promise<int> p;
std::future<int> f = p.get_future();
std::thread t([&p]() {
int result = heavy_computation();
p.set_value(result); // fulfills the promise
});
int result = f.get(); // blocks until value is set
t.join();
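A promise can carry an exception instead of a value: set_exception stores it, and the waiting thread's get() rethrows it. A sketch with a simulated failure (the function name and error message are illustrative assumptions):

```cpp
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical: worker fails, main thread observes the exception via get().
std::string risky_call() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    std::thread t([&p] {
        try {
            throw std::runtime_error("device unavailable");  // simulated failure
        } catch (...) {
            p.set_exception(std::current_exception());       // forward it
        }
    });
    std::string msg;
    try {
        f.get();                              // rethrows the stored exception
    } catch (const std::runtime_error& e) {
        msg = e.what();
    }
    t.join();
    return msg;
}
```

std::async does this automatically: an exception escaping the task is stored in the future and rethrown on get().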
std::packaged_task – wraps a callable with a future:
std::packaged_task<int(int, int)> task([](int a, int b) { return a + b; });
auto future = task.get_future();
std::thread t(std::move(task), 3, 4);
int result = future.get(); // 7
t.join();
6 – Thread-Safe Patterns
6.1 Singleton, Thread-Safe Queue, call_once
Thread-Safe Singleton (C++11 guaranteed – "magic statics"):
class Singleton {
public:
static Singleton& instance() {
static Singleton inst; // thread-safe since C++11
return inst;
}
Singleton(const Singleton&) = delete;
Singleton& operator=(const Singleton&) = delete;
private:
Singleton() = default;
};
Thread-Safe Queue:
template<typename T>
class ThreadSafeQueue {
std::queue<T> queue;
mutable std::mutex mtx;
std::condition_variable cv;
public:
void push(T value) {
{
std::lock_guard lock(mtx);
queue.push(std::move(value));
}
cv.notify_one();
}
T pop() {
std::unique_lock lock(mtx);
cv.wait(lock, [this]{ return !queue.empty(); });
T value = std::move(queue.front());
queue.pop();
return value;
}
bool try_pop(T& value) {
std::lock_guard lock(mtx);
if (queue.empty()) return false;
value = std::move(queue.front());
queue.pop();
return true;
}
};
💡 Why mutable on the mutex? So that const member functions (like a hypothetical empty()) can still lock the mutex. The mutex protects shared state but isn't part of the logical const-ness.
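Here is the queue in action with one producer and one consumer. The block repeats a trimmed copy of the class (push/pop only) so it is self-contained; the sentinel value -1 and the `queue_demo` name are assumptions for illustration:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template<typename T>
class ThreadSafeQueue {
    std::queue<T> queue;
    std::mutex mtx;
    std::condition_variable cv;
public:
    void push(T value) {
        { std::lock_guard<std::mutex> lock(mtx); queue.push(std::move(value)); }
        cv.notify_one();   // notify outside the lock: waiter can run immediately
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this]{ return !queue.empty(); });
        T value = std::move(queue.front());
        queue.pop();
        return value;
    }
};

// One producer, one consumer; -1 is a sentinel meaning "no more items".
long queue_demo() {
    ThreadSafeQueue<int> q;
    std::thread producer([&q] {
        for (int i = 1; i <= 100; ++i) q.push(i);
        q.push(-1);
    });
    long sum = 0;
    for (int item = q.pop(); item != -1; item = q.pop()) sum += item;
    producer.join();
    return sum;   // 1 + 2 + ... + 100
}
```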
std::call_once – execute initialization exactly once:
std::once_flag init_flag;
void initialize() {
std::call_once(init_flag, []() {
// This runs exactly once, thread-safe
global_resource = expensive_init();
});
}
7 – Parallel Algorithms (C++17)
7.1 Execution Policies
#include <algorithm>
#include <execution>
#include <numeric>   // std::reduce, std::transform_reduce
std::vector<int> v(10'000'000);
// Sequential (default)
std::sort(v.begin(), v.end());
// Parallel
std::sort(std::execution::par, v.begin(), v.end());
// Parallel + vectorized
std::sort(std::execution::par_unseq, v.begin(), v.end());
// Other parallel algorithms
std::for_each(std::execution::par, v.begin(), v.end(),
[](int& x) { x *= 2; });
std::reduce(std::execution::par, v.begin(), v.end(), 0); // parallel sum
std::transform_reduce(std::execution::par, v.begin(), v.end(),
0, std::plus{}, [](int x) { return x * x; }); // parallel sum of squares
Tip: Parallel algorithms are only worth it for large data sets. For small containers, the thread creation overhead outweighs the parallel speedup.
8 – Common Concurrency Bugs
8.1 Deadlock, Data Race, False Sharing
⚠️ Deadlock:
// Thread 1: lock A, then lock B
// Thread 2: lock B, then lock A
// → DEADLOCK – each waits for the other
Solution: Always lock in the same order, or use std::scoped_lock:
std::scoped_lock lock(mtx_a, mtx_b); // deadlock-free
⚠️ Data Race:
int x = 0;
// Thread 1: x = 1;
// Thread 2: std::cout << x;
// → DATA RACE (undefined behavior)
Solution: Mutex or std::atomic<int> x{0};
⚠️ False Sharing: Two threads write to different variables that sit on the same cache line:
// BAD: counters[0] and counters[1] share a cache line
std::atomic<int> counters[NUM_THREADS];
// GOOD: pad each counter to its own cache line
struct alignas(64) PaddedCounter {
std::atomic<int> count{0};
};
PaddedCounter counters[NUM_THREADS];
💡 Why this matters: Even though the variables are independent, they share a 64-byte cache line. Every write invalidates the other thread's cached copy, causing constant cache misses (up to 100x slower).
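The padding can be verified at compile time. A small sketch; 64 bytes is a common line size (where supported, std::hardware_destructive_interference_size in <new> reports the target's actual value):

```cpp
#include <atomic>
#include <cstddef>

struct alignas(64) PaddedCounter {
    std::atomic<int> count{0};
};

// alignas(64) forces both the alignment and (via tail padding) the size,
// so consecutive array elements land on distinct cache lines.
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");
static_assert(alignof(PaddedCounter) == 64, "aligned to a line boundary");
```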
Practice Questions
Q1. What happens if you forget to join() or detach() a std::thread before it goes out of scope? How does jthread solve this?
Q2. Explain the difference between lock_guard, unique_lock, and scoped_lock. When would you use each?
Q3. Write a producer-consumer program using condition variables with proper spurious wakeup handling.
Q4. What is the difference between memory_order_relaxed and memory_order_seq_cst? When is relaxed ordering safe to use?
Q5. Implement a simple spinlock using std::atomic_flag. When is it better than a mutex?
Q6. Explain the difference between std::async, std::promise/std::future, and std::packaged_task.
Q7. Write a thread-safe counter class using std::atomic. Compare its performance to a mutex-based version.
Q8. What is false sharing? How do you detect and fix it?
Q9. What guarantees does C++11 provide for initialization of local static variables (magic statics)?
Q10. Design a thread pool that accepts tasks and distributes them across a fixed number of worker threads.
Key Takeaways
- Use std::lock_guard / std::scoped_lock – never raw .lock()/.unlock()
- Use std::atomic for simple shared counters – faster than a mutex
- Use std::async for fire-and-forget parallelism
- Avoid std::recursive_mutex – usually indicates a design problem
- Use std::jthread (C++20) – auto-joins, supports cancellation
- Beware of false sharing – pad atomic variables to cache line boundaries
- Test with ThreadSanitizer: g++ -fsanitize=thread
- Prefer scoped_lock for multiple mutexes – deadlock-free by design