Concurrency & threads
The 2025–26 frontier-lab differentiator. Anthropic, xAI, and OpenAI all probe it explicitly. This page teaches a GIL deep-dive, asyncio internals, lock-free patterns, and a first pass at memory models — densely enough that one read covers an interview round.
What you'll learn
- Threads vs processes vs coroutines — pick the right tool
- Memory models — what the CPU and compiler don't promise
- Locks — the menu and when each fits
- Deadlock — Coffman's four conditions and how to break them
- Producer-consumer — the canonical pattern (with code)
- Python GIL — what it does, what it doesn't, free-threaded Python
- asyncio internals — the event loop and its sharp edges
- Multiprocessing & shared memory — true parallelism in Python
- Compare-and-swap, ABA, lock-free patterns
- RCU and hazard pointers — advanced reclamation
- Concurrency LeetCodes — the labs ask these
- What companies actually probe
Three concurrency primitives, three cost profiles. Coroutines win for I/O at scale (10k+ connections). Threads win for mixed CPU+I/O work in non-GIL languages or with C extensions. Processes win for true CPU parallelism in Python. Picking wrong = an order-of-magnitude perf hit.
| | Thread | Process | Coroutine (asyncio) |
|---|---|---|---|
| Memory | Shared with peers | Isolated | Shared (single thread) |
| Switch cost | μs (kernel) | μs (kernel; spawning costs ms) | ~ns (user) |
| Use case | I/O concurrency in non-GIL langs; mixed CPU/IO | CPU parallelism in Python; isolation | Massive I/O concurrency (10k+ conns) |
| Hazards | Races, deadlock | IPC overhead | Single blocking call kills throughput |
When to reach for threads
- Mixed CPU + I/O
- C extensions release the GIL (NumPy, PyTorch)
- Non-Python languages (Java, Go, C++)
When to reach for asyncio
- Lots of network I/O, low CPU
- 10k+ concurrent connections
- Web servers, scrapers, RPC clients
- Coroutines are ~1000× cheaper to switch than threads.
- Threads share memory (races possible); processes don't (need IPC).
- One sync blocking call inside an asyncio loop kills the whole party.
Modern CPUs and compilers reorder reads and writes for performance. Without explicit synchronization, what one thread sees from another can differ wildly from program order. Sequential consistency is the mental model you wish was true; in practice you write to x86 (TSO) or ARM (weak) and use atomics + memory orders to force the guarantees you need.
The four consistency points to know
- Sequential consistency: every operation appears in a global total order; each thread's order matches its program order. The mental model you wish was true. Almost no real hardware provides it for free.
- x86 (TSO): total store order. Stores are not reordered with stores; loads are not reordered with loads; stores can be delayed past later loads (store buffer). Strong-ish, but the store-buffer rule trips people up.
- ARM / POWER (weak): most reorderings allowed. Need explicit barriers (DMB, ISB).
- Happens-before: partial order across threads enforced by synchronization (lock acquire/release, atomic with appropriate ordering, thread join).
C++ memory orders (memorize)
| Order | What it guarantees |
|---|---|
| `memory_order_seq_cst` (default) | Single total order across all SC operations. |
| `memory_order_acquire` (load) | Subsequent reads/writes don't reorder before this load. Pairs with release. |
| `memory_order_release` (store) | Prior reads/writes don't reorder after this store. Pairs with acquire. |
| `memory_order_acq_rel` (RMW) | Both — for compare-exchange, fetch-add, etc. |
| `memory_order_relaxed` | Atomicity only, no ordering with other ops. Cheapest. Use for stat counters. |
| `memory_order_consume` | Rarely used. Treated as acquire by most compilers. |
SPSC ring needs only acquire/release on the head/tail pointers — the data writes themselves can use relaxed because the index publication establishes happens-before. This is the canonical Anthropic concurrency probe.
- Without atomics + memory orders, the CPU and compiler reorder freely.
- x86 ≠ ARM. Don't ship code that "works on my Mac" without testing on ARM.
- Acquire/release pairs establish happens-before; `relaxed` is for unordered counters.
Mutex by default. Spinlock only if the critical section is microseconds. RWLock if reads dominate writes. Lock-free if you're willing to spend 10× the eng time for higher throughput. Recursive lock = a code smell signaling tangled control flow.
| Lock | What it does | When to use |
|---|---|---|
| Mutex | Blocks waiters; OS-managed | Default for any non-trivial critical section |
| Spinlock | Busy-wait until acquired | Critical section < ~1μs and contention rare |
| RWLock | Many readers OR one writer | Read-heavy workload (caches, config). Hurts under writes. |
| Recursive lock | Same thread can re-acquire | Almost never — usually a sign the code is tangled |
| Lock-free | At least one thread always progresses | Hot data structures (queues, hashmaps) where contention is the bottleneck |
| Wait-free | Every thread progresses in bounded steps | Real-time / safety-critical. Rare due to complexity. |
- Default to mutex. Promote only when you have data showing it's the bottleneck.
- Spinlock = bet that contention is rare AND critical sections are tiny.
- Lock-free ≠ free. Wait-free is a strictly stronger guarantee almost no one needs.
All four conditions must hold for deadlock to occur. Break any one to prevent it. The cheapest fix in production: global lock ordering — always acquire locks in the same canonical order across the codebase.
- Mutual exclusion — only one thread holds the resource.
- Hold and wait — a thread holds one resource while waiting for another.
- No preemption — resources can't be forcibly released.
- Circular wait — a cycle exists in the resource-wait graph.
Practical deadlock-prevention recipes:
- Lock ordering: enforce a global order (e.g., always acquire `account_a` before `account_b` when `id(a) < id(b)`; see the sketch after this list). Breaks circular wait.
- Try-lock with backoff: if you can't get all needed locks, release what you have and retry. Breaks hold-and-wait.
- Single coarse lock: when in doubt and not perf-critical, use one lock. Eliminates circular wait by construction.
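A minimal Python sketch of the lock-ordering recipe; `Account` and `transfer` are illustrative names, not from any particular codebase:

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(a, b, amount):
    # Acquire both locks in a canonical global order (here: by id()).
    # Every thread that needs both locks uses the same order, so no cycle
    # can form in the wait graph and circular wait is impossible.
    first, second = (a, b) if id(a) < id(b) else (b, a)
    with first.lock:
        with second.lock:
            a.balance -= amount
            b.balance += amount
```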
- All four Coffman conditions required → break any one to prevent deadlock.
- Global lock ordering is the cheapest fix that scales to large codebases.
- Livelock (threads constantly retrying without progress) is harder to detect than deadlock.
Mutex + two condition variables (not_full, not_empty) + a queue + capacity. Always wait in a while loop (spurious wakeups). The canonical multi-thread interview drill — implement it cold in your sleep.
Python — using queue.Queue
import threading, queue
q = queue.Queue(maxsize=10) # thread-safe, blocking on full/empty
def producer():
for i in range(100):
q.put(i) # blocks if full
print(f"produced {i}")
def consumer():
while True:
item = q.get() # blocks if empty
print(f"consumed {item}")
q.task_done()
producer_t = threading.Thread(target=producer)
producer_t.start()
threading.Thread(target=consumer, daemon=True).start()
producer_t.join()  # wait for all puts first, so q.join() can't return early
q.join()           # then block until every item has been marked task_done
C++ — explicit condition variables
#include <mutex>
#include <condition_variable>
#include <queue>
template<typename T>
class BoundedQueue {
std::mutex m;
std::condition_variable not_full, not_empty;
std::queue<T> q;
size_t cap;
public:
BoundedQueue(size_t c) : cap(c) {}
void put(T x) {
std::unique_lock lk(m);
not_full.wait(lk, [&]{ return q.size() < cap; });
q.push(std::move(x));
not_empty.notify_one();
}
T take() {
std::unique_lock lk(m);
not_empty.wait(lk, [&]{ return !q.empty(); });
T x = std::move(q.front()); q.pop();
not_full.notify_one();
return x;
}
};
The above already works for N producers and M consumers — the mutex serializes access, and `notify_one` wakes a single waiter. Use `notify_all` if your predicate change might unblock multiple waiters (e.g., adding K items at once instead of one).
The if-vs-while bug: call `cv.wait` in a `while (!predicate())` loop, never under a plain `if`. Spurious wakeups are real (POSIX permits them). With `if`, your code can wake with the predicate still false and proceed in an invalid state.
- Mutex + 2 CVs + while-predicate-wait. That's the pattern.
- `notify_one` when one waiter can make progress; `notify_all` when many might.
- Python: just use `queue.Queue` unless you need custom predicates.
The GIL serializes Python bytecode execution: only one thread runs Python at a time. It's released during I/O syscalls and in C extensions that ask for release (NumPy, PyTorch CUDA). It does NOT help CPU-bound pure Python — for that, use multiprocessing or free-threaded Python (3.13t, PEP 703).
It's not "Python isn't multi-threaded" — it's a specific lock with specific release rules
- Released during I/O: `read()`, `send()`, `sleep()`, file I/O, network I/O — threading helps for I/O-bound work.
- Released in C extensions that ask: NumPy, PyTorch CUDA, lots of native libraries — threading helps for those too.
- Doesn't help CPU-bound pure Python: `for i in range(N): x += 1` is single-thread serialized. Use multiprocessing.
- Free-threaded Python (3.13t, 2024): experimental GIL-disabled build. ~10–30% single-thread overhead, but true parallel Python on multiple cores. Not yet the default; library compatibility is growing fast (NumPy, PyTorch, Pandas already work).
- Subinterpreters (3.12+): each interpreter has its own GIL; they can run truly in parallel within one process. Communication via shared memory.
What's atomic in Python?
Single bytecode ops are atomic (e.g., `list.append`, `dict[k] = v`). But `x += 1` is not — it's load + add + store, three separate bytecodes. Two threads can both load, both add, both store — losing one increment.
`counter += 1` from N threads still loses updates without a lock. A demo follows.
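A quick demo of the lost update. How often it bites depends on the CPython version and switch interval; shrinking the interval makes mid-increment switches likelier:

```python
import sys
import threading

sys.setswitchinterval(1e-6)   # encourage thread switches mid-increment

counter = 0

def work(n):
    global counter
    for _ in range(n):
        counter += 1          # LOAD + ADD + STORE; a switch in between loses an update

threads = [threading.Thread(target=work, args=(200_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                # often less than 1_600_000 when run without a lock
```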
- GIL = serializes bytecode, not the world.
- Released during I/O syscalls and in cooperating C extensions.
- For CPU-bound Python: `multiprocessing` or free-threaded 3.13t.
- Read-modify-write at the Python level still races without a lock.
Single-threaded event loop. await yields control back to the loop. asyncio.gather runs coroutines concurrently. The fatal failure mode: a single sync blocking call freezes the entire loop and every other coroutine.
The mental model
- Tasks wrap coroutines; scheduled on the loop.
- Futures are awaitables; resolved by external code.
- The event loop uses `selectors` (epoll on Linux, kqueue on macOS) under the hood for I/O readiness.
- Cancellation: `task.cancel()` raises `CancelledError` at the next `await` point.
import asyncio
async def fetch(url):
print(f"start {url}")
await asyncio.sleep(1) # simulated I/O — yields control to loop
print(f"done {url}")
return url
urls = ["https://a.example", "https://b.example"]  # demo inputs
async def main():
    results = await asyncio.gather(*(fetch(u) for u in urls))  # run all fetches concurrently
    return results
asyncio.run(main())
Why a sync call freezes every other coroutine
The loop is single-threaded. While it executes a sync function, no other coroutine runs. One `time.sleep(1)`, sync DB query, or `requests.get()` freezes the entire loop for 1 second — every other concurrent task pauses too.
Fix: use async equivalents (`await asyncio.sleep`, async DB drivers like asyncpg, `aiohttp` instead of `requests`); or wrap blocking calls in `asyncio.to_thread()` / `loop.run_in_executor()`.
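A small sketch of the `asyncio.to_thread` escape hatch; the `blocking_io` and `heartbeat` helpers are illustrative:

```python
import asyncio
import time

def blocking_io():
    time.sleep(1)             # sync sleep: would freeze the loop if called inline
    return "io done"

async def heartbeat():
    for _ in range(4):
        print("loop still responsive")
        await asyncio.sleep(0.25)

async def main():
    # to_thread runs the blocking call in a worker thread, so the event
    # loop keeps scheduling heartbeat() while the sync sleep is in flight.
    result, _ = await asyncio.gather(asyncio.to_thread(blocking_io), heartbeat())
    print(result)

asyncio.run(main())
```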
- Event loop = single thread + selector (epoll/kqueue) + cooperative scheduling.
- Every `await` is a yield point. Anything between two awaits runs uninterrupted.
- Sync blocking calls inside async = catastrophic. Use `to_thread()` if you must.
Multiprocessing gives you actual parallelism in Python by spawning OS processes. The bottleneck is data transfer — pickle round-trips can dwarf the compute. Use multiprocessing.shared_memory for zero-copy NumPy sharing.
from multiprocessing import shared_memory
import numpy as np
# Producer
shm = shared_memory.SharedMemory(create=True, size=4 * 1024 * 1024)
arr = np.ndarray((1024, 1024), dtype=np.float32, buffer=shm.buf)
arr[:] = np.random.randn(1024, 1024)
print(shm.name) # pass to consumer
# Consumer (different process)
shm = shared_memory.SharedMemory(name="...")
arr = np.ndarray((1024, 1024), dtype=np.float32, buffer=shm.buf)
# zero-copy access; no pickle round-trip
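shm.close()   # each process closes its own mapping when finished
# the creating process also calls shm.unlink() to release the segment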
Multiprocessing wins for pure-Python computation on independent chunks of data: image preprocessing, simulations, hyperparameter sweeps. It loses when data transfer dominates — e.g., transferring a 1GB array to children that do 10ms of work each.
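A minimal `Pool` sketch of the winning case; the `score` function and chunk sizes are illustrative:

```python
from multiprocessing import Pool

def score(chunk):                    # pure-Python CPU work, no shared state
    return sum(x * x for x in chunk)

if __name__ == "__main__":           # guard required under the spawn start method
    chunks = [range(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]
    with Pool(processes=4) as pool:
        # each small chunk is pickled once; compute dominates transfer
        print(sum(pool.map(score, chunks)))
```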
- Multiprocessing = real parallel Python by spawning OS processes.
- Pickle is the hidden cost. Use `shared_memory` for large arrays.
- Consider Ray / Dask / joblib for production work — they handle the boilerplate.
`CAS(addr, expected, new)` atomically sets a value if it matches `expected`. The foundation of lock-free programming. The classic trap is the ABA problem — a value can change A → B → A while you're preempted, and your CAS will succeed on stale assumptions.
CAS — the building block
`CAS(addr, expected, new)`: atomically set `*addr = new` if `*addr == expected`. Returns success/failure. Hardware provides this via CMPXCHG (x86) or LDREX/STREX (ARM). Foundation of lock-free queues, hashmaps, ref counts.
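CPython exposes no user-level CAS, but the semantics fit in a few lines. A model only: the lock below stands in for the hardware's single-instruction atomicity, and the retry loop is the canonical lock-free idiom:

```python
import threading

class AtomicCell:
    """Models hardware CAS; the lock plays the role of the CPU's atomicity."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:             # hardware does this in one instruction (CMPXCHG)
            if self.value == expected:
                self.value = new
                return True
            return False

def increment(cell):
    # The canonical lock-free retry loop: read, compute, CAS, retry on conflict.
    while True:
        cur = cell.value
        if cell.compare_and_swap(cur, cur + 1):
            return
```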
The ABA problem
Thread T1 reads A, gets preempted. Thread T2 changes A → B → A. T1 resumes, runs CAS expecting A — it succeeds, but the world changed underneath. T1's logic may now be invalid (e.g., the node it pointed to was freed and the address reused).
Mitigations:
- Tagged pointers: pointer + version counter (e.g., 48-bit ptr + 16-bit version). CAS the whole 64-bit word; see the sketch after this list.
- Hazard pointers: each thread publishes "I'm using ptr X." Memory not freed until no thread declares hazard.
- RCU: defer freeing until all readers that were active before the write have finished.
- Garbage collection: in GC'd languages, the GC won't reuse addresses while references exist.
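Extending the `AtomicCell` model above with a version tag: an A → B → A run bumps the version, so a stale CAS fails. A sketch, not hardware atomics:

```python
import threading

class TaggedCell:
    """(value, version) pair; CAS compares both, so A->B->A fails a stale CAS."""
    def __init__(self, value):
        self.pair = (value, 0)
        self._lock = threading.Lock()    # stands in for a 64-bit hardware CAS

    def compare_and_swap(self, expected_pair, new_value):
        with self._lock:
            if self.pair == expected_pair:
                self.pair = (new_value, expected_pair[1] + 1)  # bump version
                return True
            return False

cell = TaggedCell("A")
stale = cell.pair                           # ("A", 0)
cell.compare_and_swap(cell.pair, "B")       # A -> B, version 1
cell.compare_and_swap(cell.pair, "A")       # B -> A, version 2
print(cell.compare_and_swap(stale, "C"))    # False: value matches, version doesn't
```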
- CAS is the atomic primitive lock-free programming is built on.
- ABA is the most common lock-free bug — assume it can happen unless you've defended against it.
- Tagged pointers, hazard pointers, RCU, or GC each defeat ABA differently.
Lock-free needs a memory-reclamation strategy: when can old objects be freed if some thread might still be using them? RCU defers freeing until a "grace period" passes; hazard pointers track per-thread "in use" markers. Both prevent use-after-free in lock-free structures.
RCU (Read-Copy-Update)
Readers access shared data without locking. Writers create new copies and atomically swap pointers. Old copies freed only after all readers that started before the swap have finished (a "grace period"). Used heavily in the Linux kernel for read-heavy data (routing tables, file system mounts).
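The reclamation half is what makes RCU hard in C; in a GC'd language it reduces to copy-plus-atomic-swap, with the GC playing the grace period. A Python sketch (names are illustrative):

```python
import threading

config = {"timeout": 30, "retries": 3}   # readers read this name with no lock

def reader():
    snap = config            # grab one reference; the snapshot stays valid
    return snap["timeout"]   # even if a writer rebinds `config` meanwhile

_write_lock = threading.Lock()           # writers serialize among themselves

def update(key, value):
    global config
    with _write_lock:
        new = dict(config)   # copy
        new[key] = value     # update
        config = new         # atomic rebind (the "pointer swap"); GC frees the
                             # old dict once the last reader drops its reference
```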
Hazard pointers
Per-thread pointers marking objects "in use." Memory reclamation defers freeing until no thread declares a hazard pointer to that object. Solves ABA and use-after-free in lock-free structures. More memory-efficient than RCU; more complex bookkeeping.
Who uses what: the Linux kernel uses RCU heavily (read-heavy data structures); Folly (Facebook's C++ library) uses hazard pointers (general-purpose lock-free); Java/Go lean on the garbage collector (no manual reclamation needed).
- RCU = grace period + atomic pointer swap. Cheap reads, expensive writes.
- Hazard pointers = per-thread "in use" markers. More flexible, more complex.
- GC'd languages get this for free — at the cost of pause times.
LeetCode has a small set of "concurrency" problems specifically designed to drill thread coordination patterns. Anthropic and OpenAI loops have specifically referenced these. Drill all 8.
- LC 1114 Print in Order
- LC 1115 Print FooBar Alternately
- LC 1116 Print Zero Even Odd
- LC 1117 Building H2O
- LC 1188 Design Bounded Blocking Queue
- LC 1195 Fizz Buzz Multithreaded
- LC 1226 The Dining Philosophers
- LC 1242 Web Crawler Multithreaded
- Drill 1188 (Bounded Blocking Queue) and 1242 (Web Crawler) until you can do them in your sleep.
- The patterns repeat: semaphores, condition variables, atomic flags (see the semaphore hand-off sketched below).
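For a taste of the semaphore pattern, a sketch of LC 1115 (Print FooBar Alternately); LeetCode's real interface uses camelCase method names:

```python
import threading

class FooBar:
    def __init__(self, n):
        self.n = n
        self.foo_turn = threading.Semaphore(1)   # foo may go first
        self.bar_turn = threading.Semaphore(0)   # bar must wait its turn

    def foo(self, print_foo):
        for _ in range(self.n):
            self.foo_turn.acquire()   # wait for the baton
            print_foo()
            self.bar_turn.release()   # hand the baton to bar

    def bar(self, print_bar):
        for _ in range(self.n):
            self.bar_turn.acquire()
            print_bar()
            self.foo_turn.release()

fb = FooBar(3)
t1 = threading.Thread(target=fb.foo, args=(lambda: print("foo", end=""),))
t2 = threading.Thread(target=fb.bar, args=(lambda: print("bar"),))
t1.start(); t2.start(); t1.join(); t2.join()   # prints "foobar" three times
```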
Concurrency questions cluster by company. Anthropic asks crawlers + thread-safe LRUs. OpenAI asks per-key locks. xAI says "simple and correct beats clever and broken." Stripe probes idempotency. Know the per-company patterns going in.
| Company | What they probe |
|---|---|
| Anthropic | Multithreaded web crawler; thread-safe LRU; race conditions in shared cache; rate limiter under contention |
| OpenAI | Time-based KV store with per-key locks (sketch below); multithreaded crawler; toy interpreter (lex/parse/exec) |
| xAI | "Simple and correct beats clever and broken" — implement thread-safe queue/cache; CUDA kernel correctness |
| Cohere | Vector-DB upsert race; concurrent batch inference |
| Stripe | Idempotency under concurrent retries; webhook signature validator |
| Nvidia | Producer-consumer ring buffer; parallel matrix multiplication |
- Crawlers + thread-safe data structures dominate the canon.
- If you're asked "lock-free" — clarify whether they want lock-free or "without obvious races" first.
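A sketch of the per-key-locks idea from the table above; the shape is an assumption, not any company's rubric:

```python
import threading

class PerKeyLocks:
    """One lock per key so writes to different keys don't contend."""
    def __init__(self):
        self._meta = threading.Lock()    # guards the lock table itself
        self._locks = {}

    def lock_for(self, key):
        with self._meta:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]

store = {}                               # key -> list of (timestamp, value)
locks = PerKeyLocks()

def put(key, ts, value):
    # CPython's GIL keeps the dict itself consistent; the per-key lock
    # orders same-key writers without blocking writers to other keys.
    with locks.lock_for(key):
        store.setdefault(key, []).append((ts, value))
```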
0 → hero concurrency path
- foundation Operating Systems: Three Easy Pieces — Part II (Concurrency). Free.
- foundation Java Concurrency in Practice (Goetz) — even if you don't use Java; concepts transfer.
- build Drill the LeetCode concurrency problems (1114-1242 above)
- build Implement a thread-safe LRU in Python with a single lock; then with striped locks. Compare contention.
- build Build a multithreaded web crawler in Python (the Anthropic question); then in C++ with proper memory ordering.
- depth C++ Concurrency in Action (Williams) — Ch 5 on memory model is essential
- depth Preshing on Programming — best memory-model essays on the web
- depth Python GIL — RealPython
- depth PEP 703 — free-threaded Python
- depth Python asyncio docs + RealPython asyncio guide
- depth CMU lock-free lecture
- depth LWN — What every programmer should know about memory (Ulrich Drepper)
Concurrency quiz — readiness check
- Implement a thread-safe LRU.
Show answer
Hashmap + doubly-linked list, single mutex around both. For higher concurrency: striped locks (per-bucket mutex). Java's old ConcurrentHashMap used this. Newer designs use lock-free with hazard pointers but they're production-grade complex.
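A minimal single-lock sketch in Python (the baseline answer; striped and lock-free variants build on it):

```python
import threading
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()      # hashmap + doubly-linked list in one
        self.lock = threading.Lock()   # one mutex guards both structures

    def get(self, key):
        with self.lock:
            if key not in self.data:
                return None
            self.data.move_to_end(key)          # mark most-recently-used
            return self.data[key]

    def put(self, key, value):
        with self.lock:
            self.data[key] = value
            self.data.move_to_end(key)
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)   # evict least-recently-used
```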
- What does Python's GIL block exactly?
Show answer
Bytecode execution. Released during I/O syscalls and in C extensions that explicitly release it (NumPy, PyTorch CUDA). Doesn't help CPU-bound pure Python — use multiprocessing or PEP 703 free-threaded Python (3.13t).
- Diagnose a race in this code: `counter += 1` from N threads.
Show answer
Not atomic — load + add + store. Two threads can both load N, both store N+1, lose one increment. Fix: lock around the increment (`threading.Lock` + `with`), or an atomic counter (Python: `itertools.count()`; C++: `std::atomic<int>`).
- Explain memory_order_acquire / release.
Show answer
Acquire load: prevents subsequent reads/writes from being reordered before this load. Release store: prevents prior reads/writes from being reordered after this store. Pair them to establish happens-before across threads (the SC-DRF model).
- Producer-consumer with bounded buffer.
Show answer
Mutex + two condition variables (not_full, not_empty) + a queue + capacity. put: acquire mutex, wait on not_full while size==cap, push, signal not_empty. take: acquire mutex, wait on not_empty while empty, pop, signal not_full. Both predicate-tested via while loop (spurious wakeups).
- What's the ABA problem?
Show answer
Thread reads value A; preempted; another thread changes A→B→A; first thread's CAS succeeds (sees A) but the world changed. Fix: tagged pointers (pointer + version counter), hazard pointers, RCU, garbage collection.
- Why striped lock instead of single mutex on a hashmap?
Show answer
Lock per bucket → reduces contention. Threads operating on different keys touch different mutexes, parallelize. Tradeoff: more memory, more complex resize. Java's old ConcurrentHashMap used 16 stripes by default.
- Implement a multithreaded web crawler.
Show answer
Thread pool of N workers. Shared thread-safe URL queue (deque + lock OR `queue.Queue`). Shared visited set (with lock). Per-host rate limiter. Termination: "no more URLs and no in-flight tasks." Use `concurrent.futures.ThreadPoolExecutor` in Python.
- asyncio vs threading vs multiprocessing — pick one and explain.
Show answer
asyncio: many concurrent I/O operations, single thread, cooperative scheduling. Threading: I/O-bound or C-extension-heavy work where GIL is released. Multiprocessing: pure-CPU parallelism in Python. Free-threaded Python (3.13t) blurs the line.
- Why does a single sync call kill an asyncio event loop?
Show answer
The loop is single-threaded; while it executes a sync function, no other coroutine runs. `time.sleep(1)`, a sync DB query, `requests.get()` — all freeze the loop. Use `await asyncio.sleep`, async DB drivers, `aiohttp`; or wrap blocking calls in `asyncio.to_thread()`.
- Coffman's four conditions — name them.
Show answer
(1) Mutual exclusion. (2) Hold and wait. (3) No preemption. (4) Circular wait. Break any one to prevent deadlock. Most common practical fix: lock ordering (always acquire in same global order).
- RWLock — when does it help vs hurt?
Show answer
Helps: read-heavy workloads with rare writes (caches). Many readers proceed in parallel. Hurts: write-heavy or balanced — RWLock has higher overhead than plain mutex due to read-counter management; writer can starve. Test under workload before assuming RWLock.
- Lock-free vs wait-free?
Show answer
Lock-free: at least one thread makes progress (system as a whole progresses). Wait-free: every thread makes progress in bounded steps (stronger). Most practical lock-free structures (Michael-Scott queue) are not wait-free; wait-free is rare due to complexity.
- What is RCU?
Show answer
Read-Copy-Update. Readers access shared data without locking. Writers create new copies and atomically swap pointers. Old copies freed only after all readers that started before the swap have finished. Used in Linux kernel for read-heavy data (routing tables, etc.).
- Why doesn't Python need `volatile` like Java/C++?
Show answer
Python's GIL serializes bytecode execution → all reads/writes are effectively atomic and visible at bytecode boundaries. Without the GIL (free-threaded 3.13t), this guarantee weakens; PEP 703 introduces atomic primitives. Java's `volatile` ensures visibility across threads; C++ uses `std::atomic`.
- What's the actor model?
Show answer
Concurrency primitive: actors are isolated entities with private state, communicate via async messages. No shared memory → no races. Used in Erlang, Akka. Tradeoffs: easier reasoning, but message-passing overhead and harder to handle some patterns (transactional updates).
- Implement double-checked locking correctly.
Show answer
Classic singleton init: `if (instance == null) { lock; if (instance == null) instance = new T(); unlock; }`. Subtle bug in Java pre-5: the `instance` write can be reordered before construction completes. Fix: declare `instance` volatile (Java) or use `std::atomic` with acquire/release (C++). Or just use static init (Java's class loader is thread-safe).
- What does `memory_order_relaxed` guarantee?
Show answer
Atomicity only — no ordering with other operations. Use for counters where order doesn't matter. Reads/writes can be reordered by compiler / CPU. Faster than seq_cst on weak-memory architectures (ARM).
- Why is `std::shared_ptr` thread-safe in some ways but not others?
Show answer
The reference count is atomic (thread-safe to copy/destroy shared_ptr from multiple threads). But the pointed-to object's contents are not protected — accessing the same object from multiple threads still needs a mutex. Common bug source.
- What's the difference between a condition variable's `wait` and a busy loop?
Show answer
Busy loop: the thread spins, holding the mutex or polling — wastes CPU. Condition variable: the thread releases the mutex, sleeps until `notify`, and atomically reacquires the mutex on wake. Always use a `while` loop around `wait` to handle spurious wakeups.