SYSTEMS PILLAR · RISING DIFFERENTIATOR

Concurrency & threads

The 2025–26 frontier-lab differentiator. Anthropic, xAI, and OpenAI all probe it explicitly. This page covers the GIL deep-dive, asyncio internals, lock-free patterns, and a first pass over memory models — densely enough that one read covers an interview round.

Read: ~30 min · Asked at: Anthropic, OpenAI, xAI, Cohere, Stripe, Nvidia · Difficulty: Sr / Staff bar
01
FOUNDATIONS · THE TOOL CHOICE

Threads vs processes vs coroutines — pick the right tool

TL;DR

Three concurrency primitives, three cost profiles. Coroutines win for I/O at scale (10k+ connections). Threads win for mixed CPU+I/O work in non-GIL languages or with C extensions. Processes win for true CPU parallelism in Python. Picking wrong = an order-of-magnitude perf hit.

  • Thread: memory shared with peers · switch cost ~μs (kernel) · use case: I/O concurrency in non-GIL langs, mixed CPU+I/O · hazards: races, deadlock
  • Process: memory isolated · switch cost ~μs (kernel; heavier than a thread switch, and creating a process costs ms) · use case: CPU parallelism in Python, isolation · hazard: IPC overhead
  • Coroutine (asyncio): memory shared (single thread) · switch cost ~ns (user space) · use case: massive I/O concurrency (10k+ conns) · hazard: a single sync blocking call kills throughput

When to reach for threads

  • Mixed CPU + I/O
  • C extensions release the GIL (NumPy, PyTorch)
  • Non-Python languages (Java, Go, C++)

When to reach for asyncio

  • Lots of network I/O, low CPU
  • 10k+ concurrent connections
  • Web servers, scrapers, RPC clients
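To make the choice concrete, here is a minimal Python sketch (fetch, blocking_io, and crunch are illustrative stand-ins, not from this page): the same batch of 100 tasks dispatched with the primitive that matches each workload.

import asyncio, time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

async def fetch(i):                  # network I/O: coroutines scale to 10k+ connections
    await asyncio.sleep(0.1)         # stand-in for an async HTTP call
    return i

def blocking_io(i):                  # blocking library call: threads (GIL released in the syscall)
    time.sleep(0.1)
    return i

def crunch(i):                       # pure-Python CPU work: processes give real parallelism
    return sum(range(10_000 + i))

async def io_with_asyncio():
    return await asyncio.gather(*(fetch(i) for i in range(100)))

def io_with_threads():
    with ThreadPoolExecutor(max_workers=32) as ex:
        return list(ex.map(blocking_io, range(100)))

def cpu_with_processes():
    with ProcessPoolExecutor() as ex:
        return list(ex.map(crunch, range(100)))

if __name__ == "__main__":           # guard required for process spawning on macOS/Windows
    asyncio.run(io_with_asyncio())
    io_with_threads()
    cpu_with_processes()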
PITFALL — the most common mis-pick
Reaching for threads when async is better, or async when threads are better. Rule of thumb: network I/O at scale → asyncio. Mixed CPU+I/O with C-extension number crunching → threads. Pure CPU parallelism in Python → multiprocessing or free-threaded Python (3.13t).
REMEMBER
  • Coroutines are ~1000× cheaper to switch than threads.
  • Threads share memory (races possible); processes don't (need IPC).
  • One sync blocking call inside an asyncio loop kills the whole party.
02
FOUNDATIONS · THE SR STAFF BAR

Memory models — what the CPU and compiler don't promise

TL;DR

Modern CPUs and compilers reorder reads and writes for performance. Without explicit synchronization, what one thread sees from another can differ wildly from program order. Sequential consistency is the mental model you wish were true; in practice you write for x86 (TSO) or ARM (weak) and use atomics + memory orders to force the guarantees you need.

The four consistency points to know

  1. Sequential consistency — the ideal: one global interleaving of all operations that every thread agrees on. No real hardware gives you this by default.
  2. x86-TSO — strong but not SC: a store can be reordered after a later load (store buffering).
  3. ARM / POWER weak ordering — loads and stores reorder aggressively; you get only what barriers and atomics demand.
  4. SC-DRF — the contract you actually program to: if the program is data-race-free (all conflicting accesses synchronized via locks/atomics), it behaves as if sequentially consistent.

C++ memory orders (memorize)

  • memory_order_seq_cst (default): single total order across all SC operations.
  • memory_order_acquire (load): subsequent reads/writes don't reorder before this load. Pairs with release.
  • memory_order_release (store): prior reads/writes don't reorder after this store. Pairs with acquire.
  • memory_order_acq_rel (RMW): both — for compare-exchange, fetch-add, etc.
  • memory_order_relaxed: atomicity only, no ordering with other ops. Cheapest. Use for stat counters.
  • memory_order_consume: rarely used; treated as acquire by most compilers.
EXAMPLE — single-producer / single-consumer ring buffer with relaxed atomics

An SPSC ring needs only acquire/release on the head/tail indices — the element writes themselves need no stronger ordering because the release store of the index publishes them (establishes happens-before). This is the canonical Anthropic concurrency probe.
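A minimal C++ sketch of that probe, assuming exactly one producer thread, one consumer thread, and a power-of-two capacity (class and member names are illustrative):

#include <atomic>
#include <array>
#include <cstddef>
#include <optional>

template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> buf_;
    std::atomic<size_t> head_{0};        // next slot to write (producer-owned)
    std::atomic<size_t> tail_{0};        // next slot to read  (consumer-owned)
public:
    bool push(T x) {                     // producer thread only
        size_t head = head_.load(std::memory_order_relaxed);   // own index: relaxed is enough
        size_t tail = tail_.load(std::memory_order_acquire);    // see the consumer's progress
        if (head - tail == N) return false;                      // full
        buf_[head & (N - 1)] = std::move(x);                     // plain write, published below
        head_.store(head + 1, std::memory_order_release);        // publish the element
        return true;
    }
    std::optional<T> pop() {             // consumer thread only
        size_t tail = tail_.load(std::memory_order_relaxed);     // own index: relaxed is enough
        size_t head = head_.load(std::memory_order_acquire);     // pairs with producer's release
        if (tail == head) return std::nullopt;                   // empty
        T x = std::move(buf_[tail & (N - 1)]);
        tail_.store(tail + 1, std::memory_order_release);        // let the producer reuse the slot
        return x;
    }
};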

REMEMBER
  • Without atomics + memory orders, the CPU and compiler reorder freely.
  • x86 ≠ ARM. Don't ship code that "works on my x86 box" without testing on ARM.
  • Acquire/release pairs establish happens-before; relaxed is for unordered counters.
03
PRIMITIVES · LOCKS

Locks — the menu and when each fits

TL;DR

Mutex by default. Spinlock only if the critical section is microseconds. RWLock if reads dominate writes. Lock-free if you're willing to spend 10× the eng time for higher throughput. Recursive lock = a code smell signaling tangled control flow.

  • Mutex: blocks waiters, OS-managed. Default for any non-trivial critical section.
  • Spinlock: busy-waits until acquired. For critical sections < ~1 μs where contention is rare.
  • RWLock: many readers OR one writer. Read-heavy workloads (caches, config); hurts under writes.
  • Recursive lock: same thread can re-acquire. Almost never — usually a sign the code is tangled.
  • Lock-free: at least one thread always progresses. Hot data structures (queues, hashmaps) where contention is the bottleneck.
  • Wait-free: every thread progresses in bounded steps. Real-time / safety-critical; rare due to complexity.
PITFALL — RWLock isn't always faster
RWLock has higher overhead than plain mutex (read-counter management). Under balanced or write-heavy workloads, plain mutex wins. Writer can starve under endless reads. Always benchmark before assuming RWLock helps.
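For the read-heavy case, here is a minimal C++ sketch using std::shared_mutex, the standard library's RWLock (the ConfigCache name is illustrative); benchmark it against a plain mutex on your real read/write mix before committing:

#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <optional>

class ConfigCache {
    mutable std::shared_mutex mu_;
    std::unordered_map<std::string, std::string> kv_;
public:
    std::optional<std::string> get(const std::string& key) const {
        std::shared_lock lk(mu_);            // many readers may hold this at once
        if (auto it = kv_.find(key); it != kv_.end()) return it->second;
        return std::nullopt;
    }
    void set(const std::string& key, std::string value) {
        std::unique_lock lk(mu_);            // exclusive: blocks readers and other writers
        kv_[key] = std::move(value);
    }
};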
REMEMBER
  • Default to mutex. Promote only when you have data showing it's the bottleneck.
  • Spinlock = bet that contention is rare AND critical sections are tiny.
  • Lock-free ≠ free. Wait-free is a strictly stronger guarantee almost no one needs.
04
CORRECTNESS · DEADLOCK

Coffman's four conditions — and how to break them

TL;DR

All four conditions must hold for deadlock to occur. Break any one to prevent it. The cheapest fix in production: global lock ordering — always acquire locks in the same canonical order across the codebase.

  1. Mutual exclusion — only one thread holds the resource.
  2. Hold and wait — a thread holds one resource while waiting for another.
  3. No preemption — resources can't be forcibly released.
  4. Circular wait — a cycle exists in the resource-wait graph.

Practical deadlock-prevention recipes:

  • Global lock ordering — assign every lock a rank and always acquire in that order (breaks circular wait).
  • Acquire all locks at once — e.g. C++ std::scoped_lock / std::lock over every mutex you need (breaks hold-and-wait).
  • Try-lock with timeout, then back off and retry — a practical stand-in for preemption.
  • Don't hold locks across blocking calls (I/O, RPC, callbacks into user code).
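A minimal Python sketch of the lock-ordering recipe, using a hypothetical transfer() between two accounts: acquiring in a canonical order makes the A→B / B→A cycle impossible.

import threading

class Account:
    def __init__(self, acct_id, balance):
        self.id = acct_id
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock the lower account id first — the global order breaks circular wait.
    first, second = sorted((src, dst), key=lambda a: a.id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount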

REMEMBER
  • All four Coffman conditions required → break any one to prevent deadlock.
  • Global lock ordering is the cheapest fix that scales to large codebases.
  • Livelock (threads constantly retrying without progress) is harder to detect than deadlock.
05
CANONICAL PATTERN · CODE

Producer-consumer — the canonical pattern

TL;DR

Mutex + two condition variables (not_full, not_empty) + a queue + capacity. Always wait in a while loop (spurious wakeups). The canonical multi-thread interview drill — implement it cold in your sleep.

Python — using queue.Queue

import threading, queue

q = queue.Queue(maxsize=10)   # thread-safe, blocking on full/empty

def producer():
    for i in range(100):
        q.put(i)               # blocks if full
        print(f"produced {i}")

def consumer():
    while True:
        item = q.get()         # blocks if empty
        print(f"consumed {item}")
        q.task_done()

prod = threading.Thread(target=producer)
threading.Thread(target=consumer, daemon=True).start()
prod.start()
prod.join()   # wait for the producer to finish putting (q.join() alone could return before it starts)
q.join()      # then wait until every queued item has been task_done()'d

C++ — explicit condition variables

#include <mutex>
#include <condition_variable>
#include <queue>

template<typename T>
class BoundedQueue {
    std::mutex m;
    std::condition_variable not_full, not_empty;
    std::queue<T> q;
    size_t cap;
public:
    BoundedQueue(size_t c) : cap(c) {}

    void put(T x) {
        std::unique_lock lk(m);
        not_full.wait(lk, [&]{ return q.size() < cap; });
        q.push(std::move(x));
        not_empty.notify_one();
    }

    T take() {
        std::unique_lock lk(m);
        not_empty.wait(lk, [&]{ return !q.empty(); });
        T x = std::move(q.front()); q.pop();
        not_full.notify_one();
        return x;
    }
};
EXAMPLE — extending to multi-producer / multi-consumer

The above already works for N producers and M consumers — the mutex serializes access; condition variables wake exactly one waiter per notify. Use notify_all if your predicate change might unblock multiple waiters (e.g., adding K items at once instead of one).

PITFALL — the if vs while bug
Always wrap cv.wait in a while (!predicate()) loop, never if. Spurious wakeups are real (POSIX permits them). With if, your code wakes up in an invalid state and crashes.
REMEMBER
  • Mutex + 2 CVs + while-predicate-wait. That's the pattern.
  • notify_one when one waiter can make progress; notify_all when many might.
  • Python: just use queue.Queue unless you need custom predicates.
06
PYTHON DEEP DIVE · GIL

Python GIL — what it does, what it doesn't, free-threaded Python

TL;DR

The GIL serializes Python bytecode execution: only one thread runs Python at a time. It's released during I/O syscalls and in C extensions that ask for release (NumPy, PyTorch CUDA). It does NOT help CPU-bound pure Python — for that, use multiprocessing or free-threaded Python (3.13t, PEP 703).

THE INSIGHT — what the GIL actually blocks

It's not "Python isn't multi-threaded" — it's a specific lock with specific release rules

  • Released during I/O: read(), send(), sleep(), file I/O, network I/O — threading helps for I/O-bound work.
  • Released in C extensions that ask: NumPy, PyTorch CUDA, lots of native libraries — threading helps for those too.
  • Doesn't help CPU-bound pure Python: for i in range(N): x += 1 — single-thread serialized. Use multiprocessing.
  • Free-threaded Python (3.13t, 2024): experimental GIL-disabled build. ~10–30% single-thread overhead, but true parallel Python on multiple cores. Not yet default; library compatibility is growing fast (NumPy, PyTorch, Pandas already work).
  • Subinterpreters (3.12+): each interpreter has its own GIL; can run truly in parallel within one process. Communication via shared memory.

What's atomic in Python?

Single bytecode ops are atomic (e.g., list.append, dict[k] = v). But x += 1 is not — it's load + add + store, three separate bytecodes. Two threads can both load, both add, both store — losing one increment.
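A minimal sketch of that lost-update race and its fix (thread and iteration counts are arbitrary):

import threading

counter = 0
lock = threading.Lock()

def unsafe(n):
    global counter
    for _ in range(n):
        counter += 1              # read-modify-write: not atomic

def safe(n):
    global counter
    for _ in range(n):
        with lock:                # serialize the read-modify-write
            counter += 1

def run(worker, n=100_000, threads=8):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts: t.start()
    for t in ts: t.join()
    return counter

print(run(unsafe))   # often less than 800000 — increments lost
print(run(safe))     # always 800000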

PITFALL — "Python is thread-safe so I don't need locks"
The GIL prevents corruption of internal CPython data structures (the C-level invariants). It does NOT prevent your Python-level race conditions. counter += 1 from N threads still loses updates without a lock.
REMEMBER
  • GIL = serializes bytecode, not the world.
  • Released during I/O syscalls and in cooperating C extensions.
  • For CPU-bound Python: multiprocessing or free-threaded 3.13t.
  • Read-modify-write at the Python level still races without a lock.
07
PYTHON DEEP DIVE · ASYNCIO

asyncio internals — the event loop and its sharp edges

TL;DR

Single-threaded event loop. await yields control back to the loop. asyncio.gather runs coroutines concurrently. The fatal failure mode: a single sync blocking call freezes the entire loop and every other coroutine.

The mental model

import asyncio

urls = ["https://a.example", "https://b.example", "https://c.example"]   # example inputs

async def fetch(url):
    print(f"start {url}")
    await asyncio.sleep(1)         # simulated I/O — yields control to loop
    print(f"done {url}")
    return url

async def main():
    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results

asyncio.run(main())
CRITICAL — single sync call kills the loop

Why a sync call freezes every other coroutine

The loop is single-threaded. While it executes a sync function, no other coroutine runs. One time.sleep(1), sync DB query, or requests.get() freezes the entire loop for the duration of the call — every other concurrent task pauses too.

Fix: use async equivalents (await asyncio.sleep, async DB drivers like asyncpg, aiohttp instead of requests); or wrap blocking calls in asyncio.to_thread() / loop.run_in_executor().
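A minimal sketch of the to_thread fix — blocking_query here is a stand-in for any sync call (DB driver, requests.get):

import asyncio, time

def blocking_query():
    time.sleep(1)                  # sync sleep — would freeze the loop if called directly
    return "rows"

async def heartbeat():
    for _ in range(5):
        print("loop is alive")
        await asyncio.sleep(0.25)

async def main():
    rows, _ = await asyncio.gather(
        asyncio.to_thread(blocking_query),   # runs in a worker thread; the loop stays responsive
        heartbeat(),
    )
    print(rows)

asyncio.run(main())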

REMEMBER
  • Event loop = single thread + selector (epoll/kqueue) + cooperative scheduling.
  • Every await is a yield point. Anything between two awaits runs uninterrupted.
  • Sync blocking calls inside async = catastrophic. Use to_thread() if you must.
08
PYTHON DEEP DIVE · TRUE PARALLELISM

Multiprocessing & shared memory — true parallelism in Python

TL;DR

Multiprocessing gives you actual parallelism in Python by spawning OS processes. The bottleneck is data transfer — pickle round-trips can dwarf the compute. Use multiprocessing.shared_memory for zero-copy NumPy sharing.

from multiprocessing import shared_memory
import numpy as np

# Producer
shm = shared_memory.SharedMemory(create=True, size=4 * 1024 * 1024)
arr = np.ndarray((1024, 1024), dtype=np.float32, buffer=shm.buf)
arr[:] = np.random.randn(1024, 1024)
print(shm.name)   # pass to consumer

# Consumer (different process)
shm = shared_memory.SharedMemory(name="...")
arr = np.ndarray((1024, 1024), dtype=np.float32, buffer=shm.buf)
# zero-copy access; no pickle round-trip
EXAMPLE — when multiprocessing wins

Pure-Python computation on independent chunks of data. Image preprocessing, simulations, hyperparam sweeps. Loses when data transfer dominates — e.g., transferring a 1GB array to children that do 10ms of work each.
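A minimal sketch of the winning case — each worker receives only a cheap-to-pickle chunk index and does enough compute to amortize process startup and pickling (simulate is a hypothetical stand-in):

from concurrent.futures import ProcessPoolExecutor
import math

def simulate(seed: int) -> float:
    # stand-in for an independent, CPU-heavy chunk (simulation, image, sweep point)
    acc = 0.0
    for i in range(1, 200_000):
        acc += math.sin(seed * i) / i
    return acc

if __name__ == "__main__":
    with ProcessPoolExecutor() as ex:
        results = list(ex.map(simulate, range(16)))   # 16 independent chunks in parallel
    print(sum(results))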

REMEMBER
  • Multiprocessing = real parallel Python by spawning OS processes.
  • Pickle is the hidden cost. Use shared_memory for large arrays.
  • Consider Ray / Dask / joblib for production work — they handle the boilerplate.
09
LOCK-FREE · THE CORE PRIMITIVE

Compare-and-swap, ABA, lock-free patterns

TL;DR

CAS(addr, expected, new) atomically sets a value if it matches expected. The foundation of lock-free programming. The classic trap is the ABA problem — a value can change A → B → A while you're preempted, and your CAS will succeed on stale assumptions.

CAS — the building block

CAS(addr, expected, new): atomically set *addr = new if *addr == expected. Returns success/failure. Hardware provides this via CMPXCHG (x86) or LDREX/STREX (ARM). Foundation of lock-free queues, hashmaps, ref counts.
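A minimal C++ sketch of the CAS retry loop — lock-free push onto a Treiber stack (pop is deliberately omitted: it is exactly where ABA and reclamation, covered next, come in):

#include <atomic>

struct Node {
    int value;
    Node* next;
};

class TreiberStack {
    std::atomic<Node*> head_{nullptr};
public:
    void push(int v) {
        Node* n = new Node{v, nullptr};
        n->next = head_.load(std::memory_order_relaxed);
        // CAS(head, expected = n->next, new = n): retry until no other thread raced us.
        while (!head_.compare_exchange_weak(
                   n->next, n,
                   std::memory_order_release,      // publish the node on success
                   std::memory_order_relaxed)) {
            // on failure, compare_exchange_weak reloaded head_ into n->next — just retry
        }
    }
};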

The ABA problem

Thread T1 reads A, gets preempted. Thread T2 changes A → B → A. T1 resumes, runs CAS expecting A — it succeeds, but the world changed underneath. T1's logic may now be invalid (e.g., the node it pointed to was freed and the address reused).

PITFALL — ABA bites lock-free queues hard
Michael-Scott queue without ABA protection: pop() reads head pointer, gets preempted; another thread pops + pushes a different node that happens to land at the same address; original pop's CAS succeeds and silently corrupts the queue.

Mitigations:

  • Tagged pointers — pack a version counter alongside the pointer and CAS both (double-width CAS).
  • Hazard pointers — defer freeing while any thread marks the node "in use."
  • RCU / epoch-based reclamation — free only after a grace period.
  • Garbage collection — addresses can't be reused while still reachable (Java, Go).

REMEMBER
  • CAS is the atomic primitive lock-free programming is built on.
  • ABA is the most common lock-free bug — assume it can happen unless you've defended against it.
  • Tagged pointers, hazard pointers, RCU, or GC each defeat ABA differently.
10
LOCK-FREE · ADVANCED RECLAMATION

RCU and hazard pointers — advanced reclamation

TL;DR

Lock-free needs a memory-reclamation strategy: when can old objects be freed if some thread might still be using them? RCU defers freeing until a "grace period" passes; hazard pointers track per-thread "in use" markers. Both prevent use-after-free in lock-free structures.

RCU (Read-Copy-Update)

Readers access shared data without locking. Writers create new copies and atomically swap pointers. Old copies freed only after all readers that started before the swap have finished (a "grace period"). Used heavily in the Linux kernel for read-heavy data (routing tables, file system mounts).
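A minimal C++ sketch in the RCU spirit (not kernel RCU): readers take a lock-free snapshot, the writer copies and atomically swaps the pointer. Here shared_ptr refcounts stand in for the grace period, and a single writer (or an external writer mutex) is assumed; the RouteTable name is illustrative.

#include <atomic>
#include <map>
#include <memory>
#include <string>

struct RouteTable {
    std::map<std::string, std::string> routes;
};

std::shared_ptr<const RouteTable> g_table = std::make_shared<RouteTable>();

// Reader: one atomic load, then lock-free access to an immutable snapshot.
std::string lookup(const std::string& key) {
    std::shared_ptr<const RouteTable> snap = std::atomic_load(&g_table);
    auto it = snap->routes.find(key);
    return it == snap->routes.end() ? std::string{} : it->second;
}

// Writer: copy the current table, modify the copy, publish it atomically.
// The old copy is freed once the last pre-swap reader drops its snapshot.
void update(const std::string& key, const std::string& value) {
    auto next = std::make_shared<RouteTable>(*std::atomic_load(&g_table));
    next->routes[key] = value;
    std::atomic_store(&g_table, std::shared_ptr<const RouteTable>(next));
}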

Hazard pointers

Per-thread pointers marking objects "in use." Memory reclamation defers freeing until no thread declares a hazard pointer to that object. Solves ABA and use-after-free in lock-free structures. More memory-efficient than RCU; more complex bookkeeping.

EXAMPLE — what real systems pick

Linux kernel: RCU heavily (read-heavy data structures). Folly (Facebook C++): hazard pointers (general-purpose lock-free). Java/Go: garbage collector handles it (no manual reclamation needed).

REMEMBER
  • RCU = grace period + atomic pointer swap. Cheap reads, expensive writes.
  • Hazard pointers = per-thread "in use" markers. More flexible, more complex.
  • GC'd languages get this for free — at the cost of pause times.
11
DRILLS · LEETCODE

Concurrency LeetCodes — the labs ask these

TL;DR

LeetCode has a small set of "concurrency" problems designed to drill thread-coordination patterns. Anthropic and OpenAI loops have specifically referenced these. Drill all 8:

  1. 1114 — Print in Order
  2. 1115 — Print FooBar Alternately
  3. 1116 — Print Zero Even Odd
  4. 1117 — Building H2O
  5. 1188 — Design Bounded Blocking Queue
  6. 1195 — Fizz Buzz Multithreaded
  7. 1226 — The Dining Philosophers
  8. 1242 — Web Crawler Multithreaded

REMEMBER
  • Drill 1188 (Bounded Blocking Queue) and 1242 (Web Crawler) until you can do them in your sleep.
  • The patterns repeat: semaphores, condition variables, atomic flags.
12
FIELD INTEL · WHAT COMPANIES PROBE

What companies actually probe

TL;DR

Concurrency questions cluster by company. Anthropic asks crawlers + thread-safe LRUs. OpenAI asks per-key locks. xAI says "simple and correct beats clever and broken." Stripe probes idempotency. Know the per-company patterns going in.

  • Anthropic: multithreaded web crawler; thread-safe LRU; race conditions in a shared cache; rate limiter under contention
  • OpenAI: time-based KV store with per-key locks; multithreaded crawler; toy interpreter (lex/parse/exec)
  • xAI: "simple and correct beats clever and broken" — implement a thread-safe queue/cache; CUDA kernel correctness
  • Cohere: vector-DB upsert race; concurrent batch inference
  • Stripe: idempotency under concurrent retries; webhook signature validator
  • Nvidia: producer-consumer ring buffer; parallel matrix multiplication
REMEMBER
  • Crawlers + thread-safe data structures dominate the canon.
  • If you're asked for "lock-free," first clarify whether they mean truly lock-free (progress guarantees, CAS) or just "thread-safe without obvious races."

0 → hero concurrency path

  1. foundation OS: Three Easy Pieces — Part II (Concurrency). Free.
  2. foundation Java Concurrency in Practice (Goetz) — even if you don't use Java; concepts transfer.
  3. build Drill the LeetCode concurrency problems (1114-1242 above)
  4. build Implement a thread-safe LRU in Python with a single lock; then with striped locks. Compare contention.
  5. build Build a multithreaded web crawler in Python (the Anthropic question); then in C++ with proper memory ordering.
  6. depth C++ Concurrency in Action (Williams) — Ch 5 on memory model is essential
  7. depth Preshing on Programming — best memory-model essays on the web
  8. depth Python GIL — RealPython
  9. depth PEP 703 — free-threaded Python
  10. depth Python asyncio docs + RealPython asyncio guide
  11. depth CMU lock-free lecture
  12. depth LWN — What every programmer should know about memory (Ulrich Drepper)

Concurrency quiz — readiness check

  1. Implement a thread-safe LRU.
    Show answer

    Hashmap + doubly-linked list, single mutex around both. For higher concurrency: striped locks (per-bucket mutex). Java's old ConcurrentHashMap used this. Newer designs use lock-free with hazard pointers but they're production-grade complex.
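    A minimal Python sketch of the single-lock variant described above (OrderedDict plays the hashmap + recency-order role):

    import threading
    from collections import OrderedDict

    class ThreadSafeLRU:
        def __init__(self, capacity: int):
            self.capacity = capacity
            self._data = OrderedDict()
            self._lock = threading.Lock()

        def get(self, key, default=None):
            with self._lock:
                if key not in self._data:
                    return default
                self._data.move_to_end(key)          # mark as most recently used
                return self._data[key]

        def put(self, key, value):
            with self._lock:
                if key in self._data:
                    self._data.move_to_end(key)
                self._data[key] = value
                if len(self._data) > self.capacity:
                    self._data.popitem(last=False)   # evict least recently used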

  2. What does Python's GIL block exactly?
    Show answer

    Bytecode execution. Released during I/O syscalls and in C extensions that explicitly release it (NumPy, PyTorch CUDA). Doesn't help CPU-bound pure Python — use multiprocessing or PEP 703 free-threaded Python (3.13t).

  3. Diagnose a race in this code: counter += 1 from N threads.
    Show answer

    Not atomic — load + add + store. Two threads can both load N, both store N+1, lose one increment. Fix: lock around the increment, or use threading.Lock + with, or atomic counter (Python: itertools.count(); C++: std::atomic<int>).

  4. Explain memory_order_acquire / release.
    Show answer

    Acquire load: prevents subsequent reads/writes from being reordered before this load. Release store: prevents prior reads/writes from being reordered after this store. Pair them to establish happens-before across threads (the SC-DRF model).

  5. Producer-consumer with bounded buffer.
    Show answer

    Mutex + two condition variables (not_full, not_empty) + a queue + capacity. put: acquire mutex, wait on not_full while size==cap, push, signal not_empty. take: acquire mutex, wait on not_empty while empty, pop, signal not_full. Both predicate-tested via while loop (spurious wakeups).

  6. What's the ABA problem?
    Show answer

    Thread reads value A; preempted; another thread changes A→B→A; first thread's CAS succeeds (sees A) but the world changed. Fix: tagged pointers (pointer + version counter), hazard pointers, RCU, garbage collection.

  7. Why striped lock instead of single mutex on a hashmap?
    Show answer

    Lock per bucket → reduces contention. Threads operating on different keys touch different mutexes, parallelize. Tradeoff: more memory, more complex resize. Java's old ConcurrentHashMap used 16 stripes by default.

  8. Implement a multithreaded web crawler.
    Show answer

    Thread pool of N workers. Shared thread-safe URL queue (deque + lock OR queue.Queue). Shared visited set (with lock). Per-host rate limiter. Termination: "no more URLs and no in-flight tasks." Use concurrent.futures.ThreadPoolExecutor in Python.
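    A minimal Python sketch of that design — fetch_links is a hypothetical stand-in for the download-and-extract step, and termination is handled crudely via a queue timeout rather than the in-flight-task count mentioned above:

    import threading, queue
    from concurrent.futures import ThreadPoolExecutor

    def fetch_links(url):
        return []                  # stand-in: download url, return outgoing links

    def crawl(seed, num_workers=8):
        frontier = queue.Queue()
        frontier.put(seed)
        visited = {seed}
        visited_lock = threading.Lock()

        def worker():
            while True:
                try:
                    url = frontier.get(timeout=1)    # crude termination: idle means we're done
                except queue.Empty:
                    return
                for link in fetch_links(url):
                    with visited_lock:               # visited set is shared — guard it
                        if link in visited:
                            continue
                        visited.add(link)
                    frontier.put(link)

        with ThreadPoolExecutor(max_workers=num_workers) as ex:
            for _ in range(num_workers):
                ex.submit(worker)                    # pool shuts down after all workers return
        return visited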

  9. asyncio vs threading vs multiprocessing — pick one and explain.
    Show answer

    asyncio: many concurrent I/O operations, single thread, cooperative scheduling. Threading: I/O-bound or C-extension-heavy work where GIL is released. Multiprocessing: pure-CPU parallelism in Python. Free-threaded Python (3.13t) blurs the line.

  10. Why does a single sync call kill an asyncio event loop?
    Show answer

    The loop is single-threaded; while it executes a sync function, no other coroutine runs. time.sleep(1), sync DB query, requests.get() — all freeze the loop. Use await asyncio.sleep, async DB drivers, aiohttp; or wrap blocking calls in asyncio.to_thread().

  11. Coffman's four conditions — name them.
    Show answer

    (1) Mutual exclusion. (2) Hold and wait. (3) No preemption. (4) Circular wait. Break any one to prevent deadlock. Most common practical fix: lock ordering (always acquire in same global order).

  12. RWLock — when does it help vs hurt?
    Show answer

    Helps: read-heavy workloads with rare writes (caches). Many readers proceed in parallel. Hurts: write-heavy or balanced — RWLock has higher overhead than plain mutex due to read-counter management; writer can starve. Test under workload before assuming RWLock.

  13. Lock-free vs wait-free?
    Show answer

    Lock-free: at least one thread makes progress (system as a whole progresses). Wait-free: every thread makes progress in bounded steps (stronger). Most practical lock-free structures (Michael-Scott queue) are not wait-free; wait-free is rare due to complexity.

  14. What is RCU?
    Show answer

    Read-Copy-Update. Readers access shared data without locking. Writers create new copies and atomically swap pointers. Old copies freed only after all readers that started before the swap have finished. Used in Linux kernel for read-heavy data (routing tables, etc.).

  15. Why doesn't Python need volatile like Java/C++?
    Show answer

    Python's GIL serializes bytecode execution → all reads/writes are effectively atomic + visible at bytecode boundaries. Without GIL (free-threaded 3.13t), this guarantee weakens; PEP 703 introduces atomic primitives. Java's volatile ensures visibility across threads; C++ uses std::atomic.

  16. What's the actor model?
    Show answer

    Concurrency primitive: actors are isolated entities with private state, communicate via async messages. No shared memory → no races. Used in Erlang, Akka. Tradeoffs: easier reasoning, but message-passing overhead and harder to handle some patterns (transactional updates).

  17. Implement double-checked locking correctly.
    Show answer

    Classic singleton init. if (instance == null) { lock; if (instance == null) instance = new T(); unlock; }. Subtle bug in Java pre-5: instance write can be reordered before construction completes. Fix: declare instance volatile (Java) or use std::atomic with acquire/release (C++). Or just use static init (Java's class loader is thread-safe).

  18. What does memory_order_relaxed guarantee?
    Show answer

    Atomicity only — no ordering with other operations. Use for counters where order doesn't matter. Reads/writes can be reordered by compiler / CPU. Faster than seq_cst on weak-memory architectures (ARM).

  19. Why is std::shared_ptr thread-safe in some ways but not others?
    Show answer

    The reference count is atomic (thread-safe to copy/destroy shared_ptr from multiple threads). But the pointed-to object's contents are not protected — accessing the same object from multiple threads still needs a mutex. Common bug source.

  20. What's the difference between condition variable's wait and a busy loop?
    Show answer

    Busy loop: thread spins, holding mutex or polling — wastes CPU. Condition variable: thread releases mutex, sleeps until notify, atomically reacquires mutex on wake. Always use a while loop around wait to handle spurious wakeups.