Distributed systems
Every frontier-lab loop probes this, and it's where you've said you're weakest. This chapter teaches the lens (CAP/PACELC), the impossibility result (FLP), the building blocks (consensus, quorums, time, replication), and the canonical systems (Spanner, Dynamo, Kafka, S3) tightly enough that you can answer any senior/staff-level question from first principles instead of memorized trivia.
What you'll learn
- CAP & PACELC — the lens that organizes everything
- The impossibility result — FLP and what it forced us to do
- Consistency lattice — linearizable to eventual
- Time in distributed systems — Lamport, vector, HLC, TrueTime
- Distributed transactions — 2PC, sagas, when each works
- Consensus — Paxos vs Raft vs ZAB
- Quorums & consistent hashing — how data finds its replica
- Replication topologies — single-leader to leaderless
- CRDTs — convergence without coordination
- Failure detection — heartbeats, phi-accrual, SWIM
- Storage engines — LSM vs B-tree, the canonical choice
- The systems you must know — Spanner, Dynamo, BigTable, Kafka, S3
- Service architecture — load balancing, caching, rate limiting
- Networking quick reference — TCP, QUIC, gRPC
- The 10 must-know papers
CAP is not "pick 2 of 3." Network partitions are non-negotiable in real systems, so the only real choice is what to do during a partition: sacrifice consistency (AP) or availability (CP)? PACELC adds the missing dimension: else (no partition), do you favor latency (EL) or consistency (EC)? Every distributed database in the wild can be placed on this 2×2.
CAP, stated correctly
Brewer 2000, formalized by Gilbert & Lynch 2002. In the presence of a network partition, you must choose between consistency (linearizability — every read sees the latest write) and availability (every non-failed node responds successfully). You cannot have both during a partition.
CAP is a partition-time choice, not a design-time menu
"Pick 2 of 3" is the wrong framing. P (partition tolerance) is not optional in real distributed systems — networks fail. So you are choosing between CP and AP during the partition. Outside of partitions, you can have both C and A. This single correction is the most common entry-level filter at Google, Stripe, and Anthropic.
PACELC — the missing dimension
Abadi 2012. CAP only describes behavior during a partition; PACELC asks what you do the rest of the time. Partitioned → Availability or Consistency; Else (no partition) → Latency or Consistency. Most real systems trade consistency for latency even in the steady state.
| System | During P (CP/AP) | Else (EL/EC) | Why |
|---|---|---|---|
| Spanner | CP | EC | Pays in latency for global linearizability via TrueTime commit-wait. |
| Dynamo / Cassandra | AP | EL | Always answers; tunable weak reads (LOCAL_QUORUM keeps it sane). |
| HBase / BigTable | CP | EC | Single tablet server per range; reads block on lease handoff. |
| MongoDB (default) | CP-ish | EC | Primary-only writes; reads can be relaxed via readPreference. |
| CockroachDB | CP | EC | Spanner-style without atomic clocks; HLC + max offset. |
- P is forced. The real CAP choice is CP vs AP during a partition.
- PACELC is what you do the other 99% of the time. EL vs EC.
- Spanner = CP/EC. Dynamo = AP/EL. Place every system you discuss on this matrix.
Fischer, Lynch, Paterson (1985): in a fully asynchronous network with even one crash failure, no deterministic consensus protocol can guarantee both safety and liveness. Real systems sidestep with timeouts (semi-synchrony), randomization, or by accepting rare unavailability. Raft and Paxos always pick safety over liveness when forced.
What "asynchronous" means here
Asynchronous = no upper bound on message delay or processing time. You cannot tell a slow node from a dead one. Given that, the FLP proof shows there is always an execution where consensus runs forever without deciding.
How real systems escape FLP
- Failure detectors / timeouts — assume partial synchrony: most of the time, messages arrive within a bound; treat slow nodes as failed and elect new leaders. Raft and Paxos do this.
- Randomization — Ben-Or 1983. Coin flips break symmetry; expected (not bounded) termination.
- Accept rare unavailability — during a genuine partition where no side has a majority (true split-brain), Raft halts rather than violate safety. Liveness yielded for safety.
- FLP: async + 1 crash + deterministic = no protocol can guarantee safety AND liveness.
- Sidesteps: timeouts (semi-synchrony), randomization, or accept halts.
- Raft/Paxos give up liveness during pathological partitions; never give up safety.
Two lattices to keep straight: a replication consistency lattice (linearizable → sequential → causal → eventual) and a transaction isolation lattice (serializable → snapshot isolation → read committed → read uncommitted). Strict serializable = serializable + linearizable; that's what Spanner gives. Snapshot isolation looks safe but lets write skew through.
Replication consistency, strongest first
- Linearizability — each op appears to take effect at one instant between invocation and response, consistent with real time. The illusion of a single copy.
- Sequential consistency — total order consistent with each process's program order, but no real-time guarantee.
- Causal consistency — causally related ops are seen in the same order by all observers; concurrent ops can differ.
- Eventual consistency — replicas converge if updates stop. No bound on when.
Transaction isolation, strongest first
- Serializable — outcome equivalent to some serial execution.
- Snapshot isolation (SI) — each tx reads a consistent snapshot taken at its start; first-committer-wins for conflicting writes.
- Read committed — never read uncommitted data, but reads inside one tx may see different values.
- Read uncommitted — anything goes.
Strict serializable = serializable + linearizable. Spanner provides this; PostgreSQL "serializable" is SSI (Cahill 2008), which is serializable but not linearizable.
The classic write-skew mitigation under SI is to materialize the conflict with SELECT ... FOR UPDATE so the two transactions actually contend on a row. Postgres "serializable" catches this; MySQL "serializable" doesn't really.
- Linearizable is about real time across replicas; serializable is about some serial order.
- Strict serializable = both. Spanner.
- Snapshot isolation lets write skew through. SSI fixes it via predicate locks + dependency tracking.
Wall-clock time is unreliable across machines (NTP ~10ms drift, PTP <1ms). Logical clocks (Lamport, vector) give causal ordering without trusting hardware. Hybrid logical clocks (CockroachDB) add a physical anchor with bounded skew. TrueTime (Spanner) inverts the problem: GPS + atomic clocks give a bounded uncertainty interval, and you sleep through it (commit-wait) to make external consistency cheap.
The five clocks you must know
| Clock | What it gives | Cost | Used by |
|---|---|---|---|
| Physical (NTP/PTP) | Wall time, ~10ms / <1ms drift | Cheap | Logging, anything not safety-critical |
| Lamport | Total order, partial causality | 1 counter / event | Educational; rarely used directly |
| Vector clock | Full causality (a → b iff VC(a) < VC(b)) | O(N) per event | Dynamo, Riak; conflict detection |
| HLC | Logical + bounded physical anchor | ~16 bytes | CockroachDB, MongoDB |
| TrueTime | Bounded uncertainty interval [earliest, latest] | GPS + atomic clocks in every DC | Spanner |
Lamport clocks — the rule
On every event: L = max(L_local, L_received) + 1. Stamps every message with a counter. Gives a total order (ties broken by process ID), but loses concurrency information (you can't tell from Lamport timestamps alone whether two events were causally related or independent).
Vector clocks — full causality
Each process maintains an N-element array. On local event: bump own slot. On send: piggyback the array. On receive: take element-wise max, bump own slot. Compare by element-wise <.
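A minimal sketch of both update rules and the happens-before comparison in Go (types and names are illustrative, not from any particular system):

```go
package main

import "fmt"

// VC is a vector clock for a fixed set of processes, indexed 0..N-1.
type VC []int

// Tick records a local event at process self.
func (v VC) Tick(self int) { v[self]++ }

// Merge is the receive rule: element-wise max with the sender's
// clock, then bump our own slot.
func (v VC) Merge(other VC, self int) {
	for i := range v {
		if other[i] > v[i] {
			v[i] = other[i]
		}
	}
	v[self]++
}

// HappensBefore reports a → b: every slot ≤, at least one strictly <.
func HappensBefore(a, b VC) bool {
	strict := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			strict = true
		}
	}
	return strict
}

func main() {
	a := VC{1, 0, 0} // event at p0
	b := VC{0, 0, 0}
	b.Merge(a, 1)                       // p1 receives p0's message
	fmt.Println(b, HappensBefore(a, b)) // [1 1 0] true
	c := VC{0, 0, 1}                    // independent event at p2
	fmt.Println(HappensBefore(a, c), HappensBefore(c, a)) // false false — concurrent
}
```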
TrueTime commit-wait — worked example
A transaction T1 wants to commit at now(). TrueTime returns [earliest=10:00:00.000, latest=10:00:00.007] — uncertainty 7ms. Spanner picks commit_ts = latest = 10:00:00.007, then sleeps until TrueTime.now().earliest > 10:00:00.007 (roughly 7ms). Only then releases the commit. Why: any later transaction T2 reading the database now sees a now().earliest strictly greater than T1.commit_ts, so it cannot be assigned an earlier timestamp — external consistency holds globally without coordination between datacenters.
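A sketch of the commit-wait loop, assuming a TrueTime-style interface modeled on the paper's `tt.now()` returning `[earliest, latest]`; the clock source below is a stand-in with fixed uncertainty:

```go
package main

import (
	"fmt"
	"time"
)

type TTInterval struct{ Earliest, Latest time.Time }

type TrueTime interface{ Now() TTInterval }

// fakeTT is a stand-in clock with fixed half-width uncertainty eps.
type fakeTT struct{ eps time.Duration }

func (f fakeTT) Now() TTInterval {
	n := time.Now()
	return TTInterval{Earliest: n.Add(-f.eps), Latest: n.Add(f.eps)}
}

// CommitWait is the rule from the example: pick the commit timestamp
// at the top of the uncertainty window, then sleep until the whole
// window has passed it before acking the commit.
func CommitWait(tt TrueTime) time.Time {
	ts := tt.Now().Latest // commit timestamp
	for !tt.Now().Earliest.After(ts) {
		time.Sleep(100 * time.Microsecond)
	}
	return ts // any later transaction now gets a strictly greater ts
}

func main() {
	start := time.Now()
	CommitWait(fakeTT{eps: 3500 * time.Microsecond}) // 7ms total uncertainty
	fmt.Println("commit-wait took", time.Since(start)) // ≈ 7ms
}
```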
- Lamport = total order, no causality. Vector = causality, O(N) cost.
- HLC = practical compromise. CockroachDB.
- TrueTime = bound uncertainty, then wait through it. Spanner's killer move.
2PC is correct but blocks if the coordinator dies after prepare. 3PC fixes blocking but assumes synchronous networks (which don't exist). Paxos Commit replicates the coordinator. Sagas drop ACID entirely — long-running flows split into compensable steps, no distributed lock. Real systems: 2PC for short cross-shard tx (Spanner does 2PC over Paxos), sagas for cross-service workflows (orders, payments).
2PC — prepare, then commit
- Coordinator sends PREPARE to all participants.
- Each participant locks resources, writes a prepare record to its log, votes YES or NO.
- If all YES, coordinator writes COMMIT to its log, sends COMMIT. If any NO or timeout, sends ABORT.
- Participants apply and release locks.
The blocking problem: if the coordinator crashes after participants vote YES but before sending COMMIT/ABORT, participants are stuck holding locks indefinitely. They cannot decide unilaterally because they don't know what the coordinator told the others.
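A coordinator sketch under an assumed `Participant` interface — illustration only — with the blocking window marked in a comment:

```go
package main

import "fmt"

type Participant interface {
	Prepare() bool // lock resources, write prepare record, vote YES/NO
	Commit()
	Abort()
}

// TwoPhaseCommit runs the protocol; the comment marks the window
// where a coordinator crash leaves participants blocked, locks held.
func TwoPhaseCommit(logRecord func(string), ps []Participant) bool {
	for _, p := range ps {
		if !p.Prepare() { // any NO (or timeout) → abort everyone
			for _, q := range ps {
				q.Abort()
			}
			return false
		}
	}
	// ← a crash HERE is the blocking problem: all voted YES, none
	//   may decide unilaterally, locks stay held until recovery.
	logRecord("COMMIT") // durable decision record before telling anyone
	for _, p := range ps {
		p.Commit()
	}
	return true
}

type shard struct{ name string }

func (s shard) Prepare() bool { fmt.Println(s.name, "prepared (locked)"); return true }
func (s shard) Commit()       { fmt.Println(s.name, "committed (unlocked)") }
func (s shard) Abort()        { fmt.Println(s.name, "aborted (unlocked)") }

func main() {
	ok := TwoPhaseCommit(
		func(rec string) { fmt.Println("coordinator log:", rec) },
		[]Participant{shard{"shard-a"}, shard{"shard-b"}},
	)
	fmt.Println("outcome:", ok)
}
```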
3PC and Paxos Commit
3PC adds a pre-commit phase so participants can recover safely if the coordinator dies — but it only works under synchronous network assumptions, which is why it's almost never deployed. Paxos Commit (Gray & Lamport 2006) is the production answer: replicate the coordinator's state machine via Paxos. Spanner does this — each shard is a Paxos group, and 2PC runs across groups.
Sagas — the alternative for long flows
(Garcia-Molina & Salem 1987.) Break a long-running tx into N local transactions, each with a compensating action (refund, cancel, release inventory). On failure mid-flow, run compensations for completed steps in reverse. No distributed lock. Caveat: not isolated — other readers see intermediate states.
Use 2PC when
- Cross-shard tx within one trusted system (Spanner, CockroachDB).
- Short critical sections; locks held for ms.
- You can run a Paxos-replicated coordinator.
Use sagas when
- Cross-service workflow (order → payment → shipping).
- Steps run for seconds to days.
- You can write a compensating action for every step.
- 2PC: correct, blocks if coordinator dies after prepare. Used over Paxos in Spanner.
- 3PC: rarely deployed — assumes synchrony.
- Sagas: cross-service workflows; every step needs a compensating action.
Paxos is the original; correct, fast, infamously hard to implement. Raft (2014) is Paxos restructured for understandability — same guarantees, simpler mental model, dominant in 2020s. ZAB is ZooKeeper's variant (primary-backup with total order). Build Raft yourself in MIT 6.824 Lab 2 — it's the single highest-leverage exercise for this entire pillar.
Paxos in one paragraph
Two phases. Prepare: a proposer picks a unique number n, asks acceptors to "promise" not to accept anything < n; if it gets a majority of promises, it learns the highest-numbered already-accepted value. Accept: it asks the same majority to accept (n, value); if a majority does so (none having since promised a higher number), the value is chosen. Quorum (majority) ensures any two rounds intersect. Multi-Paxos elides the prepare phase once a leader is stable — the typical case.
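A single-decree acceptor sketch showing just the two rules above (proposer, networking, and durable logging omitted; names are illustrative):

```go
package main

import "fmt"

// Acceptor holds the only per-instance state Paxos needs.
type Acceptor struct {
	promised  int    // highest proposal number promised
	acceptedN int    // proposal number of last accepted value
	acceptedV string // last accepted value ("" = none yet)
}

// Prepare(n): promise to ignore proposals below n, and report any
// previously accepted value — a proposer that wins a majority of
// promises must adopt the highest-numbered value it hears back.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedN, a.acceptedV
}

// Accept(n, v): accept unless a higher-numbered promise was made.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised, a.acceptedN, a.acceptedV = n, n, v
	return true
}

func main() {
	a := &Acceptor{}
	ok, _, _ := a.Prepare(1)
	fmt.Println(ok, a.Accept(1, "x")) // true true — value "x" accepted
	ok, prevN, prevV := a.Prepare(2)  // a newer proposer arrives
	fmt.Println(ok, prevN, prevV)     // true 1 x — it must adopt "x"
	fmt.Println(a.Accept(1, "y"))     // false — stale round rejected
}
```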
Raft — the modern default
Ongaro & Ousterhout, USENIX ATC 2014. Designed top-down for understandability. Splits consensus into three sub-problems: leader election, log replication, safety.
Memorize these and you can answer any Raft question
- Elect by majority with random timeouts. Followers become candidates on timeout, request votes; whoever wins a majority (in a term) is leader. Random timeouts prevent split votes (sketched after this list).
- Replicate the log; commit by majority. Leader appends entries, sends AppendEntries to followers; once a majority has stored an entry, it is committed and applied to the state machine.
- Leader Completeness: a leader's log contains all committed entries; only commit entries from the current term. The "only commit current-term entries" rule is the subtle one — it prevents an old leader's entry from being overwritten after re-election. It took 6 months to nail this down in the original paper.
Membership changes: joint consensus — during the transition, elections and commits require majorities of both the old and new configurations, so the two cannot make conflicting decisions, preserving safety.
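The randomized-timeout mechanic from the first bullet, sketched with a channel standing in for heartbeat RPCs (illustration only):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// followerLoop waits for heartbeats; a fresh random timeout per wait
// is what staggers candidates and prevents repeated split votes.
func followerLoop(heartbeats <-chan struct{}) {
	for {
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-heartbeats:
			// leader alive — stay a follower, re-arm the timer
		case <-time.After(timeout):
			fmt.Println("timed out → become candidate, increment term, request votes")
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() { // a leader that heartbeats three times, then "crashes"
		for i := 0; i < 3; i++ {
			hb <- struct{}{}
			time.Sleep(50 * time.Millisecond)
		}
	}()
	followerLoop(hb)
}
```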
Build it yourself
Implement Raft in MIT 6.824 Lab 2 (Go). The single most useful exercise for distributed-systems interviews. After Lab 2 + Lab 3 (KVStore on Raft), you can answer Raft questions from first principles instead of recall.
ZAB — ZooKeeper Atomic Broadcast
Primary-backup with total order. Like Raft, has leader election and log replication, but ZAB explicitly preserves order of all messages from the primary across reconfiguration. Used by ZooKeeper; its semantics (linearizable writes, sequentially-consistent reads, watches) are the basis for Chubby-style coordination services.
- Raft = leader election (random timeouts) + log replication (majority commit) + Leader Completeness (only commit current term).
- Multi-Paxos = Paxos with stable leader, prepare-elision; equivalent to Raft in practice.
- ZAB powers ZooKeeper. Chubby uses Multi-Paxos.
- Implement Raft yourself. There is no substitute.
Quorum math: R + W > N ⇒ every read intersects every write ⇒ strong-ish consistency. Consistent hashing puts keys and nodes on a ring; adding a node moves only K/N keys instead of nearly all of them. Jump hash (O(log n), no node IDs) and rendezvous (HRW, weight-friendly) are the two refinements you should know.
Quorum intuition
With N replicas, R read replicas, W write replicas: if R + W > N, every read set and every write set share at least one node (pigeonhole). So a read will see at least one replica that has the latest write. Typical balanced choice: R = W = ⌊N/2⌋ + 1 (a majority; e.g., 2 of 3).
Consistent hashing
Hash both keys and nodes onto a ring (e.g., 0..2^32). Each key is owned by the next clockwise node. Adding or removing a node only moves K/N keys, vs nearly all under naive modulo hashing. Virtual nodes (each physical node owns many ring positions) smooth out load when N is small.
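A minimal ring with virtual nodes, assuming FNV-1a as the hash (any uniform hash works; names are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	hashes []uint32          // sorted virtual-node positions on the ring
	owner  map[uint32]string // position → physical node
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ { // many positions per physical node
			h := hashOf(fmt.Sprintf("%s#%d", n, i))
			r.hashes = append(r.hashes, h)
			r.owner[h] = n
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Lookup finds the first virtual node clockwise from the key's hash.
func (r *Ring) Lookup(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.hashes[i]]
}

func main() {
	r := NewRing([]string{"a", "b", "c"}, 100)
	fmt.Println(r.Lookup("user:42"), r.Lookup("user:43"))
}
```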
Two refinements you should know
- Jump hash (Lamping & Veach 2014) — O(log n) compute, no node IDs needed, no memory state. Limitation: only works for "node count goes from N to N+1" — you can't remove arbitrary nodes mid-ring. Good for stateless services with a known cluster size. (Canonical function below.)
- Rendezvous (HRW) hashing — for each key, hash (key, node_id) for every node; pick the node with the max hash. Same K/N rebalance property as consistent hashing. Trivially supports weighted nodes (multiply hash by weight).
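The canonical jump hash function, transcribed from the Lamping & Veach paper (Go translation mine); the demo checks the K/N property going from 10 to 11 buckets:

```go
package main

import "fmt"

// JumpHash maps a 64-bit key to a bucket in [0, numBuckets).
// Note buckets can only be added/removed at the end of the range.
func JumpHash(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1 // 64-bit LCG step
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

func main() {
	// going from 10 to 11 buckets remaps only ~1/11 of keys
	moved := 0
	for k := uint64(0); k < 10000; k++ {
		if JumpHash(k, 10) != JumpHash(k, 11) {
			moved++
		}
	}
	fmt.Println("moved:", moved, "of 10000") // ≈ 909
}
```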
Sloppy quorums break the R + W > N guarantee — your read might land on the strict replicas while the write went to backup nodes. Don't claim strong consistency on a sloppy-quorum system.
- R + W > N ⇒ every read intersects every write.
- Consistent hashing: K/N keys move on add/remove. Virtual nodes for small N.
- Jump hash: O(log n), stateless, but rigid. HRW: same property + weights.
- Sloppy quorum ≠ quorum. Don't conflate.
Four topologies: single-leader (Postgres, MySQL — simple, write throughput bounded by the leader), multi-leader (multi-region active-active — needs conflict resolution), leaderless (Dynamo — sloppy quorums, gossip), chain replication (head→tail — strong consistency, simple recovery; used in the stream layer of object stores like Azure Storage). Sync vs async replication is the fundamental availability/durability dial.
The four topologies
- Single-leader: writes go to the leader, replicate to followers. Sync replication = no data loss but writes block on slowest follower; async = fast but you can lose recently-acked writes on failover. Most production setups use one sync follower (durability) and N async (read scaling).
- Multi-leader: writes can go to any leader; conflicts must be resolved (LWW, CRDTs, or app logic). Useful for multi-datacenter writes when you cannot tolerate cross-region write latency.
- Leaderless (Dynamo-style): any replica accepts writes; gossip + sloppy quorum + hinted handoff for availability; read-repair on conflict; anti-entropy via Merkle tree comparison.
- Chain replication (van Renesse & Schneider 2004): writes flow head → ... → tail; reads from tail. Strong consistency with simple failure recovery (just shorten/extend the chain).
Sync vs async — the durability dial
Sync replication: leader waits for followers to ack before returning OK to the client. No data loss on leader crash, but availability drops if a follower lags. Async: leader returns OK after local write; recently-acked writes can be lost if the leader dies before replication catches up. Semi-sync (MySQL, Postgres): wait for at least one follower to ack — usually the right tradeoff.
- Single-leader is the default; sync to one follower, async to the rest.
- Multi-leader needs conflict resolution. Don't use without one.
- Leaderless = Dynamo. Tunable consistency via R, W.
- Chain replication: strong consistency, simple recovery — underrated.
CRDTs are data structures whose merge is commutative, associative, and idempotent — replicas converge no matter the order of updates. G-Counter (sum of per-replica counters), OR-Set (tagged adds, tombstone-aware deletes), LWW-Register (timestamp wins), RGA (collaborative text). Used in Figma, Linear, Riak, and offline-first apps. The price: weaker semantics than serializable, more storage (tombstones).
The four you should know cold
- G-Counter (grow-only) — array of per-replica counters; increment your own slot; read = sum; merge = element-wise max. (Sketched after this list.)
- PN-Counter — two G-Counters (one for increments, one for decrements); read = inc - dec.
- OR-Set (observed-remove) — every add tags the element with a unique ID; remove only erases tags you have actually observed. Concurrent add+remove resolves to "present."
- LWW-Register — timestamped value; latest timestamp wins. Loses concurrent writes — use only when "last write wins" is semantically OK.
- RGA (replicated growable array) — text editing CRDT; each character has a unique ID and reference to its predecessor; insertion is commutative.
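A G-Counter sketch in Go — the merge is element-wise max, hence commutative, associative, and idempotent (map keys are replica IDs; illustrative):

```go
package main

import "fmt"

type GCounter map[string]int // replicaID → that replica's count

func (g GCounter) Inc(replica string) { g[replica]++ }

func (g GCounter) Value() int {
	sum := 0
	for _, v := range g {
		sum += v
	}
	return sum
}

// Merge takes the element-wise max — order and duplication of merges
// don't matter, which is exactly the CRDT convergence property.
func (g GCounter) Merge(other GCounter) {
	for id, v := range other {
		if v > g[id] {
			g[id] = v
		}
	}
}

func main() {
	a, b := GCounter{}, GCounter{}
	a.Inc("replica-a")
	a.Inc("replica-a")
	b.Inc("replica-b")
	a.Merge(b)
	b.Merge(a)
	a.Merge(b) // duplicate delivery is a no-op (idempotent)
	fmt.Println(a.Value(), b.Value()) // 3 3 — converged
}
```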
CRDT vs OT
Operational Transform (Google Docs) achieves the same goal differently: transform concurrent ops to commute. Requires a server to linearize. CRDTs converge peer-to-peer without one. OT is dominant historically; CRDTs dominate new collaborative apps because they're easier to reason about offline.
- CRDT = merge is commutative + associative + idempotent ⇒ converges.
- G-Counter, OR-Set, LWW-Register, RGA — name and explain each.
- CRDTs trade serializability for coordination-free availability.
You can never tell a slow node from a dead one (FLP again). The job of a failure detector is to bound the wrongness. Heartbeats give a binary alive/dead signal; phi-accrual gives a continuous suspicion level adapting to network jitter; SWIM is gossip-based for thousands of nodes.
The three you should know
- Heartbeats — periodic ping with timeout. Simple, but the timeout choice is brutal: too short and you false-positive; too long and you take forever to detect a real failure.
- Phi-accrual (Hayashibara 2004) — instead of binary, output a continuous suspicion level φ = -log₁₀(probability of mistake). App tunes a threshold. Adapts to the observed inter-arrival distribution. Used in Cassandra, Akka. (Sketched after this list.)
- SWIM (Das 2002) — Scalable Weakly-consistent Infection-style membership. Each node periodically pings a random peer; on failure, asks K random peers to indirectly probe. Gossips suspicions and confirmations. Scales to thousands of nodes. Used in HashiCorp Serf, Memberlist, Consul.
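A phi-accrual sketch assuming normally-distributed inter-arrival times (the production Cassandra/Akka detectors add a sliding window and a minimum stddev floor):

```go
package main

import (
	"fmt"
	"math"
)

// PhiDetector fits a running normal distribution to heartbeat
// inter-arrival times (Welford's algorithm) and converts "time since
// last heartbeat" into a suspicion level φ.
type PhiDetector struct {
	n, mean, m2 float64
}

func (d *PhiDetector) Observe(interval float64) {
	d.n++
	delta := interval - d.mean
	d.mean += delta / d.n
	d.m2 += delta * (interval - d.mean)
}

// Phi = -log10(P(a heartbeat arrives later than sinceLast)).
func (d *PhiDetector) Phi(sinceLast float64) float64 {
	std := math.Sqrt(d.m2/d.n) + 1e-9 // avoid division by zero
	pLater := 0.5 * math.Erfc((sinceLast-d.mean)/(std*math.Sqrt2))
	return -math.Log10(pLater)
}

func main() {
	d := &PhiDetector{}
	for _, iv := range []float64{100, 102, 98, 101, 99} { // ms between beats
		d.Observe(iv)
	}
	// short silence → low φ; long silence → φ climbs fast
	fmt.Printf("φ(104ms)=%.1f  φ(110ms)=%.1f\n", d.Phi(104), d.Phi(110))
	// the app suspects the node once φ crosses its threshold (e.g. 8)
}
```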
- Heartbeat = binary, simple, hard to tune.
- Phi-accrual = continuous, adaptive, app picks threshold. Cassandra default.
- SWIM = gossip-based, scales to thousands of nodes.
Two storage philosophies: LSM trees (RocksDB, Cassandra, BigTable) optimize writes by appending to a log and merging in the background — fast writes, slower reads, needs compaction. B-trees (InnoDB, Postgres, BoltDB) update in place — fast reads, slower random writes, no compaction. LSM dominates write-heavy KV stores; B-tree dominates OLTP.
How LSM works in 4 lines
- Writes go to a memtable (sorted in-memory) + a write-ahead log on disk.
- When the memtable fills, it flushes to an SSTable (sorted, immutable file) at level 0.
- Background compaction merges SSTables across levels (size-tiered or leveled).
- Reads: check memtable, then each level; bloom filters skip levels that can't contain the key.
LSM tree
- Writes: fast (sequential append).
- Reads: slower (multi-level + bloom).
- Space amp: higher (multiple levels, tombstones).
- Compaction: required, background.
- Used by: RocksDB, Cassandra, BigTable, ScyllaDB, LevelDB.
B-tree
- Writes: slower (random IO).
- Reads: fast (single tree walk).
- Space amp: lower (in-place updates).
- Compaction: not needed; vacuum / page splits.
- Used by: InnoDB, Postgres, BoltDB.
- LSM = write-fast, read-amp, compaction. Dominates write-heavy KVs.
- B-tree = read-fast, in-place. Dominates OLTP.
- Both use bloom filters / page caches; the difference is what's on disk.
Six systems define the field. BigTable taught us range-partitioned KV. Dynamo taught us AP + tunable consistency. Spanner taught us global linearizability via TrueTime. Kafka taught us the log as primary abstraction. S3 taught us strong consistency at exabyte scale. CockroachDB taught us Spanner without atomic clocks. Know the one-line "killer move" of each.
| System | Killer move | Read this |
|---|---|---|
| BigTable (Chang 2006) | GFS + Chubby + tablet servers; sparse multi-dim sorted map. | Original BigTable paper. |
| Spanner (Corbett 2012) | Global linearizability via TrueTime + Paxos per shard + 2PC across shards. | Spanner paper, then CockroachDB design doc. |
| Dynamo (DeCandia 2007) | Consistent hashing + sloppy quorum + vector clocks. AP first. | Dynamo paper. Influenced Cassandra, Riak, DynamoDB. |
| Cassandra | Dynamo + BigTable hybrid. Tunable consistency (LOCAL_QUORUM, EACH_QUORUM). | Cassandra docs on consistency levels. |
| CockroachDB | Spanner without atomic clocks: HLC + max offset + Raft per range. | Cockroach blog, especially "Living Without Atomic Clocks." |
| Kafka (Kreps 2011) | Log-structured commit log; topic = partitioned log; ISR semantics. | LinkedIn paper + Confluent's exactly-once blog. |
| S3 | Strong read-after-write consistency since Dec 2020. Throughput sharded by prefix. | AWS strong consistency announcement; Marc Brooker's blog. |
Kafka exactly-once — the recipe
(1) Idempotent producer: producer ID + per-partition sequence number → broker dedups retried writes. (2) Transactions: producer wraps writes across partitions; transaction coordinator atomically commits or aborts. Consumers in read_committed isolation only see committed records.
Spanner: why both Paxos AND 2PC
Common confusion: "Spanner has Paxos, why does it need 2PC?" Paxos replicates within a single shard (Paxos group). Cross-shard transactions need 2PC across Paxos groups, where each "participant" is itself a fault-tolerant Paxos group. So 2PC's blocking problem is mitigated — the coordinator and every participant are themselves replicated.
- BigTable: range-partitioned KV on GFS+Chubby.
- Spanner: TrueTime + Paxos shards + 2PC across shards = strict serializable globally.
- Dynamo: AP + consistent hashing + sloppy quorum + vector clocks.
- Kafka: log = primary abstraction; exactly-once = idempotent producer + transactions.
- S3: strong R-A-W since Dec 2020; shard throughput by prefix.
The "back-of-envelope" service-design vocabulary: L4/L7 load balancing with P2C, cache hierarchies (cache-aside vs write-through), token-bucket rate limiting in Redis with Lua, fencing tokens for distributed locks, idempotency keys (Stripe), circuit breakers + bulkheads + jittered retries. Every system-design loop draws on these.
Load balancing
L4 = TCP-level (HAProxy in TCP mode, AWS NLB). L7 = HTTP-level (Envoy, NGINX, AWS ALB) — can route on path/header/cookie, terminate TLS, do retries. Algorithms: round-robin, least-connections, consistent-hash, EWMA.
Power of two choices (P2C) — pick 2 backends at random, send to the less loaded. Mitzenmacher's result: naive random assignment yields a max load of Θ(log N / log log N), while choosing the less loaded of two random picks drops it to Θ(log log N) — exponentially better, at almost zero coordination cost. Use it.
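A P2C picker sketch using in-flight request counts as the load signal (an assumption — EWMA latency is another common choice):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// Backend tracks in-flight requests as its load signal.
type Backend struct {
	Addr     string
	InFlight atomic.Int64
}

// Pick samples two backends uniformly (with replacement) and returns
// the less loaded one — that is the whole P2C algorithm.
func Pick(pool []*Backend) *Backend {
	a := pool[rand.Intn(len(pool))]
	b := pool[rand.Intn(len(pool))]
	if b.InFlight.Load() < a.InFlight.Load() {
		return b
	}
	return a
}

func main() {
	pool := []*Backend{{Addr: "10.0.0.1"}, {Addr: "10.0.0.2"}, {Addr: "10.0.0.3"}}
	for i := 0; i < 5; i++ {
		be := Pick(pool)
		be.InFlight.Add(1) // caller decrements when the request finishes
		fmt.Println("route →", be.Addr)
	}
}
```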
Caching
- Cache-aside (lazy): app reads cache; on miss, reads DB and populates cache. (Sketch after this list.)
- Write-through: writes hit cache and DB synchronously.
- Write-behind: writes hit cache; flushed to DB async. Risk: data loss on cache crash.
- Eviction policies: LRU (most common), LFU, ARC, TinyLFU, S3-FIFO, SIEVE. SIEVE (2024) is the new low-overhead default.
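A cache-aside read path with TTL, using an in-process map as a stand-in for Redis (names hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

type entry struct {
	val     string
	expires time.Time
}

// in-process map standing in for Redis/Memcached
var cache = map[string]entry{}

// dbLoad is a stand-in for the real source of truth.
func dbLoad(key string) string { return "row-for-" + key }

// Get is the whole cache-aside pattern: try the cache, fall back to
// the DB on miss, populate with a TTL so entries eventually refresh.
func Get(key string, ttl time.Duration) string {
	if e, ok := cache[key]; ok && time.Now().Before(e.expires) {
		return e.val // hit
	}
	v := dbLoad(key) // miss
	cache[key] = entry{val: v, expires: time.Now().Add(ttl)}
	return v
}

func main() {
	fmt.Println(Get("user:1", time.Minute)) // miss → DB → populate
	fmt.Println(Get("user:1", time.Minute)) // hit
}
```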
Rate limiting
Three classic shapes:
- Token bucket — steady refill rate, allows bursts up to bucket size. Most flexible.
- Leaky bucket — steady output rate, no burst. Use when downstream can't burst.
- Sliding window — count requests in last T seconds. Approximate variants for memory.
Atomic refill + deduct in a single round-trip. Tokens and last-refill timestamp are stored per key; the Lua script computes elapsed time, refills, and deducts in one shot.
-- KEYS[1] = bucket key; ARGV: rate, capacity, now
local b = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens, ts = tonumber(b[1]), tonumber(b[2])
local rate, cap, now = tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3])
if not tokens then tokens, ts = cap, now end
tokens = math.min(cap, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then tokens = tokens - 1; allowed = 1 end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 3600)
return allowed
For 100k+ req/s: shard Redis by key, or use approximate local counters with periodic sync to a central store. Stripe's pattern.
Other essentials
- Idempotency keys — client sends a UUID; server stores (key → result); safe retries. Stripe's pattern.
- Circuit breakers / bulkheads / timeouts / jittered exponential retries — Hystrix-style. Avoid retry storms with full jitter (random in [0, backoff]).
- Service mesh — Envoy / Istio sidecar proxies handle mTLS, retries, observability.
- Event sourcing — state = log of events. CQRS = separate read/write models.
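Fencing tokens, per the takeaway below: the lock service issues a monotonically increasing token with each grant, and the protected resource rejects anything stale — so a client whose lease expired mid-GC-pause can't clobber newer state. A sketch with hypothetical types:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// FencedStore is the protected resource: it remembers the newest
// fencing token it has seen and rejects writes from older ones.
type FencedStore struct {
	mu        sync.Mutex
	lastToken int64
	value     string
}

var ErrStaleToken = errors.New("stale fencing token")

func (s *FencedStore) Write(token int64, v string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.lastToken {
		return ErrStaleToken // holder of an expired lease, fenced off
	}
	s.lastToken, s.value = token, v
	return nil
}

func main() {
	s := &FencedStore{}
	fmt.Println(s.Write(1, "a")) // <nil> — lock holder with token 1
	fmt.Println(s.Write(2, "b")) // <nil> — lock re-granted, token 2
	fmt.Println(s.Write(1, "c")) // stale fencing token — old holder rejected
}
```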
- P2C beats round-robin almost everywhere with no coordination cost.
- Cache-aside is the default. Write-through for read-after-write.
- Token bucket in Redis with Lua = standard rate limiter; shard for >100k qps.
- Distributed locks need fencing tokens. Always.
- Retries: full jitter exponential backoff. Idempotency keys for safety.
TCP is reliable, ordered, has slow-start and HOL blocking. QUIC is UDP + TLS + multiplexed streams with no HOL between streams; HTTP/3 runs on it. HTTP/2 multiplexes on TCP (still HOL on packet loss); HTTP/3 fixes that on QUIC. gRPC = HTTP/2 + Protobuf, with streaming RPC and deadlines. TLS 1.3 is 1-RTT.
| Protocol | What it is | Why you'd use it |
|---|---|---|
| TCP | Stream, reliable, ordered. Slow-start, congestion control (CUBIC, BBR). HOL blocking on loss. | The default for everything reliable. |
| UDP | Datagram, unreliable, fast. No connection state. | QUIC's transport, DNS, video, games. |
| QUIC | UDP + TLS + multiplexed streams. No HOL between streams. 0-1 RTT handshake. | HTTP/3, faster mobile, packet-loss-tolerant. |
| HTTP/1.1 | One outstanding request at a time per connection (pipelining rarely works in practice). | Legacy; still simple to debug. |
| HTTP/2 | Multiplexed on TCP. HOL on packet loss (TCP, not stream). | Reduces connection overhead. |
| HTTP/3 | Multiplexed on QUIC. No HOL. | Mobile / lossy networks; Cloudflare default. |
| gRPC | HTTP/2 + Protobuf. Streaming RPC, deadlines, interceptors. | Internal RPC at Google, Anthropic, etc. |
| WebSocket | Long-lived bidirectional TCP connection. | Real-time push (chat, collab). |
| SSE | One-way server-to-client streaming over HTTP. | Token streaming from LLM APIs. |
| TLS 1.3 | 1-RTT handshake; 0-RTT for resumption (replay risk). | Default for everything HTTPS. |
- TCP HOL is at the transport layer; HTTP/2 doesn't fix it. HTTP/3 (QUIC) does.
- gRPC = HTTP/2 + Protobuf; bidirectional streaming.
- SSE for LLM token streaming; WebSocket for full bidi.
ML-specific distributed (covered elsewhere)
Parallel training and inference patterns are covered in the distributed training and LLM inference pages. Vector DB internals are in ML system design problem 8.
The 10 must-know papers
- Raft (Ongaro & Ousterhout 2014) — read in full
- Spanner (Corbett 2012)
- Dynamo (DeCandia 2007)
- BigTable (Chang 2006)
- GFS (Ghemawat 2003)
- MapReduce (Dean 2004)
- ZooKeeper (Hunt 2010)
- Kafka (Kreps 2011) — LinkedIn paper
- Calvin (Thomson 2012) — deterministic distributed DB
- FLP impossibility (Fischer, Lynch, Paterson 1985)
0 → hero distributed systems path (the common weak area — invest heavily)
- foundation Designing Data-Intensive Applications (Kleppmann) — read all 12 chapters; 5–9 are core
- foundation Martin Kleppmann's Distributed Systems lecture series (Cambridge, free)
- foundation ByteByteGo — system design illustrations
- build MIT 6.824 Distributed Systems — implement Raft (Lab 2) and KVStore (Lab 3) in Go. The single most useful exercise.
- build Read etcd's Raft implementation
- depth Raft (Ongaro & Ousterhout 2014) — read in full
- depth Spanner (Corbett 2012)
- depth Dynamo (DeCandia 2007)
- depth BigTable (Chang 2006)
- depth GFS (Ghemawat 2003)
- depth MapReduce (Dean 2004)
- depth Martin Kleppmann's blog
- depth Marc Brooker's blog (AWS Principal Engineer — distributed systems essays)
- depth Murat Demirbas — paper reviews
- depth Jepsen — distributed systems testing reports (consistency violations in real systems)
- depth Companion: full ~104KB curriculum at ../distributed_systems_curriculum.md
Distributed systems quiz — readiness check
- Explain CAP precisely.
Under network partition, you choose between consistency (linearizability) and availability (every non-failed node responds). P is not optional in real networks — you choose CP or AP during a partition. Common misreading: "pick 2 of 3" — wrong, because real systems must tolerate partitions.
- Difference between snapshot isolation and serializable?
SI doesn't prevent write skew: two transactions read overlapping data, then each updates disjoint rows, violating a constraint that depends on both reads. SSI (Cahill 2008) adds predicate locks + dependency tracking to detect write skew. PostgreSQL's "serializable" is SSI.
- How does Spanner achieve external consistency?
TrueTime gives bounded uncertainty intervals via GPS + atomic clocks. Commit-wait: after assigning a commit timestamp, wait out the uncertainty interval before releasing the commit, so any transaction that starts afterward is guaranteed a strictly later timestamp — external consistency.
- Walk through Raft.
Three sub-problems: (1) Leader election: random timeouts → become candidate → request votes → win majority → leader. (2) Log replication: leader appends + replicates + commits when majority ack. (3) Safety: leader's log wins; only commit current-term entries. Membership changes via joint consensus.
- How does Dynamo handle conflicts?
Vector clocks identify concurrent versions; client (or LWW) resolves on read. Sloppy quorum + hinted handoff for availability. Read repair: when read sees divergent versions, async write the merged value back. Anti-entropy via Merkle tree comparison between replicas.
- Compare LSM tree vs B-tree.
LSM: write-fast (sequential append), read-amp (bloom + multi-level lookup), needs compaction. B-tree: balanced, in-place updates, slower writes. LSM dominates write-heavy KVs (Cassandra, RocksDB). B-tree dominates OLTP (Postgres, InnoDB).
- Design a rate limiter at 100k req/s per key.
Token bucket per key in Redis with Lua atomic update: store tokens + last-refill timestamp; on each request, refill based on (now − last) × rate, capped; deduct if enough; return decision. Higher scale: shard Redis or local approximate counters with periodic sync.
- Kafka exactly-once semantics — how?
(1) Idempotent producer: PID + per-partition sequence number → broker dedups retries. (2) Transactions: producer wraps writes across partitions in a transaction; transaction coordinator atomically commits/aborts. Consumers in read_committed only see committed messages.
- Consistent hashing vs jump hash vs rendezvous?
Consistent hash: ring with virtual nodes; adding a node moves K/N keys. Jump hash (Lamping & Veach 2014): O(log n) compute, no node IDs. Rendezvous (HRW): hash (key, node) for all nodes; pick max — handles weighted nodes naturally.
- What is FLP and what does it mean for production systems?
FLP impossibility (Fischer, Lynch, Paterson 1985): in a fully asynchronous network with even one crash, no deterministic protocol guarantees both safety and liveness. Real systems sidestep with timeouts (semi-synchrony assumption), randomization, or accept rare unavailability. Raft and Paxos prefer safety over liveness.
- Why use vector clocks vs Lamport clocks?
Lamport clock: single counter; if a → b then L(a) < L(b), but the converse fails — you can't detect concurrency. Vector clock: per-process counter array; captures full causality (a → b iff VC(a) < VC(b)). Cost: O(N) per event for N processes. Use vector clocks when you need to distinguish concurrent from causally-ordered events; Lamport when you only need a total order.
- Quorum math: R + W > N — why?
If R + W > N, every read set intersects every write set (pigeonhole). So a read sees at least one replica that has the latest write. Strong-ish consistency without coordination. R = W = ⌊N/2⌋ + 1 (a majority) is the typical balanced choice.
- Explain CRDTs and give two examples.
Conflict-free replicated data types: structures that converge under concurrent updates without coordination. G-Counter: per-replica counter, sum on read. OR-Set: tag adds with unique IDs; deletes only remove tags you've seen. Used in collaborative editors (Figma, Linear) and offline-first apps.
- Why does Spanner need 2PC despite Paxos?
Paxos is for replication within a Paxos group (one shard). Cross-shard transactions need 2PC across groups. Spanner: 2PC over Paxos — each "participant" in 2PC is a Paxos group, so the transaction is durable even with single-node failures.
- What is sloppy quorum + hinted handoff?
If a designated replica is down, the write goes to the next node in the ring (sloppy — outside the strict quorum). The receiving node holds the data as a "hint" and forwards to the original when it recovers. Tradeoff: improves write availability; risks stale reads on the original.
- Phi-accrual vs heartbeat for failure detection?
Heartbeat: ping with timeout; binary alive/dead. Phi-accrual (Hayashibara 2004): probabilistic — outputs a continuous suspicion level φ; threshold tuning per-app. Adapts to network jitter. Used in Cassandra, Akka.
- Design a distributed lock service.
Cluster of 5 nodes running Raft. Lock = key in replicated log with TTL + owner. Acquire: CAS (set if not exists); on fail, watch for delete event. Release: delete key. Crash safety: TTL releases abandoned locks. Use fencing tokens (monotonic) to prevent stale clients from acting after lease expiry.
- Explain MapReduce in one paragraph.
Distributed computation framework. User writes map(k, v) → list[(k, v)] and reduce(k, list[v]) → output. System partitions input across mappers, shuffles by key, runs reducers on grouped data. Handles fault tolerance (re-run failed tasks), data locality (run map close to data), straggler mitigation (speculative execution).
- What is read-your-writes consistency?
A user always reads their own latest writes (even in an eventually-consistent system). Implemented by routing reads to the same replica that handled writes, or by tracking client write timestamps and reading from a replica caught up past that timestamp.
- What's PACELC and how does it extend CAP?
PACELC (Abadi 2012): partitioned → choose C or A; else (no partition) → L (latency) or C. Most real systems trade off in normal operation too (not just during partitions). Spanner: CP / EC. Dynamo: AP / EL.