Distributed systems
Every frontier-lab loop probes this, and it's where you've said you're weakest. This chapter teaches the lens (CAP/PACELC), the impossibility result (FLP), the building blocks (consensus, quorums, time, replication), and the canonical systems (Spanner, Dynamo, Kafka, S3) tightly enough that you can answer any senior/staff-level question from first principles instead of memorized trivia.
What you'll learn
- CAP & PACELC — the lens that organizes everything
- The impossibility result — FLP and what it forced us to do
- Consistency lattice — linearizable to eventual
- Time in distributed systems — Lamport, vector, HLC, TrueTime
- Distributed transactions — 2PC, sagas, when each works
- Consensus — Paxos vs Raft vs ZAB
- Quorums & consistent hashing — how data finds its replica
- Replication topologies — single-leader to leaderless
- CRDTs — convergence without coordination
- Failure detection — heartbeats, phi-accrual, SWIM
- Storage engines — LSM vs B-tree, the canonical choice
- The systems you must know — Spanner, Dynamo, BigTable, Kafka, S3
- Service architecture — load balancing, caching, rate limiting
- Networking quick reference — TCP, QUIC, gRPC
- The 10 must-know papers
CAP is not "pick 2 of 3." Network partitions are non-negotiable in real systems, so the only real choice is what to do during a partition: sacrifice consistency (AP) or availability (CP)? PACELC adds the missing dimension: else (no partition), do you favor latency (EL) or consistency (EC)? Every distributed database in the wild can be placed on this 2×2.
CAP, stated correctly
Brewer 2000, formalized by Gilbert & Lynch 2002. In the presence of a network partition, you must choose between consistency (linearizability — every read sees the latest write) and availability (every non-failed node responds successfully). You cannot have both during a partition.
CAP is a partition-time choice, not a design-time menu
"Pick 2 of 3" is the wrong framing. P (partition tolerance) is not optional in real distributed systems — networks fail. So you are choosing between CP and AP during the partition. Outside of partitions, you can have both C and A. This single correction is the most common entry-level filter at Google, Stripe, and Anthropic.
PACELC — the missing dimension
Abadi 2012. CAP only describes behavior during a partition; PACELC asks what you do the rest of the time. Partitioned → Availability or Consistency; Else (no partition) → Latency or Consistency. Most real systems trade consistency for latency even in the steady state.
| System | During P (CP/AP) | Else (EL/EC) | Why |
|---|---|---|---|
| Spanner | CP | EC | Pays in latency for global linearizability via TrueTime commit-wait. |
| Dynamo / Cassandra | AP | EL | Always answers; tunable weak reads (LOCAL_QUORUM keeps it sane). |
| HBase / BigTable | CP | EC | Single tablet server per range; reads block on lease handoff. |
| MongoDB (default) | CP-ish | EC | Primary-only writes; reads can be relaxed via readPreference. |
| CockroachDB | CP | EC | Spanner-style without atomic clocks; HLC + max offset. |
- P is forced. The real CAP choice is CP vs AP during a partition.
- PACELC is what you do the other 99% of the time. EL vs EC.
- Spanner = CP/EC. Dynamo = AP/EL. Place every system you discuss on this matrix.
Fischer, Lynch, Paterson (1985): in a fully asynchronous network with even one crash failure, no deterministic consensus protocol can guarantee both safety and liveness. Real systems sidestep with timeouts (semi-synchrony), randomization, or by accepting rare unavailability. Raft and Paxos always pick safety over liveness when forced.
What "asynchronous" means here
Asynchronous = no upper bound on message delay or processing time. You cannot tell a slow node from a dead one. Given that, the FLP proof shows there is always an execution where consensus runs forever without deciding.
How real systems escape FLP
- Failure detectors / timeouts — assume partial synchrony: most of the time, messages arrive within a bound; treat slow nodes as failed and elect new leaders. Raft and Paxos do this.
- Randomization — Ben-Or 1983. Coin flips break symmetry; expected (not bounded) termination.
- Accept rare unavailability — during a genuine partition where no side has a majority (true split-brain), Raft halts rather than violate safety. Liveness yielded for safety.
- FLP: async + 1 crash + deterministic = no protocol can guarantee safety AND liveness.
- Sidesteps: timeouts (semi-synchrony), randomization, or accept halts.
- Raft/Paxos give up liveness during pathological partitions; never give up safety.
Two lattices to keep straight: a replication consistency lattice (linearizable → sequential → causal → eventual) and a transaction isolation lattice (serializable → snapshot isolation → read committed → read uncommitted). Strict serializable = serializable + linearizable; that's what Spanner gives. Snapshot isolation looks safe but lets write skew through.
Replication consistency, strongest first
- Linearizability — each op appears to take effect at one instant between invocation and response, consistent with real time. The illusion of a single copy.
- Sequential consistency — total order consistent with each process's program order, but no real-time guarantee.
- Causal consistency — causally related ops are seen in the same order by all observers; concurrent ops can differ.
- Eventual consistency — replicas converge if updates stop. No bound on when.
Transaction isolation, strongest first
- Serializable — outcome equivalent to some serial execution.
- Snapshot isolation (SI) — each tx reads a consistent snapshot taken at its start; first-committer-wins for conflicting writes.
- Read committed — never read uncommitted data, but reads inside one tx may see different values.
- Read uncommitted — anything goes.
Strict serializable = serializable + linearizable. Spanner provides this; PostgreSQL "serializable" is SSI (Cahill 2008), which is serializable but not linearizable.
The classic write-skew mitigation under SI is to materialize the conflict with SELECT ... FOR UPDATE so the two transactions actually contend on a row. Postgres "serializable" catches this; MySQL "serializable" doesn't really.
- Linearizable is about real time across replicas; serializable is about some serial order.
- Strict serializable = both. Spanner.
- Snapshot isolation lets write skew through. SSI fixes it via predicate locks + dependency tracking.
Wall-clock time is unreliable across machines (NTP ~10ms drift, PTP <1ms). Logical clocks (Lamport, vector) give causal ordering without trusting hardware. Hybrid logical clocks (CockroachDB) add a physical anchor with bounded skew. TrueTime (Spanner) inverts the problem: GPS + atomic clocks give a bounded uncertainty interval, and you sleep through it (commit-wait) to make external consistency cheap.
The five clocks you must know
| Clock | What it gives | Cost | Used by |
|---|---|---|---|
| Physical (NTP/PTP) | Wall time, ~10ms / <1ms drift | Cheap | Logging, anything not safety-critical |
| Lamport | Total order, partial causality | 1 counter / event | Educational; rarely used directly |
| Vector clock | Full causality (a → b iff VC(a) < VC(b)) | O(N) per event | Dynamo, Riak; conflict detection |
| HLC | Logical + bounded physical anchor | ~16 bytes | CockroachDB, MongoDB |
| TrueTime | Bounded uncertainty interval [earliest, latest] | GPS + atomic clocks in every DC | Spanner |
Lamport clocks — the rule
On every event: L = max(L_local, L_received) + 1. Stamps every message with a counter. Gives a total order (ties broken by process ID), but loses concurrency information (you can't tell from Lamport timestamps alone whether two events were causally related or independent).
Vector clocks — full causality
Each process maintains an N-element array. On local event: bump own slot. On send: piggyback the array. On receive: take element-wise max, bump own slot. Compare by element-wise <.
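A minimal sketch of both update rules and the happens-before comparison in Go (types and names are illustrative, not from any particular system):

```go
package main

import "fmt"

// VC is a vector clock for a fixed set of processes, indexed 0..N-1.
type VC []int

// Tick records a local event at process self.
func (v VC) Tick(self int) { v[self]++ }

// Merge is the receive rule: element-wise max with the sender's
// clock, then bump our own slot.
func (v VC) Merge(other VC, self int) {
	for i := range v {
		if other[i] > v[i] {
			v[i] = other[i]
		}
	}
	v[self]++
}

// HappensBefore reports a → b: every slot ≤, at least one strictly <.
func HappensBefore(a, b VC) bool {
	strict := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			strict = true
		}
	}
	return strict
}

func main() {
	a := VC{1, 0, 0} // event at p0
	b := VC{0, 0, 0}
	b.Merge(a, 1)                       // p1 receives p0's message
	fmt.Println(b, HappensBefore(a, b)) // [1 1 0] true
	c := VC{0, 0, 1}                    // independent event at p2
	fmt.Println(HappensBefore(a, c), HappensBefore(c, a)) // false false — concurrent
}
```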
TrueTime commit-wait — worked example
A transaction T1 wants to commit at now(). TrueTime returns [earliest=10:00:00.000, latest=10:00:00.007] — uncertainty 7ms. Spanner picks commit_ts = latest = 10:00:00.007, then sleeps until TrueTime.now().earliest > 10:00:00.007 (roughly 7ms). Only then releases the commit. Why: any later transaction T2 reading the database now sees a now().earliest strictly greater than T1.commit_ts, so it cannot be assigned an earlier timestamp — external consistency holds globally without coordination between datacenters.
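A sketch of the commit-wait loop, assuming a TrueTime-style interface modeled on the paper's `tt.now()` returning `[earliest, latest]`; the clock source below is a stand-in with fixed uncertainty:

```go
package main

import (
	"fmt"
	"time"
)

type TTInterval struct{ Earliest, Latest time.Time }

type TrueTime interface{ Now() TTInterval }

// fakeTT is a stand-in clock with fixed half-width uncertainty eps.
type fakeTT struct{ eps time.Duration }

func (f fakeTT) Now() TTInterval {
	n := time.Now()
	return TTInterval{Earliest: n.Add(-f.eps), Latest: n.Add(f.eps)}
}

// CommitWait is the rule from the example: pick the commit timestamp
// at the top of the uncertainty window, then sleep until the whole
// window has passed it before acking the commit.
func CommitWait(tt TrueTime) time.Time {
	ts := tt.Now().Latest // commit timestamp
	for !tt.Now().Earliest.After(ts) {
		time.Sleep(100 * time.Microsecond)
	}
	return ts // any later transaction now gets a strictly greater ts
}

func main() {
	start := time.Now()
	CommitWait(fakeTT{eps: 3500 * time.Microsecond}) // 7ms total uncertainty
	fmt.Println("commit-wait took", time.Since(start)) // ≈ 7ms
}
```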
- Lamport = total order, no causality. Vector = causality, O(N) cost.
- HLC = practical compromise. CockroachDB.
- TrueTime = bound uncertainty, then wait through it. Spanner's killer move.
2PC is correct but blocks if the coordinator dies after prepare. 3PC fixes blocking but assumes synchronous networks (which don't exist). Paxos Commit replicates the coordinator. Sagas drop ACID entirely — long-running flows split into compensable steps, no distributed lock. Real systems: 2PC for short cross-shard tx (Spanner does 2PC over Paxos), sagas for cross-service workflows (orders, payments).
2PC — prepare, then commit
- Coordinator sends PREPARE to all participants.
- Each participant locks resources, writes a prepare record to its log, votes YES or NO.
- If all YES, coordinator writes COMMIT to its log, sends COMMIT. If any NO or timeout, sends ABORT.
- Participants apply and release locks.
The blocking problem: if the coordinator crashes after participants vote YES but before sending COMMIT/ABORT, participants are stuck holding locks indefinitely. They cannot decide unilaterally because they don't know what the coordinator told the others.
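A coordinator sketch under an assumed `Participant` interface — illustration only — with the blocking window marked in a comment:

```go
package main

import "fmt"

type Participant interface {
	Prepare() bool // lock resources, write prepare record, vote YES/NO
	Commit()
	Abort()
}

// TwoPhaseCommit runs the protocol; the comment marks the window
// where a coordinator crash leaves participants blocked, locks held.
func TwoPhaseCommit(logRecord func(string), ps []Participant) bool {
	for _, p := range ps {
		if !p.Prepare() { // any NO (or timeout) → abort everyone
			for _, q := range ps {
				q.Abort()
			}
			return false
		}
	}
	// ← a crash HERE is the blocking problem: all voted YES, none
	//   may decide unilaterally, locks stay held until recovery.
	logRecord("COMMIT") // durable decision record before telling anyone
	for _, p := range ps {
		p.Commit()
	}
	return true
}

type shard struct{ name string }

func (s shard) Prepare() bool { fmt.Println(s.name, "prepared (locked)"); return true }
func (s shard) Commit()       { fmt.Println(s.name, "committed (unlocked)") }
func (s shard) Abort()        { fmt.Println(s.name, "aborted (unlocked)") }

func main() {
	ok := TwoPhaseCommit(
		func(rec string) { fmt.Println("coordinator log:", rec) },
		[]Participant{shard{"shard-a"}, shard{"shard-b"}},
	)
	fmt.Println("outcome:", ok)
}
```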
3PC and Paxos Commit
3PC adds a pre-commit phase so participants can recover safely if the coordinator dies — but it only works under synchronous network assumptions, which is why it's almost never deployed. Paxos Commit (Gray & Lamport 2006) is the production answer: replicate the coordinator's state machine via Paxos. Spanner does this — each shard is a Paxos group, and 2PC runs across groups.
Sagas — the alternative for long flows
(Garcia-Molina & Salem 1987.) Break a long-running tx into N local transactions, each with a compensating action (refund, cancel, release inventory). On failure mid-flow, run compensations for completed steps in reverse. No distributed lock. Caveat: not isolated — other readers see intermediate states.
Use 2PC when
- Cross-shard tx within one trusted system (Spanner, CockroachDB).
- Short critical sections; locks held for ms.
- You can run a Paxos-replicated coordinator.
Use sagas when
- Cross-service workflow (order → payment → shipping).
- Steps run for seconds to days.
- You can write a compensating action for every step.
- 2PC: correct, blocks if coordinator dies after prepare. Used over Paxos in Spanner.
- 3PC: rarely deployed — assumes synchrony.
- Sagas: cross-service workflows; every step needs a compensating action.
Paxos is the original; correct, fast, infamously hard to implement. Raft (2014) is Paxos restructured for understandability — same guarantees, simpler mental model, dominant in 2020s. ZAB is ZooKeeper's variant (primary-backup with total order). Build Raft yourself in MIT 6.824 Lab 2 — it's the single highest-leverage exercise for this entire pillar.
Paxos in one paragraph
Two phases. Prepare: a proposer picks a unique number n, asks acceptors to "promise" not to accept anything < n; if it gets a majority of promises, it learns the highest-numbered already-accepted value. Accept: it asks the same majority to accept (n, value); if a majority does so (none having since promised a higher number), the value is chosen. Quorum (majority) ensures any two rounds intersect. Multi-Paxos elides the prepare phase once a leader is stable — the typical case.
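A single-decree acceptor sketch showing just the two rules above (proposer, networking, and durable logging omitted; names are illustrative):

```go
package main

import "fmt"

// Acceptor holds the only per-instance state Paxos needs.
type Acceptor struct {
	promised  int    // highest proposal number promised
	acceptedN int    // proposal number of last accepted value
	acceptedV string // last accepted value ("" = none yet)
}

// Prepare(n): promise to ignore proposals below n, and report any
// previously accepted value — a proposer that wins a majority of
// promises must adopt the highest-numbered value it hears back.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedN, a.acceptedV
}

// Accept(n, v): accept unless a higher-numbered promise was made.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised, a.acceptedN, a.acceptedV = n, n, v
	return true
}

func main() {
	a := &Acceptor{}
	ok, _, _ := a.Prepare(1)
	fmt.Println(ok, a.Accept(1, "x")) // true true — value "x" accepted
	ok, prevN, prevV := a.Prepare(2)  // a newer proposer arrives
	fmt.Println(ok, prevN, prevV)     // true 1 x — it must adopt "x"
	fmt.Println(a.Accept(1, "y"))     // false — stale round rejected
}
```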
Raft — the modern default
Ongaro & Ousterhout, USENIX ATC 2014. Designed top-down for understandability. Splits consensus into three sub-problems: leader election, log replication, safety.
Memorize these and you can answer any Raft question
- Elect by majority with random timeouts. Followers become candidates on timeout, request votes; whoever wins a majority (in a term) is leader. Random timeouts prevent split votes (sketched after this list).
- Replicate the log; commit by majority. Leader appends entries, sends AppendEntries to followers; once a majority has stored an entry, it is committed and applied to the state machine.
- Leader Completeness: a leader's log contains all committed entries; only commit entries from the current term. The "only commit current-term entries" rule is the subtle one — it prevents an old leader's entry from being overwritten after re-election. It took 6 months to nail this down in the original paper.
Membership changes: joint consensus — during the transition, elections and commits require majorities of both the old and new configurations, so the two cannot make conflicting decisions, preserving safety.
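The randomized-timeout mechanic from the first bullet, sketched with a channel standing in for heartbeat RPCs (illustration only):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// followerLoop waits for heartbeats; a fresh random timeout per wait
// is what staggers candidates and prevents repeated split votes.
func followerLoop(heartbeats <-chan struct{}) {
	for {
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-heartbeats:
			// leader alive — stay a follower, re-arm the timer
		case <-time.After(timeout):
			fmt.Println("timed out → become candidate, increment term, request votes")
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() { // a leader that heartbeats three times, then "crashes"
		for i := 0; i < 3; i++ {
			hb <- struct{}{}
			time.Sleep(50 * time.Millisecond)
		}
	}()
	followerLoop(hb)
}
```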
Build it yourself
Implement Raft in MIT 6.824 Lab 2 (Go). The single most useful exercise for distributed-systems interviews. After Lab 2 + Lab 3 (KVStore on Raft), you can answer Raft questions from first principles instead of recall.
ZAB — ZooKeeper Atomic Broadcast
Primary-backup with total order. Like Raft, has leader election and log replication, but ZAB explicitly preserves order of all messages from the primary across reconfiguration. Used by ZooKeeper; its semantics (linearizable writes, sequentially-consistent reads, watches) are the basis for Chubby-style coordination services.
- Raft = leader election (random timeouts) + log replication (majority commit) + Leader Completeness (only commit current term).
- Multi-Paxos = Paxos with stable leader, prepare-elision; equivalent to Raft in practice.
- ZAB powers ZooKeeper. Chubby uses Multi-Paxos.
- Implement Raft yourself. There is no substitute.
Quorum math: R + W > N ⇒ every read intersects every write ⇒ strong-ish consistency. Consistent hashing puts keys and nodes on a ring; adding a node moves only K/N keys instead of nearly all of them. Jump hash (O(log n), no node IDs) and rendezvous (HRW, weight-friendly) are the two refinements you should know.
Quorum intuition
With N replicas, R read replicas, W write replicas: if R + W > N, every read set and every write set share at least one node (pigeonhole). So a read will see at least one replica that has the latest write. Typical balanced choice: R = W = ⌊N/2⌋ + 1 (a majority; e.g., 2 of 3).
Consistent hashing
Hash both keys and nodes onto a ring (e.g., 0..2^32). Each key is owned by the next clockwise node. Adding or removing a node only moves K/N keys, vs nearly all under naive modulo hashing. Virtual nodes (each physical node owns many ring positions) smooth out load when N is small.
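A minimal ring with virtual nodes, assuming FNV-1a as the hash (any uniform hash works; names are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	hashes []uint32          // sorted virtual-node positions on the ring
	owner  map[uint32]string // position → physical node
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ { // many positions per physical node
			h := hashOf(fmt.Sprintf("%s#%d", n, i))
			r.hashes = append(r.hashes, h)
			r.owner[h] = n
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Lookup finds the first virtual node clockwise from the key's hash.
func (r *Ring) Lookup(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.hashes[i]]
}

func main() {
	r := NewRing([]string{"a", "b", "c"}, 100)
	fmt.Println(r.Lookup("user:42"), r.Lookup("user:43"))
}
```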
Two refinements you should know
- Jump hash (Lamping & Veach 2014) — O(log n) compute, no node IDs needed, no memory state. Limitation: only works for "node count goes from N to N+1" — you can't remove arbitrary nodes mid-ring. Good for stateless services with a known cluster size. (Canonical function below.)
- Rendezvous (HRW) hashing — for each key, hash (key, node_id) for every node; pick the node with the max hash. Same K/N rebalance property as consistent hashing. Trivially supports weighted nodes (multiply hash by weight).
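The canonical jump hash function, transcribed from the Lamping & Veach paper (Go translation mine); the demo checks the K/N property going from 10 to 11 buckets:

```go
package main

import "fmt"

// JumpHash maps a 64-bit key to a bucket in [0, numBuckets).
// Note buckets can only be added/removed at the end of the range.
func JumpHash(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1 // 64-bit LCG step
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

func main() {
	// going from 10 to 11 buckets remaps only ~1/11 of keys
	moved := 0
	for k := uint64(0); k < 10000; k++ {
		if JumpHash(k, 10) != JumpHash(k, 11) {
			moved++
		}
	}
	fmt.Println("moved:", moved, "of 10000") // ≈ 909
}
```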
Sloppy quorums break the R + W > N guarantee — your read might land on the strict replicas while the write went to backup nodes. Don't claim strong consistency on a sloppy-quorum system.
- R + W > N ⇒ every read intersects every write.
- Consistent hashing: K/N keys move on add/remove. Virtual nodes for small N.
- Jump hash: O(log n), stateless, but rigid. HRW: same property + weights.
- Sloppy quorum ≠ quorum. Don't conflate.
Four topologies: single-leader (Postgres, MySQL — simple, write throughput bounded by the leader), multi-leader (multi-region active-active — needs conflict resolution), leaderless (Dynamo — sloppy quorums, gossip), chain replication (head→tail — strong consistency, simple recovery; used in the stream layer of object stores like Azure Storage). Sync vs async replication is the fundamental availability/durability dial.
The four topologies
- Single-leader: writes go to the leader, replicate to followers. Sync replication = no data loss but writes block on slowest follower; async = fast but you can lose recently-acked writes on failover. Most production setups use one sync follower (durability) and N async (read scaling).
- Multi-leader: writes can go to any leader; conflicts must be resolved (LWW, CRDTs, or app logic). Useful for multi-datacenter writes when you cannot tolerate cross-region write latency.
- Leaderless (Dynamo-style): any replica accepts writes; gossip + sloppy quorum + hinted handoff for availability; read-repair on conflict; anti-entropy via Merkle tree comparison.
- Chain replication (van Renesse & Schneider 2004): writes flow head → ... → tail; reads from tail. Strong consistency with simple failure recovery (just shorten/extend the chain).
Sync vs async — the durability dial
Sync replication: leader waits for followers to ack before returning OK to the client. No data loss on leader crash, but availability drops if a follower lags. Async: leader returns OK after local write; recently-acked writes can be lost if the leader dies before replication catches up. Semi-sync (MySQL, Postgres): wait for at least one follower to ack — usually the right tradeoff.
- Single-leader is the default; sync to one follower, async to the rest.
- Multi-leader needs conflict resolution. Don't use without one.
- Leaderless = Dynamo. Tunable consistency via R, W.
- Chain replication: strong consistency, simple recovery — underrated.
CRDTs are data structures whose merge is commutative, associative, and idempotent — replicas converge no matter the order of updates. G-Counter (sum of per-replica counters), OR-Set (tagged adds, tombstone-aware deletes), LWW-Register (timestamp wins), RGA (collaborative text). Used in Figma, Linear, Riak, and offline-first apps. The price: weaker semantics than serializable, more storage (tombstones).
The four you should know cold
- G-Counter (grow-only) — array of per-replica counters; increment your own slot; read = sum; merge = element-wise max. (Sketched after this list.)
- PN-Counter — two G-Counters (one for increments, one for decrements); read = inc - dec.
- OR-Set (observed-remove) — every add tags the element with a unique ID; remove only erases tags you have actually observed. Concurrent add+remove resolves to "present."
- LWW-Register — timestamped value; latest timestamp wins. Loses concurrent writes — use only when "last write wins" is semantically OK.
- RGA (replicated growable array) — text editing CRDT; each character has a unique ID and reference to its predecessor; insertion is commutative.
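A G-Counter sketch in Go — the merge is element-wise max, hence commutative, associative, and idempotent (map keys are replica IDs; illustrative):

```go
package main

import "fmt"

type GCounter map[string]int // replicaID → that replica's count

func (g GCounter) Inc(replica string) { g[replica]++ }

func (g GCounter) Value() int {
	sum := 0
	for _, v := range g {
		sum += v
	}
	return sum
}

// Merge takes the element-wise max — order and duplication of merges
// don't matter, which is exactly the CRDT convergence property.
func (g GCounter) Merge(other GCounter) {
	for id, v := range other {
		if v > g[id] {
			g[id] = v
		}
	}
}

func main() {
	a, b := GCounter{}, GCounter{}
	a.Inc("replica-a")
	a.Inc("replica-a")
	b.Inc("replica-b")
	a.Merge(b)
	b.Merge(a)
	a.Merge(b) // duplicate delivery is a no-op (idempotent)
	fmt.Println(a.Value(), b.Value()) // 3 3 — converged
}
```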
CRDT vs OT
Operational Transform (Google Docs) achieves the same goal differently: transform concurrent ops to commute. Requires a server to linearize. CRDTs converge peer-to-peer without one. OT is dominant historically; CRDTs dominate new collaborative apps because they're easier to reason about offline.
- CRDT = merge is commutative + associative + idempotent ⇒ converges.
- G-Counter, OR-Set, LWW-Register, RGA — name and explain each.
- CRDTs trade serializability for coordination-free availability.
You can never tell a slow node from a dead one (FLP again). The job of a failure detector is to bound the wrongness. Heartbeats give a binary alive/dead signal; phi-accrual gives a continuous suspicion level adapting to network jitter; SWIM is gossip-based for thousands of nodes.
The three you should know
- Heartbeats — periodic ping with timeout. Simple, but the timeout choice is brutal: too short and you false-positive; too long and you take forever to detect a real failure.
- Phi-accrual (Hayashibara 2004) — instead of binary, output a continuous suspicion level φ = -log₁₀(probability of mistake). App tunes a threshold. Adapts to the observed inter-arrival distribution. Used in Cassandra, Akka. (Sketched after this list.)
- SWIM (Das 2002) — Scalable Weakly-consistent Infection-style membership. Each node periodically pings a random peer; on failure, asks K random peers to indirectly probe. Gossips suspicions and confirmations. Scales to thousands of nodes. Used in HashiCorp Serf, Memberlist, Consul.
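A phi-accrual sketch assuming normally-distributed inter-arrival times (the production Cassandra/Akka detectors add a sliding window and a minimum stddev floor):

```go
package main

import (
	"fmt"
	"math"
)

// PhiDetector fits a running normal distribution to heartbeat
// inter-arrival times (Welford's algorithm) and converts "time since
// last heartbeat" into a suspicion level φ.
type PhiDetector struct {
	n, mean, m2 float64
}

func (d *PhiDetector) Observe(interval float64) {
	d.n++
	delta := interval - d.mean
	d.mean += delta / d.n
	d.m2 += delta * (interval - d.mean)
}

// Phi = -log10(P(a heartbeat arrives later than sinceLast)).
func (d *PhiDetector) Phi(sinceLast float64) float64 {
	std := math.Sqrt(d.m2/d.n) + 1e-9 // avoid division by zero
	pLater := 0.5 * math.Erfc((sinceLast-d.mean)/(std*math.Sqrt2))
	return -math.Log10(pLater)
}

func main() {
	d := &PhiDetector{}
	for _, iv := range []float64{100, 102, 98, 101, 99} { // ms between beats
		d.Observe(iv)
	}
	// short silence → low φ; long silence → φ climbs fast
	fmt.Printf("φ(104ms)=%.1f  φ(110ms)=%.1f\n", d.Phi(104), d.Phi(110))
	// the app suspects the node once φ crosses its threshold (e.g. 8)
}
```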
- Heartbeat = binary, simple, hard to tune.
- Phi-accrual = continuous, adaptive, app picks threshold. Cassandra default.
- SWIM = gossip-based, scales to thousands of nodes.
Two storage philosophies: LSM trees (RocksDB, Cassandra, BigTable) optimize writes by appending to a log and merging in the background — fast writes, slower reads, needs compaction. B-trees (InnoDB, Postgres, BoltDB) update in place — fast reads, slower random writes, no compaction. LSM dominates write-heavy KV stores; B-tree dominates OLTP.
How LSM works in 4 lines
- Writes go to a memtable (sorted in-memory) + a write-ahead log on disk.
- When the memtable fills, it flushes to an SSTable (sorted, immutable file) at level 0.
- Background compaction merges SSTables across levels (size-tiered or leveled).
- Reads: check memtable, then each level; bloom filters skip levels that can't contain the key.
LSM tree
- Writes: fast (sequential append).
- Reads: slower (multi-level + bloom).
- Space amp: higher (multiple levels, tombstones).
- Compaction: required, background.
- Used by: RocksDB, Cassandra, BigTable, ScyllaDB, LevelDB.
B-tree
- Writes: slower (random IO).
- Reads: fast (single tree walk).
- Space amp: lower (in-place updates).
- Compaction: not needed; vacuum / page splits.
- Used by: InnoDB, Postgres, BoltDB.
- LSM = write-fast, read-amp, compaction. Dominates write-heavy KVs.
- B-tree = read-fast, in-place. Dominates OLTP.
- Both use bloom filters / page caches; the difference is what's on disk.
Six systems define the field. BigTable taught us range-partitioned KV. Dynamo taught us AP + tunable consistency. Spanner taught us global linearizability via TrueTime. Kafka taught us the log as primary abstraction. S3 taught us strong consistency at exabyte scale. CockroachDB taught us Spanner without atomic clocks. Know the one-line "killer move" of each.
| System | Killer move | Read this |
|---|---|---|
| BigTable (Chang 2006) | GFS + Chubby + tablet servers; sparse multi-dim sorted map. | Original BigTable paper. |
| Spanner (Corbett 2012) | Global linearizability via TrueTime + Paxos per shard + 2PC across shards. | Spanner paper, then CockroachDB design doc. |
| Dynamo (DeCandia 2007) | Consistent hashing + sloppy quorum + vector clocks. AP first. | Dynamo paper. Influenced Cassandra, Riak, DynamoDB. |
| Cassandra | Dynamo + BigTable hybrid. Tunable consistency (LOCAL_QUORUM, EACH_QUORUM). | Cassandra docs on consistency levels. |
| CockroachDB | Spanner without atomic clocks: HLC + max offset + Raft per range. | Cockroach blog, especially "Living Without Atomic Clocks." |
| Kafka (Kreps 2011) | Log-structured commit log; topic = partitioned log; ISR semantics. | LinkedIn paper + Confluent's exactly-once blog. |
| S3 | Strong read-after-write consistency since Dec 2020. Throughput sharded by prefix. | AWS strong consistency announcement; Marc Brooker's blog. |
Kafka exactly-once — the recipe
(1) Idempotent producer: producer ID + per-partition sequence number → broker dedups retried writes. (2) Transactions: producer wraps writes across partitions; transaction coordinator atomically commits or aborts. Consumers in read_committed isolation only see committed records.
Spanner: why both Paxos AND 2PC
Common confusion: "Spanner has Paxos, why does it need 2PC?" Paxos replicates within a single shard (Paxos group). Cross-shard transactions need 2PC across Paxos groups, where each "participant" is itself a fault-tolerant Paxos group. So 2PC's blocking problem is mitigated — the coordinator and every participant are themselves replicated.
- BigTable: range-partitioned KV on GFS+Chubby.
- Spanner: TrueTime + Paxos shards + 2PC across shards = strict serializable globally.
- Dynamo: AP + consistent hashing + sloppy quorum + vector clocks.
- Kafka: log = primary abstraction; exactly-once = idempotent producer + transactions.
- S3: strong R-A-W since Dec 2020; shard throughput by prefix.
The "back-of-envelope" service-design vocabulary: L4/L7 load balancing with P2C, cache hierarchies (cache-aside vs write-through), token-bucket rate limiting in Redis with Lua, fencing tokens for distributed locks, idempotency keys (Stripe), circuit breakers + bulkheads + jittered retries. Every system-design loop draws on these.
Load balancing
L4 = TCP-level (HAProxy in TCP mode, AWS NLB). L7 = HTTP-level (Envoy, NGINX, AWS ALB) — can route on path/header/cookie, terminate TLS, do retries. Algorithms: round-robin, least-connections, consistent-hash, EWMA.
Power of two choices (P2C) — pick 2 backends at random, send to the less loaded. Mitzenmacher's result: naive random assignment yields a max load of Θ(log N / log log N), while choosing the less loaded of two random picks drops it to Θ(log log N) — exponentially better, at almost zero coordination cost. Use it.
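A P2C picker sketch using in-flight request counts as the load signal (an assumption — EWMA latency is another common choice):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// Backend tracks in-flight requests as its load signal.
type Backend struct {
	Addr     string
	InFlight atomic.Int64
}

// Pick samples two backends uniformly (with replacement) and returns
// the less loaded one — that is the whole P2C algorithm.
func Pick(pool []*Backend) *Backend {
	a := pool[rand.Intn(len(pool))]
	b := pool[rand.Intn(len(pool))]
	if b.InFlight.Load() < a.InFlight.Load() {
		return b
	}
	return a
}

func main() {
	pool := []*Backend{{Addr: "10.0.0.1"}, {Addr: "10.0.0.2"}, {Addr: "10.0.0.3"}}
	for i := 0; i < 5; i++ {
		be := Pick(pool)
		be.InFlight.Add(1) // caller decrements when the request finishes
		fmt.Println("route →", be.Addr)
	}
}
```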
Caching
- Cache-aside (lazy): app reads cache; on miss, reads DB and populates cache. (Sketch after this list.)
- Write-through: writes hit cache and DB synchronously.
- Write-behind: writes hit cache; flushed to DB async. Risk: data loss on cache crash.
- Eviction policies: LRU (most common), LFU, ARC, TinyLFU, S3-FIFO, SIEVE. SIEVE (2024) is the new low-overhead default.
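A cache-aside read path with TTL, using an in-process map as a stand-in for Redis (names hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

type entry struct {
	val     string
	expires time.Time
}

// in-process map standing in for Redis/Memcached
var cache = map[string]entry{}

// dbLoad is a stand-in for the real source of truth.
func dbLoad(key string) string { return "row-for-" + key }

// Get is the whole cache-aside pattern: try the cache, fall back to
// the DB on miss, populate with a TTL so entries eventually refresh.
func Get(key string, ttl time.Duration) string {
	if e, ok := cache[key]; ok && time.Now().Before(e.expires) {
		return e.val // hit
	}
	v := dbLoad(key) // miss
	cache[key] = entry{val: v, expires: time.Now().Add(ttl)}
	return v
}

func main() {
	fmt.Println(Get("user:1", time.Minute)) // miss → DB → populate
	fmt.Println(Get("user:1", time.Minute)) // hit
}
```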
Rate limiting
Three classic shapes:
- Token bucket — steady refill rate, allows bursts up to bucket size. Most flexible.
- Leaky bucket — steady output rate, no burst. Use when downstream can't burst.
- Sliding window — count requests in last T seconds. Approximate variants for memory.
Atomic refill + deduct in a single round-trip. Tokens and last-refill timestamp are stored per key; the Lua script computes elapsed time, refills, and deducts in one shot.
-- KEYS[1] = bucket key; ARGV: rate, capacity, now
local b = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens, ts = tonumber(b[1]), tonumber(b[2])
local rate, cap, now = tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3])
if not tokens then tokens, ts = cap, now end
tokens = math.min(cap, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then tokens = tokens - 1; allowed = 1 end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 3600)
return allowed
For 100k+ req/s: shard Redis by key, or use approximate local counters with periodic sync to a central store. Stripe's pattern.
Other essentials
- Idempotency keys — client sends a UUID; server stores (key → result); safe retries. Stripe's pattern.
- Circuit breakers / bulkheads / timeouts / jittered exponential retries — Hystrix-style. Avoid retry storms with full jitter (random in [0, backoff]).
- Service mesh — Envoy / Istio sidecar proxies handle mTLS, retries, observability.
- Event sourcing — state = log of events. CQRS = separate read/write models.
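Fencing tokens, per the takeaway below: the lock service issues a monotonically increasing token with each grant, and the protected resource rejects anything stale — so a client whose lease expired mid-GC-pause can't clobber newer state. A sketch with hypothetical types:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// FencedStore is the protected resource: it remembers the newest
// fencing token it has seen and rejects writes from older ones.
type FencedStore struct {
	mu        sync.Mutex
	lastToken int64
	value     string
}

var ErrStaleToken = errors.New("stale fencing token")

func (s *FencedStore) Write(token int64, v string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.lastToken {
		return ErrStaleToken // holder of an expired lease, fenced off
	}
	s.lastToken, s.value = token, v
	return nil
}

func main() {
	s := &FencedStore{}
	fmt.Println(s.Write(1, "a")) // <nil> — lock holder with token 1
	fmt.Println(s.Write(2, "b")) // <nil> — lock re-granted, token 2
	fmt.Println(s.Write(1, "c")) // stale fencing token — old holder rejected
}
```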
- P2C beats round-robin almost everywhere with no coordination cost.
- Cache-aside is the default. Write-through for read-after-write.
- Token bucket in Redis with Lua = standard rate limiter; shard for >100k qps.
- Distributed locks need fencing tokens. Always.
- Retries: full jitter exponential backoff. Idempotency keys for safety.
TCP is reliable, ordered, has slow-start and HOL blocking. QUIC is UDP + TLS + multiplexed streams with no HOL between streams; HTTP/3 runs on it. HTTP/2 multiplexes on TCP (still HOL on packet loss); HTTP/3 fixes that on QUIC. gRPC = HTTP/2 + Protobuf, with streaming RPC and deadlines. TLS 1.3 is 1-RTT.
| Protocol | What it is | Why you'd use it |
|---|---|---|
| TCP | Stream, reliable, ordered. Slow-start, congestion control (CUBIC, BBR). HOL blocking on loss. | The default for everything reliable. |
| UDP | Datagram, unreliable, fast. No connection state. | QUIC's transport, DNS, video, games. |
| QUIC | UDP + TLS + multiplexed streams. No HOL between streams. 0-1 RTT handshake. | HTTP/3, faster mobile, packet-loss-tolerant. |
| HTTP/1.1 | One outstanding request at a time per connection (pipelining rarely works in practice). | Legacy; still simple to debug. |
| HTTP/2 | Multiplexed on TCP. HOL on packet loss (TCP, not stream). | Reduces connection overhead. |
| HTTP/3 | Multiplexed on QUIC. No HOL. | Mobile / lossy networks; Cloudflare default. |
| gRPC | HTTP/2 + Protobuf. Streaming RPC, deadlines, interceptors. | Internal RPC at Google, Anthropic, etc. |
| WebSocket | Long-lived bidirectional TCP connection. | Real-time push (chat, collab). |
| SSE | One-way server-to-client streaming over HTTP. | Token streaming from LLM APIs. |
| TLS 1.3 | 1-RTT handshake; 0-RTT for resumption (replay risk). | Default for everything HTTPS. |
- TCP HOL is at the transport layer; HTTP/2 doesn't fix it. HTTP/3 (QUIC) does.
- gRPC = HTTP/2 + Protobuf; bidirectional streaming.
- SSE for LLM token streaming; WebSocket for full bidi.
ML-specific distributed (covered elsewhere)
Parallel training and inference patterns are covered in the distributed training and LLM inference pages. Vector DB internals are in ML system design problem 8.
The 10 must-know papers
- Raft (Ongaro & Ousterhout 2014) — read in full
- Spanner (Corbett 2012)
- Dynamo (DeCandia 2007)
- BigTable (Chang 2006)
- GFS (Ghemawat 2003)
- MapReduce (Dean 2004)
- ZooKeeper (Hunt 2010)
- Kafka (Kreps 2011) — LinkedIn paper
- Calvin (Thomson 2012) — deterministic distributed DB
- FLP impossibility (Fischer, Lynch, Paterson 1985)
0 → hero distributed systems path (the common weak area — invest heavily)
- foundation Designing Data-Intensive Applications (Kleppmann) — read all 12 chapters; 5–9 are core
- foundation Martin Kleppmann's Distributed Systems lecture series (Cambridge, free)
- foundation ByteByteGo — system design illustrations
- build MIT 6.824 Distributed Systems — implement Raft (Lab 2) and KVStore (Lab 3) in Go. The single most useful exercise.
- build Read etcd's Raft implementation
- depth Raft (Ongaro & Ousterhout 2014) — read in full
- depth Spanner (Corbett 2012)
- depth Dynamo (DeCandia 2007)
- depth BigTable (Chang 2006)
- depth GFS (Ghemawat 2003)
- depth MapReduce (Dean 2004)
- depth Martin Kleppmann's blog
- depth Marc Brooker's blog (AWS Principal Engineer — distributed systems essays)
- depth Murat Demirbas — paper reviews
- depth Jepsen — distributed systems testing reports (consistency violations in real systems)
- depth Companion: full ~104KB curriculum at ../distributed_systems_curriculum.md
Distributed systems quiz — readiness check
- Explain CAP precisely.
Under network partition, you choose between consistency (linearizability) and availability (every non-failed node responds). P is not optional in real networks — you choose CP or AP during a partition. Common misreading: "pick 2 of 3" — wrong, because real systems must tolerate partitions.
- Difference between snapshot isolation and serializable?
SI doesn't prevent write skew: two transactions read overlapping data, then each updates disjoint rows, violating a constraint that depends on both reads. SSI (Cahill 2008) adds predicate locks + dependency tracking to detect write skew. PostgreSQL's "serializable" is SSI.
- How does Spanner achieve external consistency?
TrueTime gives bounded uncertainty intervals via GPS + atomic clocks. Commit-wait: after assigning a commit timestamp, wait out the uncertainty interval before releasing the commit, so any transaction that starts afterward is guaranteed a strictly later timestamp — external consistency.
- Walk through Raft.
Three sub-problems: (1) Leader election: random timeouts → become candidate → request votes → win majority → leader. (2) Log replication: leader appends + replicates + commits when majority ack. (3) Safety: leader's log wins; only commit current-term entries. Membership changes via joint consensus.
- How does Dynamo handle conflicts?
Vector clocks identify concurrent versions; client (or LWW) resolves on read. Sloppy quorum + hinted handoff for availability. Read repair: when read sees divergent versions, async write the merged value back. Anti-entropy via Merkle tree comparison between replicas.
- Compare LSM tree vs B-tree.
LSM: write-fast (sequential append), read-amp (bloom + multi-level lookup), needs compaction. B-tree: balanced, in-place updates, slower writes. LSM dominates write-heavy KVs (Cassandra, RocksDB). B-tree dominates OLTP (Postgres, InnoDB).
- Design a rate limiter at 100k req/s per key.
Token bucket per key in Redis with Lua atomic update: store tokens + last-refill timestamp; on each request, refill based on (now − last) × rate, capped; deduct if enough; return decision. Higher scale: shard Redis or local approximate counters with periodic sync.
- Kafka exactly-once semantics — how?
(1) Idempotent producer: PID + per-partition sequence number → broker dedups retries. (2) Transactions: producer wraps writes across partitions in a transaction; transaction coordinator atomically commits/aborts. Consumers in read_committed only see committed messages.
- Consistent hashing vs jump hash vs rendezvous?
Consistent hash: ring with virtual nodes; adding a node moves K/N keys. Jump hash (Lamping & Veach 2014): O(log n) compute, no node IDs. Rendezvous (HRW): hash (key, node) for all nodes; pick max — handles weighted nodes naturally.
- What is FLP and what does it mean for production systems?
FLP impossibility (Fischer, Lynch, Paterson 1985): in a fully asynchronous network with even one crash, no deterministic protocol guarantees both safety and liveness. Real systems sidestep with timeouts (semi-synchrony assumption), randomization, or accept rare unavailability. Raft and Paxos prefer safety over liveness.
- Why use vector clocks vs Lamport clocks?
Lamport clock: single counter; if a → b then L(a) < L(b), but the converse fails — you can't detect concurrency. Vector clock: per-process counter array; captures full causality (a → b iff VC(a) < VC(b)). Cost: O(N) per event for N processes. Use vector clocks when you need to distinguish concurrent from causally-ordered events; Lamport when you only need a total order.
- Quorum math: R + W > N — why?
If R + W > N, every read set intersects every write set (pigeonhole). So a read sees at least one replica that has the latest write. Strong-ish consistency without coordination. R = W = ⌊N/2⌋ + 1 (a majority) is the typical balanced choice.
- Explain CRDTs and give two examples.
Conflict-free replicated data types: structures that converge under concurrent updates without coordination. G-Counter: per-replica counter, sum on read. OR-Set: tag adds with unique IDs; deletes only remove tags you've seen. Used in collaborative editors (Figma, Linear) and offline-first apps.
- Why does Spanner need 2PC despite Paxos?
Paxos is for replication within a Paxos group (one shard). Cross-shard transactions need 2PC across groups. Spanner: 2PC over Paxos — each "participant" in 2PC is a Paxos group, so the transaction is durable even with single-node failures.
- What is sloppy quorum + hinted handoff?
If a designated replica is down, the write goes to the next node in the ring (sloppy — outside the strict quorum). The receiving node holds the data as a "hint" and forwards to the original when it recovers. Tradeoff: improves write availability; risks stale reads on the original.
- Phi-accrual vs heartbeat for failure detection?
Heartbeat: ping with timeout; binary alive/dead. Phi-accrual (Hayashibara 2004): probabilistic — outputs a continuous suspicion level φ; threshold tuning per-app. Adapts to network jitter. Used in Cassandra, Akka.
- Design a distributed lock service.
Cluster of 5 nodes running Raft. Lock = key in replicated log with TTL + owner. Acquire: CAS (set if not exists); on fail, watch for delete event. Release: delete key. Crash safety: TTL releases abandoned locks. Use fencing tokens (monotonic) to prevent stale clients from acting after lease expiry.
- Explain MapReduce in one paragraph.
Distributed computation framework. User writes map(k, v) → list[(k, v)] and reduce(k, list[v]) → output. System partitions input across mappers, shuffles by key, runs reducers on grouped data. Handles fault tolerance (re-run failed tasks), data locality (run map close to data), straggler mitigation (speculative execution).
- What is read-your-writes consistency?
A user always reads their own latest writes (even in an eventually-consistent system). Implemented by routing reads to the same replica that handled writes, or by tracking client write timestamps and reading from a replica caught up past that timestamp.
- What's PACELC and how does it extend CAP?
PACELC (Abadi 2012): partitioned → choose C or A; else (no partition) → L (latency) or C. Most real systems trade off in normal operation too (not just during partitions). Spanner: CP / EC. Dynamo: AP / EL.