Probability & statistics — the working engineer’s course
From counting to the CLT to A/B testing — the full probability-and-statistics stack behind ML interviews, in textbook order (Blitzstein/Hwang flavor) but ruthlessly practical. Every concept arrives the same way: a tiny example with real numbers first, then the formal version with every symbol glossed, then where it shows up in ML and interviews. Part IV is pure interview craft: the puzzle patterns and the rapid-fire bank.
What's in here
- How to use this + the map
- Counting without tears
- Axioms, conditioning, independence
- Bayes' rule, done properly
- Random variables, PMF / PDF / CDF
- The distribution zoo and how it hangs together
- Expectation, variance, and the indicator superpower
- Joints, conditionals, and the tower rule
- Inequalities, LLN, and the CLT
- Estimators: MLE, MAP, and how to judge them
- Confidence intervals & the bootstrap
- Hypothesis testing without the cargo cult
- A/B testing: statistics meets production
- Bayesian statistics in practice
- The puzzle playbook (the classics, with the patterns)
- Statistics for ML + Rapid Fire
This chapter orients you before the deep dives begin. You'll see the fundamental split between probability and statistics, learn exactly which four skill-clusters interviewers probe, and internalize five tactical reflexes that will fire automatically when you're under pressure. Everything on this page gets exercised repeatedly in later chapters.
Think of a coin with bias p. If someone hands you p = 0.7 and asks "what's the chance of 8+ heads in 10 flips?" — that is a probability question. You know the model; you predict data.
If instead someone hands you the sequence H H T H H H T H H H and asks "is this coin fair?" — that is a statistics question. You have data; you infer the model.
After surveying hundreds of ML interview reports, the questions cluster into four archetypes. Know the archetype and you know the playbook.
Trigger: The interviewer poses a probability or statistics problem without specifying the setup.
- State out loud which archetype it is: "This feels like a conditional-probability puzzle — let me define the sample space first."
- Clarify any ambiguous quantities (replacement? ordered? per-user or per-session?). Asking is a signal of rigor, not ignorance.
- Use the 5-reflex protocol (below) to move systematically.
Never: jump to a formula before defining what probability space you're working in.
These five moves cover the vast majority of probability puzzles and estimation questions. Internalize them as a mental checklist, not as a linear recipe — sometimes you'll use 1, 3, and 5; sometimes just 4.
The chapters are ordered so each one builds on the previous. Here is where you are headed and why the sequence is the way it is.
In everyday English these words are synonyms. In statistics they are NOT. Probability is a function of data given a fixed model: P(data | θ) treated as a random-variable statement. Likelihood is the same expression viewed as a function of θ given fixed data: L(θ) = P(data | θ). Likelihood is NOT a probability distribution over θ — it does not integrate to 1. Confusing the two is a common trap on MLE questions.
Nearly every "statistical thinking" probe is really asking whether you know the confounders. The safe template: "I see an association. Before concluding causation, I want to check (1) temporal order — does X precede Y? (2) alternative common causes — is there a Z driving both? (3) whether randomization or a natural experiment can help." Saying this template out loud signals strong statistical maturity.
- Probability: model → data. Statistics: data → model. Every ML system does both.
- Four interview archetypes: puzzles, estimator reasoning, testing/A/B, statistical thinking.
- Five reflexes: sample space → distribution → indicators → condition on first step → extreme cases.
- "Likelihood" is P(data|θ) viewed as a function of θ; it is NOT a probability over θ.
Probability and statistics are inverse problems. Interviews probe four skill clusters: puzzles, estimator reasoning, A/B/testing, and open-ended statistical judgment. Arm yourself with five reflexes — define the sample space, name the distribution, use indicators, condition on the first step, sanity-check extremes — and you'll have a systematic attack for any question this page throws at you.
Q1. What is the difference between probability and statistics? Give a concrete ML example of each.
Q2. What does a p-value of 0.03 mean? (Give the precise definition, not the common shortcut.)
Q3. An interviewer asks: "Is this a probability problem or a statistics problem?" You're given a trained Gaussian mixture model and asked to compute P(new point belongs to cluster 1). Which is it?
Q4. Which of the 5 reflexes would you fire first on: "What is the expected number of cards you must draw from a shuffled deck before seeing the first ace?"
Q5. "Likelihood" and "probability" — when do they differ and why does the distinction matter for MLE?
Q6. You see that a model's F1 score increased from 0.72 to 0.75 on the test set. Is this improvement real? What do you want to know?
Q7. What is Reflex 4 (condition on first step) and when does it help that other methods don't?
Q8. You are told that in a production ML system, the training loss went to zero but the test loss is high. Is this a probability problem, a statistics problem, or something else? Explain.
Q9. Explain in one sentence each what the 5 extremes-check reflex would catch in these scenarios: (a) you compute P(at least one head in 10 flips) = 1.024, (b) you compute E[number of 6s in one die roll] = 0.
Q10. A colleague says: "We saw a statistically significant lift in our A/B test, so we shipped the feature." What questions would you ask before agreeing this was the right call?
Counting is the arithmetic engine underneath probability: once you can enumerate a sample space, you can divide to get probabilities. This chapter builds from the simplest rule (multiply) to permutations, combinations, stars-and-bars, and inclusion-exclusion — always example first, formula second. Classic interview traps (committees vs arrangements, "at least one" vs direct counting) are called out explicitly.
Concrete first. You pick an outfit: 3 shirts and 4 pants. How many outfits? Draw a tree: for each of the 3 shirts you have 4 pant choices, giving 3 × 4 = 12 outfits. You never need to list them all.
Now generalize. If task 1 has n₁ outcomes and, regardless of what happened in task 1, task 2 has n₂ outcomes, then the two-step process has n₁ · n₂ outcomes. The "regardless" is key — outcomes of different steps must be independent of each other for simple multiplication to work.
Three books on a shelf — your running example. Call the books A, B, C. How many ways to arrange all three? You have 3 choices for position 1, then 2 remaining for position 2, then 1 for position 3: 3 × 2 × 1 = 6. The six arrangements: ABC, ACB, BAC, BCA, CAB, CBA. We'll revisit this throughout.
Concrete first. From the 3 books A, B, C, choose 2 and arrange them in a row. Your choices: AB, BA, AC, CA, BC, CB — that's 6 ordered pairs. Using the multiplication rule: 3 choices for spot 1, 2 remaining choices for spot 2 → 3 × 2 = 6.
General formula. Picking k items in order from n distinct items without replacement:
For our example: P(3,2) = 3!/1! = 6. Full arrangement of all 3: P(3,3) = 3! = 6 (confirming the earlier count).
n! = P(n,n) = arrangements of ALL n items. P(n,k) is arrangements of a SUBSET of size k. Forgetting the (n−k)! denominator when only choosing k ≤ n items is the most common permutation error.
Concrete first. From books A, B, C, choose 2 to put in a bag (order doesn't matter). The choices: {A,B}, {A,C}, {B,C} — just 3. Compare to the 6 ordered pairs above: we've divided by 2! = 2 because each unordered pair was counted twice (AB and BA are the same bag).
General formula. Choosing k items from n without replacement, order not mattering:
For our example: C(3,2) = 3!/(2!·1!) = 3. Confirm: {A,B}, {A,C}, {B,C}. ✓
Key symmetry: C(n,k) = C(n, n−k). Choosing which k to include is the same as choosing which n−k to exclude. C(10,3) = C(10,7) = 120.
Poker hands warm-up. A 52-card deck, 5-card hand, order doesn't matter: C(52,5) = 52!/(5!·47!) = 2,598,960 possible hands. That's the denominator for every poker probability.
Trigger: You're deciding which formula to use.
- Ask: "Does the order of selection matter here?" If picking a president, VP, and secretary from 10 people → YES (different people in different roles) → permutation P(10,3).
- If picking a 3-person committee from 10 people → NO (same 3 people in any order = same committee) → combination C(10,3).
Never: use C(n,k) when the problem specifies roles or positions — that's the single most common counting error.
Counting problems come in exactly four flavors based on two binary questions: Is selection ordered? Is replacement allowed? The grid below maps each case to its formula and a canonical example.
| With replacement | Without replacement | |
|---|---|---|
| Ordered | nk Pick k items from n, each time putting it back; sequence matters. Example: 4-digit PIN from digits 0–9: 104 = 10,000. |
P(n,k) = n!/(n−k)! Pick k items from n, no repeats; sequence matters. Example: Gold/Silver/Bronze from 8 athletes: P(8,3) = 336. |
| Unordered | C(n+k−1, k) ("stars and bars") Pick k items from n types, repeats ok, order irrelevant. Example: 3 donuts from 4 flavors: C(6,3) = 20. |
C(n,k) = n!/(k!(n−k)!) Pick k items from n, no repeats, order irrelevant. Example: 5-card poker hand: C(52,5) = 2,598,960. |
Trigger: Any "how many ways" or "what is P(event)" combinatorics question.
- Ask: "Ordered or unordered?" (roles → ordered; groups/committees/sets → unordered)
- Ask: "With or without replacement?" (dice, coins → with; cards from a deck, people → without)
- Read off the formula from the matching cell.
- Sanity-check: for small n and k, can you enumerate by hand and confirm?
Never: reach for a formula before placing the problem in the grid.
The donut problem. You walk into a shop with 4 flavors (glazed, chocolate, jelly, plain). You want to pick 3 donuts; you can pick any flavor more than once; the bag is unordered. How many distinct bags?
Represent each selection as 3 stars placed among 4 bins separated by 3 bars. Example: ★ | ★★ | | means 1 glazed, 2 chocolate, 0 jelly, 0 plain. The number of ways to arrange 3 stars and 3 bars among 6 positions is:
You can verify the small case: with 2 flavors and 2 donuts, C(3,2) = 3 bags: {GG, GC, CC}. ✓
Stars-and-bars counts unordered multisets with replacement. The multinomial coefficient C(n; n₁,n₂,…,nk) = n!/(n₁!n₂!…nk!) counts ordered sequences (arrangements) with a fixed number of each type — e.g., arrangements of the letters in MISSISSIPPI. Different grid cell, different formula.
The problem with "or." How many integers from 1 to 30 are divisible by 2 or by 3? Naively: 15 divisible by 2, 10 divisible by 3. Adding gives 25 — but that double-counts the 5 numbers divisible by 6. Correct answer: 15 + 10 − 5 = 20.
This is inclusion-exclusion for two sets: |A ∪ B| = |A| + |B| − |A ∩ B|. For three sets:
Derangements — the classic application. A derangement is a permutation where no item is in its original position (think: a secret Santa where no one draws themselves). Let D(n) be the count. For n=3 books: total permutations = 6; the derangements of [A,B,C] are [B,C,A] and [C,A,B] — only 2.
Using inclusion-exclusion, the general formula is:
For n=3: D(3) = 3!(1 − 1 + 1/2 − 1/6) = 6 · (1/3) = 2. ✓
The surprise is that you're counting pairs, not people. With 23 people there are C(23,2) = 253 pairs, each with a 1/365 chance of matching. 253/365 ≈ 0.69 — a rough indicator that a match is likely. (The exact calculation gives 0.507 at n=23.) Any interview question involving "how many people until probability of collision exceeds 50%" uses this logic — and it generalizes to hash tables, birthday attacks in cryptography, and duplicate-detection in data pipelines.
C(n,0) = 1 (one way to choose nothing). C(n,n) = 1 (one way to choose everything). C(n,1) = n (n ways to choose one item). These three anchors let you sanity-check any combination formula: if your formula gives something other than these values at the extremes, you have an error.
Also: C(n,k) = 0 whenever k > n or k < 0. You can't choose more items than exist.
When items are drawn without replacement, the denominator of your probability calculation should also be "without replacement" — i.e., the number of ordered or unordered subsets, not the number of sequences with repeats. A common error: computing P(royal flush) using 52^5 in the denominator (with replacement) instead of C(52,5) = 2,598,960 (without replacement). The card game uses without replacement; always match numerator and denominator counting method.
- Four-cell grid: ordered/unordered × with/without replacement. Map the problem before touching a formula.
- Permutations P(n,k) = n!/(n−k)!: ordered, no replacement. Combinations C(n,k) = n!/(k!(n−k)!): unordered, no replacement.
- Stars-and-bars C(n+k−1,k): unordered, with replacement (multisets).
- Inclusion-exclusion: |A∪B| = |A|+|B|−|A∩B|. "At least one" → complement.
- D(n) ≈ n!/e — about 37% of permutations are derangements.
Counting is multiplication applied systematically. The 4-case grid (ordered vs unordered, replacement vs not) is the decision tree — land the problem in the right cell and the formula follows. Permutations count ordered subsets; combinations count unordered subsets; stars-and-bars handles unordered multisets. Inclusion-exclusion fixes overcounting in overlapping groups. The complement trick ("at least one" = 1 − P(none)) is almost always simpler than direct counting.
Q1. In how many ways can 5 people sit in a row? In a circle?
Q2. A password must be exactly 8 characters, each a lowercase letter or digit (36 characters total), with at least one digit. How many valid passwords are there?
Q3. How many 5-card poker hands contain exactly two pairs (and nothing better)?
Q4. How many arrangements of the letters in "MISSISSIPPI" are there?
Q5. You have 10 different books. In how many ways can you choose 3 for the left shelf and 3 for the right shelf (the two shelves are labeled; order on each shelf doesn't matter)?
Q6. What is the birthday problem, and why does the answer surprise people?
Q7. A bag has 5 red and 3 blue marbles. You draw 3 without replacement. What is P(exactly 2 red)?
Q8. You roll two dice. In how many outcomes does the sum equal 7? What is the probability?
Q9. What is a derangement, and what fraction of all permutations are derangements as n → ∞?
Q10. A student randomly guesses on a 5-question multiple-choice test (4 options each). What is the probability of getting at least 3 correct?
Q11. How many ways can you distribute 7 identical candies among 3 children so that each child gets at least 1?
This chapter builds the three pillars every probability calculation rests on: the axioms that make probability well-defined, the conditioning operation that lets you reason with partial information, and the independence property that makes large models tractable. Every later chapter in this page depends on these ideas — get them cold.
Fix a running example we'll use throughout: roll two fair six-sided dice. Call them die 1 and die 2.
The sample space Ω is the set of all outcomes the experiment can produce. Here:
An event is any subset of Ω — a collection of outcomes you want to lump together and ask "what is the chance of this?"
Probability is a function P that assigns a number to each event. Kolmogorov (1933) gave three rules that pin down every valid probability measure. Memorise the plain-words version; the formal version will follow naturally.
From just these three axioms you can derive everything: the complement rule, inclusion-exclusion, the union bound, conditional probability. They are not arbitrary — they are the minimal rules that make probability coherent.
Two dice — quick sanity checks:
- P(sum = 7) = 6/36 = 1/6. Six favourable outcomes, 36 equally likely total.
- P(sum ≠ 7) = 1 − 1/6 = 5/6. Complement rule.
- P(A ∪ C) where A = {sum=7} and C = {doubles}: A ∩ C = ∅ (no double sums to 7), so P(A ∪ C) = 6/36 + 6/36 = 12/36 = 1/3.
You roll the two dice behind a screen. A friend peeks and tells you "die 1 is a 3." Knowing this, what is the probability the sum is 7?
Before the hint, 36 outcomes were possible. After the hint, only 6 outcomes survive — those where die 1 = 3: (3,1),(3,2),(3,3),(3,4),(3,5),(3,6). Of these six, exactly one has sum 7: (3,4). So the answer is 1/6.
That reasoning is the entire mechanism of conditional probability. The hint shrinks the universe from 36 outcomes to 6, and you re-do your calculation inside the smaller universe.
Dice example formally: A = {sum = 7}, B = {die 1 = 3}.
- P(A ∩ B) = P({(3,4)}) = 1/36
- P(B) = 6/36 = 1/6
- P(A | B) = (1/36) / (1/6) = 1/6. Matches the counting argument.
Trigger: any question of the form "given X, what is the probability of Y?"
- Write down P(Y | X) = P(Y ∩ X) / P(X).
- Find both probabilities on the original (unconditional) sample space.
- Divide. Never redefine the sample space — just reweight it.
Never: confuse P(Y | X) with P(X | Y). These are generally not equal — that swap is the prosecutor's fallacy and costs people interviews.
Rearranging the conditional probability formula gives the chain rule (also called the multiplication rule):
This extends to any number of events:
Dice example: P(die 1 = 3 AND sum = 7) = P(sum = 7 | die 1 = 3) · P(die 1 = 3) = (1/6)(1/6) = 1/36. ✓
The chain rule is the backbone of probabilistic graphical models and language model factorisation. GPT-style models generate text by computing P(tokent | token1, …, tokent-1) for each position — that is literally the chain rule applied left-to-right.
After two dice are rolled, does knowing die 1 = 3 tell you anything about whether die 2 = 5? No — the dice don't talk to each other. Formally:
Check with dice: A = {die 1 = 3}, B = {die 2 = 5}.
- P(A) = 1/6, P(B) = 1/6.
- P(A ∩ B) = P({(3,5)}) = 1/36 = (1/6)(1/6). ✓ Independent.
Contrast: A = {sum = 7}, B = {die 1 = 3}.
- P(A) = 1/6, P(B) = 1/6.
- P(A ∩ B) = 1/36 = (1/6)(1/6). Equal — so sum = 7 and die 1 = 3 are independent!
- Sanity check: P(sum = 7 | die 1 = 3) = 1/6 = P(sum = 7). Knowing die 1 doesn't change the probability of sum 7? Yes — because die 2 is equally likely to be any of {1,…,6}, and exactly one of those completes a sum of 7.
Non-independent events: A = {sum = 7}, D = {die 1 = 1}.
- P(D) = 1/6, P(A) = 1/6, P(A ∩ D) = P({(1,6)}) = 1/36.
- P(A | D) = (1/36)/(1/6) = 1/6 = P(A). Still independent! (Every specific value of die 1 leaves the same 1-in-6 chance of sum 7.)
Try: A = {sum = 7}, E = {sum = 8}. P(A) = 6/36, P(E) = 5/36, P(A ∩ E) = 0 (a roll can't sum to both 7 and 8). Product = 6/36 · 5/36 ≠ 0. Not independent. They are, however, mutually exclusive — which leads directly to the big confusion.
This is the most commonly confused pair in basic probability. They sound similar but mean almost opposite things.
| Property | Mutual exclusivity | Independence |
|---|---|---|
| Meaning | A and B cannot both happen: A ∩ B = ∅ | A and B carry no information about each other: P(A|B) = P(A) |
| Joint probability | P(A ∩ B) = 0 | P(A ∩ B) = P(A)·P(B) |
| Dice example | {sum=7} and {sum=8} — can't both occur | {die 1 = 3} and {die 2 = 5} — no information flows |
| Can they co-exist? | Only if one has probability 0 | Yes, generically |
| Intuition failure | People think "separate events" = independent | People think independent = can't happen together |
The killer insight: if P(A) > 0 and P(B) > 0, then mutually exclusive events are never independent. Knowing A occurred (P(A) > 0 worth of probability happened) guarantees B did not — so B's probability given A is 0, not P(B). Information was conveyed. They are maximally dependent.
One-liner to remember: "Mutually exclusive events are the most dependent events there are — if one happens, the other definitely didn't."
A and B are conditionally independent given C when:
Naive Bayes classifiers assume that all features are conditionally independent given the class label. This is almost certainly false in reality, yet the model often works well in practice — understanding why requires conditional independence.
Dice micro-example: Let A = {die 1 is odd}, B = {die 2 is odd}, C = {sum is even}. Given that the sum is even, die 1 and die 2 must both be odd or both be even — knowing die 1 is odd forces die 2 to be odd. So A and B are not conditionally independent given C, even though they are marginally independent.
Suppose you want P(some event A) but it's hard to compute directly. The trick: find a partition of Ω — a set of mutually exclusive, exhaustive events B1, B2, …, Bk that cover everything — then compute P(A | Bi) separately for each case and average.
Trigger: you want P(A) and direct counting is messy.
- Ask: "What would I wish I knew before computing P(A)?" Name that hidden variable B.
- List the possible values of B: B = b1, b2, …
- Compute P(A | B = bi) for each value — these should be easy.
- Weight by P(B = bi) and sum: P(A) = Σ P(A|B=bi)·P(B=bi).
Never: skip the partition check. If the Bi are not exhaustive, you'll undercount probability mass.
Dice worked example: A = {sum is prime}. Primes in range 2–12: {2, 3, 5, 7, 11}. Partition on die 1 value (Bi = {die 1 = i}):
- Die 1 = 1: need die 2 ∈ {1,2,4,6} (sum ∈ {2,3,5,7}). Count: 4. P(A|die1=1) = 4/6.
- Die 1 = 2: need die 2 ∈ {1,3,5} (sum ∈ {3,5,7}). Count: 3. P(A|die1=2) = 3/6.
- Die 1 = 3: need die 2 ∈ {2,4} (sum ∈ {5,7}). Count: 2. P(A|die1=3) = 2/6.
- Die 1 = 4: need die 2 ∈ {1,3} (sum ∈ {5,7}). Count: 2. P(A|die1=4) = 2/6.
- Die 1 = 5: need die 2 ∈ {2,6} (sum ∈ {7,11}). Count: 2. P(A|die1=5) = 2/6.
- Die 1 = 6: need die 2 ∈ {1,5} (sum ∈ {7,11}). Count: 2. P(A|die1=6) = 2/6.
P(A) = (1/6)(4+3+2+2+2+2)/6 = 15/36 = 5/12. (We can verify by counting directly: 15 prime-sum pairs out of 36.)
Direct counting of "at least one six" is error-prone — you'd have to subtract overlaps. The complement approach is the standard tool.
Complement rule: P(at least one 6 in 4 rolls) = 1 − P(no 6 in 4 rolls).
Each roll is independent. P(no 6 on one roll) = 5/6. So:
The Chevalier de Méré story (3 sentences): In 1654, the French gambler Antoine Gombaud (Chevalier de Méré) bet that he'd see at least one double-six in 24 rolls of two dice, reasoning it should be as likely as one six in four rolls of one die. He was losing money and asked Blaise Pascal why. Pascal showed him that P(at least one double-six in 24 rolls) = 1 − (35/36)²⁴ ≈ 0.491 — just under 50% — while the analogous single-die game had probability ≈ 0.518, and the two events are not equivalent because the probability of a double-six (1/36) differs from the probability of a six (1/6) in a way that doesn't scale linearly with the number of rolls.
This correspondence between Pascal and Fermat founded modern probability theory. The lesson for interviews: when you see "at least one," flip to the complement. The complement is almost always one clean product.
P(at least one event happens in n trials) = 1 − P(event fails on all n trials) = 1 − (1−p)^n when trials are independent with per-trial probability p. Never try to add up the direct cases — the overlaps will kill you.
- Sample space = all possible outcomes; event = subset of sample space. Three axioms: non-negativity, normalisation, countable additivity.
- P(A | B) = P(A ∩ B) / P(B) — shrink the universe to outcomes where B occurred.
- Independence: P(A ∩ B) = P(A)P(B). Mutual exclusivity: P(A ∩ B) = 0. Two positive-probability events cannot be both. Mutually exclusive events are maximally dependent.
- Law of total probability: partition on what you wish you knew, average the conditionals. "At least one" → complement.
"Are independent events mutually exclusive?" — the trap. Answer: no, if both have positive probability. State the definition of each, give the two-dice example, and note that mutual exclusivity implies dependence (not independence).
"What is conditional probability geometrically?" — describe the shrink-the-universe picture: you restrict the sample space to outcomes where the condition holds, then re-normalise.
Q1. State the three axioms of probability and derive the complement rule from them.
Q2. You roll two fair dice. What is P(sum = 8 | die 1 is even)?
Q3. Are {sum = 7} and {die 1 = 4} independent? Prove it.
Q4. If A and B are mutually exclusive and both have positive probability, are they independent? Why or why not?
Q5. Use the law of total probability to find P(sum of two dice is odd).
Q6. What is P(at least one 6 in 6 rolls of a fair die)? Is this greater than, equal to, or less than 2/3?
Q7. A bag has 3 red and 2 blue marbles. You draw two without replacement. What is P(both red)?
Q8. Define conditional independence. Give an example where A and B are marginally dependent but conditionally independent given C.
Q9. Suppose P(A) = 0.4, P(B) = 0.5, P(A | B) = 0.6. Find P(A ∩ B), P(B | A), P(A ∪ B).
Q10. The Chevalier de Méré believed P(≥1 six in 4 rolls) = P(≥1 double-six in 24 rolls). Why is this wrong, and what are the actual values?
Q11. In a machine learning context, where does the chain rule of probability appear?
Q12. P(A) = 0.3 and P(B) = 0.4. What is the range of possible values for P(A ∪ B)?
Probability is defined by three axioms; everything else is derived. Conditioning shrinks the universe — P(A|B) = P(A∩B)/P(B). Independence means information doesn't flow — P(A∩B) = P(A)P(B). Mutual exclusivity (P(A∩B)=0) is the opposite of independence when both events have positive probability. The law of total probability says: partition on what you wish you knew, then average. "At least one" problems flip to the complement.
Bayes' rule converts "what is the probability of a positive test given the disease?" into "what is the probability of the disease given a positive test?" — the question you actually care about. This chapter derives the rule from first principles, builds the 10,000-person counting table that makes base-rate neglect impossible, introduces the odds form for fast mental arithmetic, and connects the whole machinery to Naive Bayes, MAP estimation, and Bayesian A/B testing.
From Chapter 3, we know that for any two events A and B:
Set the two right-hand sides equal and divide both sides by P(B):
There is no magic here — it is literally just rearranging the chain rule. The insight is in what you choose to call A and B.
A disease affects 1% of the population. A test for this disease has:
- Sensitivity (true positive rate): 99% — if you have the disease, the test is positive 99% of the time.
- Specificity (true negative rate): 95% — if you don't have the disease, the test is negative 95% of the time. Equivalently, the false positive rate is 5%.
You test positive. What is the probability you have the disease?
Most people, including many physicians in studies, say "about 99%" — because the test is "99% accurate." The actual answer is about 17%. Here is why.
Instead of manipulating fractions, imagine 10,000 people drawn randomly from the population and trace them through the test.
| Test positive (+) | Test negative (−) | Total | |
|---|---|---|---|
| Have disease | 99 true positives |
1 false negatives |
100 (1% of 10,000) |
| No disease | 495 false positives |
9,405 true negatives |
9,900 (99% of 10,000) |
| Total | 594 | 9,406 | 10,000 |
How to read the table:
- 100 people have the disease. The test catches 99% of them → 99 true positives, 1 false negative.
- 9,900 people don't have the disease. The test wrongly flags 5% of them → 495 false positives, 9,405 true negatives.
- Total positive tests: 99 + 495 = 594.
- Of these 594 positive tests, only 99 are real disease cases.
The base-rate punchline: The 495 false positives swamp the 99 true positives because the disease is rare. A positive test shifts your probability from 1% (the prior) to ~17% (the posterior) — a 17× increase, but still well short of certainty. If the disease had 10% prevalence, the posterior would be ~69%. The prior matters enormously.
The error: focusing only on the sensitivity (99%) and ignoring that the disease is rare (1%). People mentally replace "P(disease | +)" with "P(+ | disease)" — an instance of the prosecutor's fallacy (confusing the probability of evidence given guilt with the probability of guilt given evidence).
Why it happens: the likelihood P(+ | disease) = 0.99 is a vivid number. The prior P(disease) = 0.01 is abstract. Human minds anchor on the vivid number and fail to multiply by the base rate.
The fix: always build the 2×2 table with a large round number (10,000 works well). Fill in the four cells using the prior and the two error rates. Read off the conditional probabilities directly from the cell counts. No formula, no algebra, no errors.
ML relevance: this exact phenomenon appears in anomaly detection, fraud detection, and rare-event classifiers. When the positive class is rare (0.1% fraud rate), even a 99% accurate model produces mostly false positives among its flagged predictions. Always check precision, not just recall or accuracy.
Trigger: any question giving you a base rate, a sensitivity (true positive rate), and a false positive rate, asking for P(condition | evidence).
- Write down N = 10,000 (or any large round number).
- Fill the disease row: N × prevalence people have the disease. Of these, sensitivity × (N × prevalence) test positive.
- Fill the no-disease row: N × (1 − prevalence) people don't. Of these, false positive rate × N × (1 − prevalence) test positive.
- Add the two positive cells to get total positives.
- Divide true positives by total positives. Done.
Never: try to do this in your head with fractions and give up. The table is foolproof and fast to sketch on a whiteboard.
There is a cleaner version of Bayes' rule that makes mental updating instant. Define the odds of an event A as:
Now write Bayes' rule for A vs Ac both conditioned on evidence B:
Medical test in odds form:
- Prior odds = P(disease)/P(no disease) = 0.01/0.99 ≈ 1:99.
- Likelihood ratio = P(+|disease)/P(+|no disease) = 0.99/0.05 = 19.8.
- Posterior odds = 19.8 × (1/99) ≈ 0.2 = 1:5.
- Posterior probability = 1/(1+5) ≈ 0.167. ✓ Matches the table.
The odds form is the fast mental tool: the likelihood ratio is a single number that tells you how strongly the evidence shifts your belief. A likelihood ratio of 1 means the evidence is neutral. Greater than 1 means evidence supports A. Less than 1 means evidence supports Ac.
Suppose you run a second, independent test on the same patient (a different biomarker with LR = 25 for a second positive). You don't need to redo the full Bayes calculation from scratch. The posterior odds after the first test become the new prior odds, and you multiply by the second likelihood ratio.
After test 1 (LR₁ = 19.8): odds ≈ 1:5, probability ≈ 17%.
After test 2 (LR₂ = 25): odds = 25 × (1/5) = 5:1, probability ≈ 5/6 ≈ 83%.
Two positive tests on a rare disease go from 17% to 83% — still not certainty, but now you'd treat.
Spam filter as sequential Bayes: A Naive Bayes spam classifier does exactly this. Each word in an email is a "test." The prior P(spam) is the base rate of spam in your inbox. Each word w updates the odds by a likelihood ratio P(w|spam)/P(w|not spam). Words like "Viagra" have very high LR (much more common in spam); words like "meeting" have LR ≈ 1 (equally common in both). The model multiplies all these likelihood ratios together, which in log space is a sum — and that's why Naive Bayes classifiers produce a linear function of log word counts.
Bayes' rule is not just a medical-testing curiosity — it is the backbone of three major ML concepts that appear in nearly every interview at a top lab.
Naive Bayes classifiers. Given a document with words w1, …, wn, the classifier computes:
Training: count word frequencies per class. Inference: multiply the prior by each word's likelihood. Add-one (Laplace) smoothing prevents zero probabilities for unseen words. In log space the product becomes a sum — making Naive Bayes a linear classifier in log word counts.
MAP estimation. Maximum A Posteriori estimation picks the parameter θ that maximises the posterior:
Concrete example: logistic regression with L2 regularisation is MAP estimation under a Gaussian prior on the weights. The strength λ of regularisation equals 1/σ² where σ² is the prior variance. Stronger prior (smaller σ) → higher λ → more regularisation → weights shrunk harder toward zero.
Bayesian A/B testing. Instead of a p-value, compute P(variant B is better than variant A | observed data) directly. Start with a prior over conversion rates (e.g., Beta(1,1) for ignorance), observe conversions, update to a Beta posterior, and compute the probability that the posterior draw from B exceeds the draw from A. There is no fixed sample size, no peeking problem in the classical sense — you just read off posterior probabilities. Chapter 14 covers this in full; the Bayes machinery here is the foundation.
The error: swapping a conditional probability with its reverse. "The test is 99% accurate" (P(+|disease) = 0.99) is not the same as "99% chance you have the disease given a positive test" (P(disease|+)). These two quantities can be wildly different when the prior is lopsided.
In court: "The probability that an innocent person matches the DNA evidence is 1 in a million" does NOT mean "the probability this defendant is innocent is 1 in a million." If there are 10 million people in the city, you'd expect 10 innocent matches — the prior of innocence (nearly 1) has to be multiplied through.
In ML: precision ≠ recall. A model with 95% recall (P(predict positive | actually positive)) can still have low precision (P(actually positive | predict positive)) when the positive class is rare. Always check both, and draw the 2×2 confusion matrix.
"Walk me through Bayes' rule on a concrete example." — Don't just state the formula. Build the 2×2 table with 10,000 people, read off the answer, then derive it algebraically to show you understand both representations. State what the prior, likelihood, and posterior are in your specific problem.
"Why is regularisation equivalent to MAP?" — State that the log posterior = log likelihood + log prior. Show that L2 regularisation emerges from a Gaussian prior and L1 from a Laplace prior. Give λ = 1/σ² explicitly. This is a Staff-level question that separates people who memorised regularisation from those who understand it.
"Why does Naive Bayes work despite the independence assumption being false?" — The classification boundary only requires that the posterior ranking of classes is correct, not that the posterior probabilities are calibrated. Even with wrong probabilities, the argmax often picks the right class. Naive Bayes gives poorly calibrated probabilities but surprisingly good decisions.
- Bayes' rule = two forms of the joint probability set equal; posterior ∝ likelihood × prior.
- Base-rate neglect: low prevalence + non-zero false positive rate → most positives are false. Always build the table.
- Odds form: posterior odds = likelihood ratio × prior odds. LR is the key number — how much more probable is the evidence under H than under ¬H.
- Sequential updating: multiply likelihood ratios. Log odds form turns this into a sum — that's Naive Bayes and the intuition behind log-loss.
- MAP = MLE + log prior = regularised MLE. Gaussian prior → L2; Laplace prior → L1.
Trigger: you see a base rate (prevalence, prior), a sensitivity or true positive rate, and a false positive rate; you must find P(condition | observed evidence).
- Write N = 10,000.
- Diseased row: N × prevalence people; of those, sensitivity × that many test positive.
- Healthy row: N × (1 − prevalence) people; of those, false positive rate × that many test positive.
- Posterior = (true positives) / (true positives + false positives).
- State the answer and note: "the low posterior is driven by the low prior, not test quality."
Bonus: convert to odds form to verify: LR = sensitivity / false positive rate; posterior odds = LR × prior odds.
Never: quote the sensitivity as the answer (P(+|disease) ≠ P(disease|+)); never forget to state the prior.
Bayes' rule is rearranged conditional probability: posterior ∝ likelihood × prior. The 10,000-person table makes base-rate neglect impossible — build it whenever prevalence appears. The odds form (posterior odds = LR × prior odds) is the fast mental tool; sequential Bayes multiplies likelihood ratios and underlies Naive Bayes classifiers. MAP estimation IS regularised MLE: Gaussian prior → L2, Laplace → L1. Every "given a positive test" question, every spam filter, and every regularised model in ML is running Bayes' rule under the hood.
Q1. A disease has 0.5% prevalence. A test has 90% sensitivity and 8% false positive rate. You test positive. What is P(disease | +)?
Q2. What is the difference between P(A | B) and P(B | A)? Give an example where confusing them leads to a wrong answer.
Q3. Explain the odds form of Bayes' rule and why it is useful.
Q4. You run two independent tests on a patient. Test 1: LR = 15 (positive). Test 2: LR = 0.2 (negative). Prior prevalence is 2%. What is the posterior probability of disease?
Q5. How does MAP estimation relate to regularised maximum likelihood? Be concrete about L1 and L2.
Q6. Why is Naive Bayes called "naive," and why does it work anyway despite the violated assumption?
Q7. A spam filter sees the word "Viagra" in an email. The prior P(spam) = 0.3. P("Viagra"|spam) = 0.15, P("Viagra"|not spam) = 0.001. What is P(spam | "Viagra" appears)?
Q8. What happens to the posterior P(disease | +) as the false positive rate approaches 0? As it approaches 1?
Q9. An interviewer asks: "You detect an anomaly in our fraud system. The fraud rate is 0.05%, the detector has 95% sensitivity, and 1% FPR. Should you investigate this flag?" Walk through the reasoning.
Q10. Derive why, in the odds form of Bayes' rule, the marginal likelihood P(B) cancels out.
Q11. (Hard) A Bayesian A/B test observes 80 conversions out of 200 visitors for variant A, and 90 out of 200 for variant B. Priors are Beta(1,1) for both. What are the posteriors, and how do you compute P(B beats A)?
Q12. (Brutal) How does the likelihood ratio relate to log-loss in binary classification? Derive the connection.
Probability is about events; random variables let us assign numbers to those events so we can compute means, variances, and expectations. This chapter builds the three universal descriptors — PMF (discrete), PDF (continuous), and CDF (both) — and shows how to simulate any distribution in four lines of Python. These objects appear in every ML model: loss functions are expectations, activation outputs are random variables, sampling from a neural network posterior uses inverse-CDF.
Roll a fair six-sided die. The sample space is Ω = {1,2,3,4,5,6}. Define X(ω) = ω. That's it — X is a function from outcomes to real numbers. Nothing mysterious.
Now define Y = 1 if X is even, 0 otherwise. Y is a different random variable on the same experiment. One experiment, infinitely many possible RVs.
Toy example first. Let Z be the net gain from a game: you win \$2 with prob 1/4, win \$0 with prob 1/2, lose \$1 with prob 1/4.
| z | −1 | 0 | 2 |
|---|---|---|---|
| P(Z=z) | 1/4 | 1/2 | 1/4 |
That table is the PMF. Two rules it must satisfy:
Check: 1/4 + 1/2 + 1/4 = 1. ✓
For a continuous RV X, we cannot list P(X=x) — it is always zero. Instead, probability lives in intervals:
Two rules for a valid PDF:
Consider Uniform(0, 0.5): f(x) = 2 for x ∈ [0, 0.5]. The density is 2, which is > 1. That's fine — density is not probability. Probability is area: ∫₀^0.5 2 dx = 1. ✓
The density at a point tells you how concentrated probability is near that point, not the probability itself.
P(X = 0.5) = ∫₀.₅^0.5 f(x) dx = 0. This does NOT mean X=0.5 is impossible — it just means the probability of hitting exactly 0.5 (out of infinitely many real numbers) is zero. What matters is probability over intervals.
Both discrete and continuous RVs have a CDF:
Properties every CDF must have:
- Non-decreasing: if a ≤ b then F(a) ≤ F(b).
- Right-continuous: F(x) = lim_{t→x⁺} F(t).
- Limits: F(−∞) = 0, F(+∞) = 1.
Discrete CDF example. For the game above (Z ∈ {−1, 0, 2}):
| x | x < −1 | −1 ≤ x < 0 | 0 ≤ x < 2 | x ≥ 2 |
|---|---|---|---|---|
| F(x) | 0 | 1/4 | 3/4 | 1 |
The CDF jumps at each mass point; the jump size equals the PMF value there.
Going back: PDF = dF/dx (where it exists); PMF = jump size at each point.
The p-th quantile (or p-th percentile) of X is the smallest value q such that F(q) ≥ p:
Key quantiles to know: Q(0.5) = median; Q(0.25) = first quartile; Q(0.75) = third quartile. For a standard Normal, Q(0.975) ≈ 1.96 — the famous "1.96σ" CI boundary comes straight from this.
Given X with known distribution, define Y = g(X). What is Y's distribution?
CDF method (always works): F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y). Invert g, find the x-region, integrate f_X.
Jacobian formula (differentiable, monotone g):
Tiny example. X ~ Uniform(0,1); Y = −log X. Then g⁻¹(y) = e⁻ʸ and |d/dy e⁻ʸ| = e⁻ʸ. So f_Y(y) = 1 · e⁻ʸ = e⁻ʸ for y > 0 — that is the Exponential(1) density. This is how you transform a Uniform into an Exponential.
The inverse-transform method: if U ~ Uniform(0,1) and F is any CDF, then X = F⁻¹(U) has distribution F. This is the foundation of all basic random-number generation.
import numpy as np
# Sample from Exponential(lambda=2) using inverse-CDF
rng = np.random.default_rng(42)
U = rng.uniform(0, 1, size=10_000) # Step 1: uniform samples
lam = 2.0
X = -np.log(U) / lam # Step 2: apply F_inv(u) = -log(u)/lambda
# X now has distribution Exponential(2)
print(X.mean(), X.var()) # Should be ~0.5, ~0.25
Why it works: P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x) since U is Uniform(0,1). The CDF of F⁻¹(U) is exactly F. ✓
Trigger: "How would you draw samples from [some custom distribution]?"
- Write down the CDF F(x).
- Invert it analytically: solve F(x) = u for x.
- Draw u ~ Uniform(0,1), return x = F⁻¹(u).
- If F⁻¹ has no closed form: mention rejection sampling or MCMC as fallbacks.
Never: skip the inversion step — interviewers want to see you know why this works (the Uniform pushforward argument), not just the recipe.
"Can a PDF value be greater than 1?" — Yes. Density is not probability. Show Uniform(0, 0.1) where f(x) = 10.
"What's the difference between a PMF and a PDF?" — PMF gives actual probabilities at points (sums to 1); PDF gives density (integrates to 1, point probabilities are zero).
"Why is the CDF called the 'universal object'?" — Because it exists for every RV (discrete, continuous, mixed) and you can recover the PMF or PDF from it.
- RV = function from Ω to ℝ. PMF sums to 1; PDF integrates to 1.
- CDF F(x) = P(X ≤ x) works for all RV types; non-decreasing, right-continuous.
- Density CAN exceed 1 — only areas (integrals) are probabilities.
- Inverse-CDF sampling: draw U ~ Uniform(0,1), return F⁻¹(U) → has distribution F.
Q1. Define a random variable formally. What distinguishes it from a random outcome?
Q2. A PDF f(x) = c·e^{−2x} for x ≥ 0, 0 otherwise. Find c and P(X > 1).
Q3. Explain why P(X = x) = 0 for a continuous RV but individual outcomes still "happen".
Q4. What four properties must a CDF satisfy? Why does it have to be right-continuous?
Q5. You want to sample from a distribution with CDF F(x) = 1 − e^{−x²} for x ≥ 0. Give the algorithm.
Q6. What is the 97.5th percentile of the standard Normal, and why does it appear in confidence intervals?
Q7. If X has CDF F, what is the distribution of Y = F(X)?
Q8. What is the Jacobian formula for transforming a density, and when does it fail?
Q9. A random variable has PMF p(k) = (1−p)^{k−1}p for k = 1,2,3,… Verify it's valid and compute P(X > 3).
Q10. In code you see `np.random.exponential(scale=2)`. What is the inverse-CDF formula being used internally?
A random variable maps outcomes to numbers. Discrete RVs have PMFs (probability at points); continuous RVs have PDFs (density, not probability — can exceed 1). The CDF F(x) = P(X ≤ x) is the universal descriptor for all types. Quantiles invert the CDF. Simulate any distribution by drawing U ~ Uniform(0,1) and applying F⁻¹(U).
Named distributions are not a menagerie to memorize — they form a family tree where each member arises from a simple story. This chapter traces the genealogy from Bernoulli to Poisson to Exponential to Gaussian, gives you the one-line story, mean/variance, and the interview tell for each, and closes with a master comparison table tying every distribution to its ML cameo.
Bernoulli(p) — Story: one coin flip with heads probability p.
Mean = p. Var = p(1−p). ML cameo: the output of a logistic classifier; every binary cross-entropy loss assumes Bernoulli noise.
Binomial(n, p) — Story: count the number of heads in n independent coin flips.
Mean = np. Var = np(1−p). Note: Binomial(1,p) = Bernoulli(p). Sum of independent Bernoullis is Binomial.
Poisson(λ) — Story: count rare events in a fixed time window (λ = expected count).
Mean = λ. Var = λ. The hallmark: mean equals variance. ML cameo: click counts, word counts in text (bag-of-words), arrival queues.
Trigger: "We have n trials, each succeeding with small probability p, and we care about the count."
If n is large and p is small so that λ = np stays moderate, Binomial(n,p) ≈ Poisson(λ). Rule of thumb: use Poisson when n ≥ 20 and p ≤ 0.05.
Never: use Poisson when p is not small — Binomial is exact, Poisson is an approximation.
Geometric(p) — Story: number of trials until the first success (inclusive).
Mean = 1/p. Var = (1−p)/p². Caution: some textbooks define Geometric as the number of failures before success (starting at 0) — always check the support.
Memoryless property of the Geometric: P(X > m+k | X > m) = P(X > k). Past failures give no information about future success. The Geometric is the only discrete memoryless distribution.
A bus arrives with probability p = 0.2 each minute. You just waited 5 minutes — what's the probability you wait at least 3 more minutes?
By memorylessness: P(X > 5+3 | X > 5) = P(X > 3) = (1−0.2)³ = 0.8³ = 0.512.
The 5 minutes already waited are completely irrelevant. The distribution has no memory of them.
Negative Binomial(r, p) — Story: number of trials until the r-th success. Geometric is the special case r=1.
Mean = r/p. Var = r(1−p)/p². ML cameo: over-dispersed count data where Var > Mean (unlike Poisson where they're equal); used for word frequency models in NLP.
Uniform(a, b) — Story: every value in [a,b] equally likely.
Mean = (a+b)/2. Var = (b−a)²/12. Foundation of inverse-CDF sampling; Uniform(0,1) is the source of all pseudo-random numbers.
Exponential(λ) — Story: waiting time until first event in a Poisson process with rate λ.
Mean = 1/λ. Var = 1/λ². The only continuous memoryless distribution. CDF: F(x) = 1 − e^{−λx}. ML cameo: inter-arrival times in queuing models; lifetime distributions in reliability; the base distribution in many Bayesian priors.
Poisson process plain words: Events arrive randomly over time such that (1) the number of arrivals in any interval depends only on its length, not location; (2) arrivals in disjoint intervals are independent; (3) at most one arrival per infinitesimal instant. Then: counts are Poisson(λt); inter-arrival times are Exponential(λ).
Gamma(α, β) — Story: sum of α independent Exponential(β) waiting times.
Mean = α/β. Var = α/β². Special cases: Gamma(1, β) = Exponential(β); Gamma(n/2, 1/2) = Chi-squared(n). ML cameo: conjugate prior for the Poisson rate and Gaussian precision.
Normal(μ, σ²) — Story: the limiting distribution of sums of many small independent effects (Central Limit Theorem).
Mean = μ. Var = σ². Standard Normal: μ=0, σ=1, written Z ~ N(0,1). Any Normal can be standardized: Z = (X−μ)/σ.
ML cameos everywhere: linear regression assumes Gaussian noise (MLE → OLS); neural network weights often initialized as Normal; CLT justifies using Normal approximations for sums/averages; the Gaussian kernel in SVMs and GPs; the reparameterization trick in VAEs samples X = μ + σ·ε with ε ~ N(0,1).
The Normal distribution assumes: symmetric, unimodal, exponential tail decay. It's wrong for count data (use Poisson/NegBin), bounded data (use Beta/Dirichlet), heavy-tailed data (use Student-t or Cauchy), and strictly positive data (use LogNormal or Gamma). The CLT gives Gaussians as limits for averages — not for individual observations.
Log-Normal(μ, σ²) — If log X ~ Normal(μ, σ²), then X ~ LogNormal. Story: product of many small positive multiplicative effects. Mean = e^{μ+σ²/2}. ML cameo: income distributions, stock prices, file sizes — any quantity that is bounded below by 0 and right-skewed.
Beta(α, β) — Story: a probability on [0,1] with α−1 pseudo-successes and β−1 pseudo-failures as prior evidence.
ML cameo: conjugate prior for a Bernoulli/Binomial proportion; Bayesian A/B testing; Thompson sampling for bandits.
| Distribution | Story (one line) | Params | Mean | Var | ML cameo |
|---|---|---|---|---|---|
| Bernoulli(p) | One binary trial | p ∈ (0,1) | p | p(1−p) | Logistic output, binary CE loss |
| Binomial(n,p) | Count successes in n trials | n,p | np | np(1−p) | Proportion estimates, exact binomial CI |
| Poisson(λ) | Count rare events in fixed window | λ>0 | λ | λ | Word counts (NLP), click events |
| Geometric(p) | Trials to first success | p ∈ (0,1) | 1/p | (1−p)/p² | Length of sequence until event |
| NegBin(r,p) | Trials to r-th success | r, p | r/p | r(1−p)/p² | Over-dispersed counts in NLP |
| Uniform(a,b) | All values equally likely | a < b | (a+b)/2 | (b−a)²/12 | Inverse-CDF sampling, random splits |
| Exponential(λ) | Waiting time to first event | λ>0 | 1/λ | 1/λ² | Inter-arrival times, survival models |
| Gamma(α,β) | Sum of α exponential waits | α,β>0 | α/β | α/β² | Prior for Poisson rate / Gaussian precision |
| Normal(μ,σ²) | CLT limit of many small effects | μ, σ²>0 | μ | σ² | Linear regression noise, weight init, VAE |
| LogNormal(μ,σ²) | Exp of Gaussian; product of effects | μ, σ² | e^{μ+σ²/2} | (e^{σ²}−1)e^{2μ+σ²} | Income, file sizes, latency distributions |
| Beta(α,β) | Prior probability on [0,1] | α,β>0 | α/(α+β) | αβ/((α+β)²(α+β+1)) | Bayesian A/B, Thompson sampling |
| Categorical(p) | One draw from k classes | p₁+…+pₖ=1 | — | — | Softmax output, language model tokens |
Trigger: Interviewer describes a scenario and asks you to model it.
- "Binary outcome, one trial" → Bernoulli
- "Count of successes in n fixed trials" → Binomial
- "Count of rare events in a fixed interval" → Poisson
- "Waiting time / time between events" → Exponential
- "Waiting for k-th event" → Gamma
- "Many small additive effects, averages, measurement noise" → Gaussian
- "A proportion or probability (value between 0 and 1)" → Beta
- "Strictly positive, right-skewed, products of factors" → LogNormal
- "Counts with Var > Mean (over-dispersed)" → Negative Binomial
Never: default to Gaussian without checking whether the quantity is bounded, discrete, or heavy-tailed. Always state the assumption you're making and why.
"What's the difference between Poisson and Binomial?" — Binomial has fixed n trials; Poisson is the n→∞, p→0 limit where np=λ stays constant. Poisson is used when n is unknown or very large and p is small. The telltale: Poisson has mean = variance; Binomial has variance = mean·(1−p) < mean.
"Prove the Geometric is memoryless." — P(X > m+k | X > m) = P(X > m+k)/P(X > m) = (1−p)^{m+k}/(1−p)^m = (1−p)^k = P(X > k).
- Bernoulli → Binomial (sum n), Binomial → Poisson (rare limit), Geometric → NegBin (r successes).
- Exponential is the continuous memoryless distribution; Geometric is the discrete memoryless distribution.
- Poisson: mean = variance = λ. This is the diagnostic — over-dispersed data (Var > Mean) needs NegBin.
- Normal is the CLT limit — for averages of many independent summands, not necessarily individual observations.
Q1. A website gets 120 visits/hour on average. What's the probability of exactly 100 visits in one hour?
Q2. Why does the Exponential distribution have the memoryless property, and what does it imply for modeling?
Q3. Your count data has mean 5 and variance 20. Which distribution do you choose and why?
Q4. What is the relationship between the Gamma and Chi-squared distributions?
Q5. Explain why the LogNormal is appropriate for latency distributions in distributed systems.
Q6. A coin has unknown bias p. You observe 7 heads in 10 flips. Write the likelihood L(p) and find the MLE.
Q7. What is the Poisson process and what are its three defining properties?
Q8. The Beta(1,1) distribution — what is it, and why is it the "uninformative prior" for a coin's bias?
Q9. Two models predict daily sales: Model A uses Poisson(50), Model B uses Normal(50, 50). Compare them — which is more appropriate and when?
Q10. What is the coefficient of variation (CV) and which distributions have a constant CV?
Distributions form a family tree: Bernoulli → Binomial → Poisson (rare-event limit); Geometric → Negative Binomial; Uniform → Exponential (memoryless waiting) → Gamma; all paths lead to Gaussian via CLT. Each has a one-line story; match the story to the data type (discrete/continuous, bounded/unbounded, Var vs Mean) to choose correctly.
Expectation is the probabilistic center of mass, and variance measures how spread out the mass is. This chapter builds both from scratch, then reveals the indicator method — a single trick that dissolves most "expected count of X" problems in seconds. Along the way: LOTUS lets you compute $E[g(X)]$ without finding the distribution of $g(X)$, and Jensen's inequality tells you exactly which direction a nonlinear transformation biases an expectation — a fact that quietly underpins the ELBO in variational inference.
If you run an experiment many times and average the outcomes, the number you converge to is the expectation. Formally, for a discrete random variable $X$:
Tiny example. A fair die: values 1–6, each with probability $\tfrac{1}{6}$.
For a continuous RV, the sum becomes an integral: $E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$, where $f$ is the PDF.
Center of mass picture. Imagine the probability mass function as a physical bar chart. The expectation is the balance point. Shifting all mass right shifts $E[X]$ right by the same amount — that's linearity in action.
For ANY random variables $X$ and $Y$ (dependent or not) and constants $a, b$:
Trigger: you need $E[\text{sum of many things}]$ and the things are tangled together.
- Write the sum: $X = X_1 + X_2 + \cdots + X_n$.
- Apply linearity: $E[X] = E[X_1] + E[X_2] + \cdots + E[X_n]$.
- Compute each $E[X_i]$ individually — often trivial.
Never: try to find the joint distribution of all $X_i$ to compute the expectation. That's unnecessary and usually intractable. Linearity sidesteps joint distributions entirely.
Linearity $E[X+Y] = E[X] + E[Y]$ holds for ALL pairs $(X,Y)$. No independence. No uncorrelatedness. Always. This is not obvious — it's a theorem. The proof: $E[X+Y] = \sum_{x,y}(x+y)P(X{=}x,Y{=}y) = \sum_x x P(X{=}x) + \sum_y y P(Y{=}y)$. Done. Dependence enters only for $\text{Var}(X+Y)$, not for $E[X+Y]$.
An indicator random variable $I_A$ for event $A$ is the simplest possible RV: it equals 1 if $A$ occurs, 0 otherwise. Its expectation is just the probability of the event:
The strategy: write the count you want as $X = I_1 + I_2 + \cdots + I_n$, then use linearity to get $E[X] = \sum_i P(A_i)$. The $I_i$ are almost always dependent — it doesn't matter.
Trigger: "how many X satisfy some property?" or "how many pairs/matches/collisions?"
- Define one indicator per possible object: $I_i = 1$ if object $i$ has the property.
- Count = $X = \sum_i I_i$.
- $E[X] = \sum_i P(I_i = 1)$ by linearity.
- Each $P(I_i = 1)$ is usually easy — just the probability for a single object.
Never: enumerate outcomes directly. For $n$ objects this is $O(2^n)$ case work; the indicator method is $O(n)$ thought.
Classic 1: Expected number of fixed points of a random permutation.
Setup. Shuffle $n$ cards labeled 1 through $n$. A fixed point is a card in its original position. What is $E[\text{fixed points}]$?
Without indicators. You would need to enumerate all $n!$ permutations, count fixed points in each, and average. Hopeless for large $n$.
With indicators. Let $I_i = 1$ if card $i$ is in position $i$. Then fixed points $= \sum_{i=1}^n I_i$. By linearity:
The indicators $I_1, \ldots, I_n$ are heavily dependent (if $n-1$ cards are fixed, the last must be too), yet linearity gives us the answer in two lines.
Classic 2: Expected number of coupons to complete the collection.
Setup. There are $n$ distinct coupon types, each draw uniformly at random. How many draws $T$ until you have at least one of each type? (The coupon collector problem.)
Decompose the waiting time. Let $T_k$ = draws to go from $k-1$ distinct types to $k$ distinct types. Then $T = T_1 + T_2 + \cdots + T_n$. At stage $k$ you already have $k-1$ types, so any new draw is a success with probability $\frac{n-(k-1)}{n} = \frac{n-k+1}{n}$. The number of draws at stage $k$ is Geometric with success probability $p_k = \frac{n-k+1}{n}$, so $E[T_k] = \frac{n}{n-k+1}$. By linearity:
This is used in ML for estimating how many examples you need before seeing every class at least once — relevant to rare-category sampling and dataset coverage analysis.
Classic 3: Expected number of birthday pairs.
Setup. $n$ people, each born on one of 365 days uniformly. How many pairs share a birthday?
With indicators. There are $\binom{n}{2}$ pairs. For pair $(i,j)$ let $I_{ij} = 1$ if they share a birthday. Then:
This is consistent with the famous fact that 23 people gives ~50% chance of at least one shared birthday — there's roughly $0.69$ expected pairs, so one pair is quite likely.
Variance measures the average squared deviation from the mean:
The standard deviation $\sigma = \sqrt{\text{Var}(X)}$ restores the original units.
Covariance measures how two RVs move together:
Variance of a sum — the full formula:
Numeric 2-variable example. Let $X$ be the number shown on a fair coin mapped to $\{0,1\}$ (heads=1, tails=0), and $Y = 1 - X$ (the opposite). Then $E[X]=E[Y]=0.5$, $\text{Var}(X)=\text{Var}(Y)=0.25$.
$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0.25 = -0.25$ (since $XY=0$ always — they can't both be 1).
General formula for $n$ variables:
$E[X+Y] = E[X] + E[Y]$ always. But $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ only when $\text{Cov}(X,Y)=0$. Forgetting the covariance term when variables are dependent is a common interview error. The safe habit: always write the full formula and then check whether covariance terms are zero.
The Law of the Unconscious Statistician (LOTUS) says: to find the expected value of a function $g(X)$, you don't need the distribution of $g(X)$ — just plug into the original distribution of $X$.
Example. $X \sim \text{Uniform}\{1,2,3,4,5,6\}$ (fair die). What is $E[X^2]$?
LOTUS: $E[X^2] = \sum_{x=1}^6 x^2 \cdot \tfrac{1}{6} = \tfrac{1+4+9+16+25+36}{6} = \tfrac{91}{6} \approx 15.17$. Then $\text{Var}(X) = E[X^2] - (E[X])^2 = \tfrac{91}{6} - 3.5^2 = \tfrac{91}{6} - \tfrac{49}{4} = \tfrac{182-147}{12} = \tfrac{35}{12} \approx 2.92$.
LOTUS is used constantly: computing variance (via $E[X^2]$), moments, moment-generating functions, and the KL divergence $E[\log p(X)/q(X)]$ — all applications of LOTUS.
For convex f: E[f(X)] ≥ f(E[X]); for concave f the inequality flips. Picture: a chord above a convex curve — averaging the outputs lands you on the chord, above the function at the average input.
Where it bites in ML: (1) log-likelihood of an average ≠ average log-likelihood — why ensembling probabilities then logging beats averaging log-probs; (2) the ELBO: log p(x) = log E_q[p(x,z)/q(z)] ≥ E_q[log p(x,z)/q(z)] — variational inference IS one application of Jensen; (3) E[1/X] ≥ 1/E[X] — why averaging rates and averaging times disagree (harmonic-mean traps in throughput math).
Linearity of expectation needs no independence — that's the superpower. Any "expected number of…" question becomes: define an indicator per potential event, compute each P(event), add. Variance is where dependence re-enters (the covariance cross-terms), LOTUS computes E[g(X)] without finding g(X)'s distribution, and Jensen warns you that nonlinear functions and expectations don't commute — with log-likelihoods as the ML cash-out.
Q1. 100 people, random hat returns. Expected number who get their own hat — and why is independence irrelevant?
Q2. Coupon collector: expected boxes to collect all 50 coupons?
Q3. In n=23 people, expected number of PAIRS sharing a birthday?
Q4. Var(X+Y) = 25, Var(X) = 9, Var(Y) = 4. What's Cov(X,Y) and corr?
Q5. Why is E[1/X] ≠ 1/E[X], and where does this bite in systems work?
Q6. Expected number of runs in 10 fair coin flips (maximal blocks of same face)?
Q7. LOTUS in one line: compute E[X²] for a die without finding the distribution of X².
Q8. A portfolio holds 3 assets each with variance 1 and pairwise correlation ρ. Variance of the equal-weight average — and the diversification limit?
Q9. Why does maximizing log-likelihood instead of likelihood change nothing — and what does Jensen warn about averaging them?
Q10. E[XY] = E[X]E[Y] — does this imply independence? Prove your answer with a 3-point example.
Two random variables together contain far more information than their individual distributions alone. This chapter builds the machinery of joint distributions from scratch—computing marginals and conditionals by hand from a 4-cell table, unpacking the subtle differences among covariance, correlation, and independence, and then proving why conditional expectation is the mathematically optimal predictor. The tower rule and law of total variance let you decompose hard expectations into manageable pieces; the bivariate Gaussian shows where all of this lands geometrically. This machinery underlies linear regression, mixture models, and latent-variable methods throughout ML.
A joint distribution over two random variables X and Y assigns a probability to every pair of values (x, y). For discrete RVs the joint PMF is:
All joint probabilities must sum to 1 over every pair of values. From the joint you can always recover the individual (marginal) distributions by summing out the other variable:
Let X = "first coin flip" and Y = "second coin flip," both fair. Each takes values in {0, 1} (0 = tails, 1 = heads). The flips are independent, so the joint probability of every pair is 1/4.
| Y = 0 | Y = 1 | P(X = x) marginal | |
|---|---|---|---|
| X = 0 | 1/4 | 1/4 | 1/2 |
| X = 1 | 1/4 | 1/4 | 1/2 |
| P(Y = y) marginal | 1/2 | 1/2 | 1 ✓ |
Now make it more interesting. Suppose X ∈ {0, 1} with P(X=0) = 0.6, P(X=1) = 0.4, and given X the conditional distribution of Y is:
- P(Y=1 | X=0) = 0.3 (so P(Y=0 | X=0) = 0.7)
- P(Y=1 | X=1) = 0.8 (so P(Y=0 | X=1) = 0.2)
The joint table (each cell = P(X=x) × P(Y=y|X=x), verified to sum to 1):
| Y = 0 | Y = 1 | Marginal P(X=x) | |
|---|---|---|---|
| X = 0 | 0.6 × 0.7 = 0.42 | 0.6 × 0.3 = 0.18 | 0.60 |
| X = 1 | 0.4 × 0.2 = 0.08 | 0.4 × 0.8 = 0.32 | 0.40 |
| Marginal P(Y=y) | 0.50 | 0.50 | 1.00 ✓ |
Arithmetic check: 0.42 + 0.18 + 0.08 + 0.32 = 1.00 ✓. Marginal P(Y=1) = 0.18 + 0.32 = 0.50 ✓.
Now recover a conditional from the joint. What is P(X=1 | Y=1)?
Interpretation: even though only 40% of the population has X=1, among those with Y=1 the fraction rises to 64%—because X=1 makes Y=1 much more likely.
Covariance measures how two RVs move together. Positive = they tend to rise together; negative = one rises as the other falls; zero = no linear tendency.
Correlation is covariance scaled to [−1, 1] by dividing by both standard deviations:
Correlation measures linear association only. The classic counterexample: let X ~ Uniform(−1, 1) and Y = X².
- E[X] = 0 by symmetry of the Uniform.
- E[XY] = E[X · X²] = E[X³] = 0 by symmetry (odd function integrated over symmetric interval).
- Therefore Cov(X, Y) = E[XY] − E[X]E[Y] = 0 − 0·E[Y] = 0.
- Correlation ρ(X, X²) = 0.
- Yet Y = X² is a deterministic function of X — as far from independent as possible.
Independence requires all moments and all events to be uncoupled, not just the first-order linear relationship. Formally: X ⊥ Y iff P(X ∈ A, Y ∈ B) = P(X ∈ A)·P(Y ∈ B) for every measurable A, B.
The hierarchy: Independence ⟹ Uncorrelated. Uncorrelated does NOT ⟹ Independent. (Exception: jointly Gaussian RVs, where uncorrelated does imply independent.)
Trigger: "Are X and Y independent if their correlation is zero?"
- Answer: Not necessarily. Correlation only captures linear dependence.
- Give the X, X² counterexample immediately: Uniform(−1,1), Y = X², Cov = 0, but Y is determined by X.
- State the exception: for jointly Gaussian RVs, ρ = 0 does imply independence.
Never: Confuse "uncorrelated" with "independent" — this is one of the most common mistakes in ML interviews.
Joint distributions are where probability becomes multivariate reality: marginals come from summing rows, conditionals from shrinking to a row and renormalizing, covariance measures linear co-movement (and only linear — corr 0 ≠ independent, the X, X² example), conditional expectation E[Y|X] is the best predictor of Y from X (regression's birthplace), and the tower rule E[E[Y|X]] = E[Y] plus the law of total variance let you compute by conditioning on what you wish you knew. For jointly Gaussian variables everything linearizes — which is exactly why Gaussians are the workhorse.
Q1. Give a concrete pair of dependent but uncorrelated random variables, and explain in one line why correlation misses it.
Q2. Two coins: fair (p=0.5) and biased (p=0.9), picked with probability ½ each, flipped 10 times. Use the tower rule and total variance for the count's mean and variance.
Q3. Why is E[Y|X] called the best predictor — best in what sense, and what's the ML cameo?
Q4. For a bivariate Gaussian with correlation ρ, what is E[Y | X = x], and what does it explain about "regression to the mean"?
Q5. Var(X+Y) for correlated X, Y — formula and a portfolio one-liner.
Q6. From a 2×2 joint table with P(A=1,B=1)=0.2, P(A=1)=0.4, P(B=1)=0.5 — independent? Compute the check.
Q7. Why does the law of total variance guarantee Var(Y) ≥ Var(E[Y|X]), and what does the gap mean in an ML eval context?
Q8. Covariance is bilinear: expand Cov(2X + 3, Y − 1) and state the two rules used.
Q9. You observe corr(ice cream sales, drownings) = 0.7. Walk the three causal readings and the variable that resolves it.
Q10. In a recommender, user-level CTRs are estimated per segment. Why does the law of total variance argue for hierarchical/shrunk estimates?
This chapter is the bridge between a single random variable and the behaviour of many. We start with three concentration inequalities — Markov, Chebyshev, and Hoeffding — each bounding the probability of a large deviation, each sharper than the last. Then we meet the Law of Large Numbers (convergence of averages) and the Central Limit Theorem (shape of that convergence). Every confidence interval, every batch gradient, every error bar in an ML paper lives here.
A concentration inequality answers: "How likely is a random variable to be far from its mean?" The three canonical tools trade assumption strength for tightness. We will carry one running example all the way through so the tightening is concrete.
Roll a single fair die once. Let X be the outcome. Mean: $\mu = E[X] = 3.5$. Variance: $\sigma^2 = \text{Var}(X) = \tfrac{35}{12} \approx 2.917$. Standard deviation: $\sigma \approx 1.708$.
Question we will bound three ways: $P(X \ge 6)$. Exact answer (for calibration): $\tfrac{1}{6} \approx 0.167$.
Markov needs only one thing: X is non-negative. Nothing about variance or distribution shape.
Running example. $a = 6$, $E[X] = 3.5$:
Why so loose? Markov knows nothing about spread. A distribution could put all its mass at $a - \varepsilon$ and still satisfy $E[X] = 3.5$. Knowing the variance fixes that.
Apply Markov to the non-negative variable $(X - \mu)^2$. The result bounds the probability of deviating from the mean by more than $k$ standard deviations.
Running example. $X \ge 6$ means $X - 3.5 \ge 2.5$, i.e., deviation $\ge 2.5$. With $\sigma \approx 1.708$: $k = 2.5 / 1.708 \approx 1.464$.
The $1/k^2$ decay is slow. Chebyshev is tight only for pathological distributions (two-point). For bounded random variables we can do exponentially better.
When $X$ is bounded in $[a, b]$ (or more generally when we have a sum of independent bounded terms), Hoeffding gives an exponentially decaying bound — the real workhorse in PAC learning, bandit theory, and generalisation bounds.
Running example, single die ($n=1$). $X \in [1, 6]$, range $= 5$, $t = 6 - 3.5 = 2.5$.
With $n=100$ dice, $t = 4 - 3.5 = 0.5$, range $= 5$:
Chebyshev: $P(\bar{X} - 3.5 \ge 0.5) \le P(|\bar{X}-3.5| \ge 0.5)$. $\text{Var}(\bar{X}) = \sigma^2/n \approx 2.917/100 = 0.02917$, $\sigma_{\bar X} \approx 0.1708$. $k = 0.5/0.1708 \approx 2.93$, bound $\le 1/k^2 \approx 0.117$.
Hoeffding: $\exp(-2 \cdot 100^2 \cdot 0.25 / (100 \cdot 25)) = \exp(-2) \approx 0.135$.
Here Chebyshev actually wins numerically, but Hoeffding's bound on the exact tail $P(\bar X - \mu \ge t)$ (one-sided) is $e^{-2} \approx 0.135$. At $t = 1$ (mean ≥ 4.5), Hoeffding gives $e^{-8} \approx 0.00034$ vs Chebyshev's $0.029$ — the exponential win is clear for larger deviations.
Informally: "the sample mean converges to the true mean as $n \to \infty$." Formally, two flavours.
Sum many independent, small-influence random effects and the sum's distribution approaches a Gaussian — regardless of the ingredients' shape. Formally: for i.i.d. X₁…Xₙ with mean μ and variance σ²,
Worked example: roll a fair die 100 times; P(sum > 380)? Per roll: μ = 3.5, σ² = 35/12 ≈ 2.917. Sum: mean 350, variance 291.7, SD ≈ 17.08. z = (380.5 − 350)/17.08 ≈ 1.79 (with continuity correction) → P ≈ 1 − Φ(1.79) ≈ 3.7%. A die is flat as can be — yet 100 of them produce a clean bell.
The CLT is about the sample mean's distribution, not the data's. Your data stays skewed forever; it's the AVERAGE of n draws whose distribution normalizes. Also, "n ≥ 30" is folklore, not law: heavy skew needs more, near-symmetric needs fewer, and infinite-variance distributions (Cauchy) never comply.
- Mini-batch gradients are sample means — approximately Gaussian noise around the true gradient, with variance ∝ 1/batch size. Batch size is a noise dial; LR scaling rules are downstream of this fact.
- Eval error bars: accuracy on n examples carries SE = √(p(1−p)/n) ≈ 0.5/√n at worst — on 1,000 examples that's ±1.5%, so a 0.8% "improvement" is noise. Always report n with the metric.
- A/B tests stand entirely on CLT-normality of metric means — chapters 12–13 cash this in.
- Estimate = sample mean x̄. Uncertainty: SE = s/√n.
- Quote x̄ ± 1.96·SE for 95% — licensed by the CLT once n is moderately large.
- Sanity-check the regime: heavy tails or tiny n → say "CLT is shaky here, I'd bootstrap or use t."
- If asked "how many samples for ±ε?": n = (1.96·σ/ε)² — precision costs quadratically.
The inequality ladder — Markov (mean only, crude) → Chebyshev (adds variance) → Chernoff/Hoeffding (adds independence, exponential) — buys guarantees with assumptions. LLN says sample means converge to the truth; CLT says HOW they fluctuate on the way: Gaussianly, with SD σ/√n. That √n is the master constant of applied statistics — it prices every error bar, every A/B test, and every "how much data do we need."
Q1. A model's accuracy is "73.4% on the test set." What single follow-up exposes whether that number means anything?
Q2. Use Markov, then Chebyshev, on: mean latency 100ms, SD 20ms — bound P(latency > 200ms).
Q3. Why does the gambler's fallacy NOT follow from the law of large numbers?
Q4. Hoeffding gives P(|x̄−μ| > ε) ≤ 2exp(−2nε²) for bounded variables. Invert it into a sample-size statement.
Q5. Batch size 256 → 4096. What happens to gradient noise, and why doesn't training just get 16× better?
Q6. P(sum of 100 die rolls > 380) — show the full pipeline.
Q7. Your latency data is heavily right-skewed. Is the CI on the MEAN from ±1.96·s/√n valid? And is the mean even the right target?
Q8. Why do Cauchy-distributed errors break both the LLN-in-practice and the CLT?
Q9. Two estimators are unbiased; one has variance 4/n, the other 1/n. Quantify "how much more data" the worse one needs.
Q10. Interviewer: "Give me one sentence on why the Gaussian is everywhere."
Statistics flips the probability arrow: given data you observed, what do you infer about the parameters that generated it? This chapter builds the vocabulary for judging any estimator (bias, variance, MSE, consistency), derives MLE from first principles for two workhorses — Bernoulli and Gaussian — and then shows how adding a prior converts MLE into MAP, which turns out to be exactly regularized regression in disguise.
An estimator θ̂ is any function of the data X₁,…,Xₙ. The key word is function — before data arrives, θ̂ is itself a random variable (different samples → different estimates). That randomness is what bias and variance measure.
Example: you flip a coin 10 times and see 7 heads. One natural estimator of the bias p is θ̂ = 7/10 = 0.7. Another is θ̂ = (7+1)/(10+2) ≈ 0.667 (Laplace smoothing). Which is better? That depends on the loss function and the true p — which is what bias/variance/MSE formalize.
Derivation in 3 lines: Let $\mu = E[\hat\theta]$. Then $(\hat\theta - \theta) = (\hat\theta - \mu) + (\mu - \theta)$. Square and take expectation: the cross term vanishes (why? $E[\hat\theta - \mu]=0$), leaving $E[(\hat\theta-\mu)^2] + (\mu-\theta)^2 = \text{Var}(\hat\theta) + \text{Bias}^2$.
The sample variance $S^2 = \frac{1}{n-1}\sum(X_i - \bar X)^2$ uses $n-1$ (Bessel's correction). Why not $n$?
Divide by $n$: $\hat\sigma^2_n = \frac{1}{n}\sum(X_i - \bar X)^2$. This estimator is biased: $E[\hat\sigma^2_n] = \frac{n-1}{n}\sigma^2 < \sigma^2$.
Intuition: you compute deviations from $\bar X$, the sample mean — not from the true $\mu$. The sample mean is "too close" to its own data because it is computed from those same points. This absorbs some of the variance, making the estimate systematically small. Dividing by $n-1$ corrects exactly for this one degree of freedom consumed by estimating the mean.
When it matters: for large $n$ the factor $(n-1)/n \approx 1$ and it barely matters. For $n=5$ it shrinks your estimate by 20% — that is a real bias.
Maximum Likelihood Estimation answers: which parameter value makes the data I actually observed most probable?
- Write the likelihood $L(\theta) = \prod_{i=1}^n p(x_i;\theta)$ (assuming i.i.d.).
- Take the log: $\ell(\theta) = \sum_{i=1}^n \log p(x_i;\theta)$. Products become sums; log is monotone so the argmax is the same.
- Differentiate $\frac{d\ell}{d\theta} = 0$ (or gradient = 0 for vector $\theta$).
- Solve for $\theta$. Check the second derivative is negative (concavity = maximum, not minimum).
Never: skip the log step — the product form is nearly always intractable to differentiate directly.
Data: $n$ independent coin flips; $k$ heads. Unknown: bias $p \in [0,1]$.
Step 1 — Likelihood:
Step 2 — Log-likelihood:
Step 3 — Differentiate:
Step 4 — Solve: $k(1-p) = (n-k)p \Rightarrow k = np \Rightarrow \hat p_{\text{MLE}} = k/n$.
Sanity check: 7 heads in 10 flips → $\hat p = 0.7$. This is just the empirical frequency. MLE formalizes the obvious choice, which is reassuring.
Data: $x_1,\dots,x_n \sim \mathcal{N}(\mu, \sigma^2)$, $\sigma^2$ known. Estimate $\mu$.
Step 1 — Likelihood:
Step 2 — Log-likelihood:
Steps 3–4 — Differentiate and solve:
Key insight: "maximize Gaussian likelihood" and "minimize sum of squared residuals" are the same problem. This is how Gaussian noise assumptions sneak into every regression.
The Fisher information measures how sharply the likelihood peaks around the true $\theta$ — how much information a single observation carries about $\theta$.
Plain words: if observing $X$ barely changes how you think about $\theta$ (flat likelihood), information is low. If the likelihood is sharply peaked, every observation is highly informative.
MLE ignores everything you knew before seeing the data. MAP (Maximum A Posteriori) incorporates a prior belief $p(\theta)$ and maximizes the posterior:
Suppose you believe the coin is roughly fair before seeing any data. Encode this as a Beta(α, β) prior on $p$.
Prior: $p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. With $\alpha=\beta=2$: mild preference for fairness.
Data: $k$ heads in $n$ flips. Likelihood: $\theta^k(1-\theta)^{n-k}$.
Posterior (by Bayes):
MAP estimate: the mode of Beta$(k+\alpha, n-k+\beta)$ is:
Numeric example: $n=10$, $k=1$, prior $\alpha=\beta=2$. MLE: $1/10 = 0.10$. MAP: $2/12 \approx 0.167$. The prior pulls toward 0.5 — sensible when data is scarce.
As data grows: with $n=1000$, $k=100$: MAP $= 101/1002 \approx 0.1008$, MLE $= 0.100$. The two pseudo-counts become negligible; MAP → MLE.
Pseudo-count interpretation: $\alpha-1$ is your "imaginary" prior heads, $\beta-1$ your prior tails. Beta(2,2) is like having seen one head and one tail before any real data.
In linear regression, we often minimize a regularized loss. Where does the regularization term come from? It is exactly the MAP prior:
- L2 regularization (Ridge): $\min_w \|Xw - y\|^2 + \lambda\|w\|^2$. This is MAP with a Gaussian prior $w_j \sim \mathcal{N}(0, 1/\lambda)$. Taking $-\log$ of the prior gives $\frac{\lambda}{2}\|w\|^2$ up to a constant.
- L1 regularization (Lasso): $\min_w \|Xw - y\|^2 + \lambda\|w\|_1$. This is MAP with a Laplace prior $w_j \sim \text{Laplace}(0, 1/\lambda)$. The Laplace prior has heavier tails near zero — it encourages sparsity because the $|w|$ penalty makes exactly-zero solutions optimal at kinks.
Why does this matter? When an interviewer asks "why does L1 encourage sparsity and L2 doesn't?", the Bayesian answer is: the Laplace prior has a sharp peak at zero, so the mode of the posterior tends to have exact zeros. The Gaussian prior is smooth at zero, so it shrinks but does not zero out. This is a principled answer, not a geometric coincidence.
"What happens to the MAP estimate as the sample size grows?" — Answer: the log-prior term becomes negligible relative to the growing log-likelihood; MAP converges to MLE. The prior only matters when $n$ is small relative to the strength of the prior.
"You trained Ridge regression. What prior does that imply?" — Gaussian prior on weights, centered at zero, with variance $1/\lambda$. The regularization strength $\lambda$ encodes how tightly you believe weights should stay near zero.
"Why is MLE biased for the variance estimator?" — The MLE of $\sigma^2$ for a Gaussian is $\frac{1}{n}\sum(x_i-\bar x)^2$, which has expectation $\frac{n-1}{n}\sigma^2$. The bias arises because we estimate $\mu$ from the same data, consuming one degree of freedom.
- MSE = Bias² + Variance. Unbiased ≠ best; sometimes a small bias buys large variance reduction.
- MLE: write likelihood, log it, differentiate, solve. For Bernoulli: $k/n$. For Gaussian mean: $\bar x$ (same as least squares).
- MAP = MLE + log-prior. With Beta prior on Bernoulli: pseudo-counts shift the MLE. Prior fades as $n\to\infty$.
- L2 regularization ↔ Gaussian prior. L1 ↔ Laplace prior. Sparsity from L1 = sharp mode of Laplace at zero.
An estimator is any function of data; judge it by bias, variance, MSE, and consistency. MLE picks the parameter that maximizes the probability of your data — for Bernoulli it gives the sample fraction, for Gaussian it gives the sample mean, both provably efficient. MAP adds a prior and becomes MLE in disguise for large $n$; the Beta-Bernoulli case reveals the pseudo-count interpretation. Regularization is MAP: L2 = Gaussian prior, L1 = Laplace prior — the L1 sparsity story is just the Laplace mode being exactly zero.
Q1. What is the difference between bias and variance in an estimator, and why does the trade-off matter?
Q2. Derive the MLE for the parameter p of a Bernoulli distribution.
Q3. Why does minimizing sum-of-squared errors in regression correspond to MLE under Gaussian noise?
Q4. What is the Beta-Bernoulli conjugate pair and why is conjugacy useful?
Q5. You train Ridge regression with penalty λ. What Bayesian prior does this correspond to, and what happens as λ→0 and λ→∞?
Q6. Why does L1 regularization produce sparse weights while L2 does not?
Q7. What does Fisher information measure, and how does it relate to the variance of any estimator?
Q8. How does the MAP estimate behave as the sample size n → ∞?
Q9. What is the MLE of the variance for a Gaussian, and is it biased? By how much?
Q10. A new estimator has lower MSE than the MLE but is biased. Would you use it?
Confidence intervals are the most widely misunderstood output of classical statistics. This chapter builds the z-interval from the CLT step by step, explains the frequentist meaning precisely, contrasts it with what a Bayesian credible interval actually means, covers the t-interval for small samples and the Wilson interval for proportions, and ends with the bootstrap — the universal tool for when the math gets too hard.
Suppose you want to estimate a population mean $\mu$. You collect a sample, run a formula, and output an interval $[L, U]$. You call it a "95% confidence interval." What does that guarantee?
The correct statement: The procedure that generated $[L, U]$ is designed so that, if you repeated the entire experiment many times — drawing new samples, computing new intervals each time — 95% of those intervals would contain the true $\mu$.
The interval in your hands is fixed. Either $\mu$ is inside it (probability 1) or it is not (probability 0). You do not know which. The 95% is a property of the procedure, not of this particular interval.
- Wrong: "There is a 95% chance that $\mu$ lies in $[2.1, 3.4]$." — $\mu$ is fixed and unknown; it is not random. The interval is the random thing.
- Wrong: "95% of data points fall in the CI." — The CI is about the mean, not the distribution of individual observations (that would be a prediction interval, which is wider).
- Wrong: "If we repeat the study and get a different CI, the first one was wrong." — Both could be valid 95% CI procedures. Across many repetitions, 5% of valid CIs will not contain $\mu$.
- Wrong: "A narrower CI is always better." — A CI can be made arbitrarily narrow by using a lower confidence level, or by assuming a distribution that does not hold. Width only makes sense relative to the confidence level and model validity.
- Wrong: "p-value ≤ 0.05 ↔ 95% CI excludes zero." — This equivalence holds for two-sided tests of $H_0: \mu=0$ but not in general for other nulls or one-sided tests.
Start from the CLT: for i.i.d. data with mean $\mu$ and variance $\sigma^2$, the standardized sample mean converges in distribution to standard normal:
Step 1: For large $n$, $P(-1.96 \leq Z \leq 1.96) \approx 0.95$ (standard normal table: 1.96 cuts off 2.5% each tail).
Step 2: Substitute $Z$ and rearrange:
Result: the 95% z-interval is $\bar x \pm 1.96\,\hat\sigma/\sqrt{n}$, where $\hat\sigma$ estimates $\sigma$ from the data.
Numeric example: 100 users, sample mean session time = 4.2 min, sample std dev = 2.0 min. 95% CI: $4.2 \pm 1.96 \cdot (2.0/\sqrt{100}) = 4.2 \pm 0.39 = [3.81, 4.59]$ minutes.
What moves the width: half-width $= z^* \cdot \sigma/\sqrt{n}$. To halve the CI width, you need $4\times$ the sample size. To target a specific width $w$: $n = (2z^*\sigma/w)^2$.
In practice, $\sigma$ is unknown and is replaced by the sample std dev $s$. When $n$ is large, $s \approx \sigma$ and the z-interval works fine. When $n$ is small, the estimation error in $s$ adds extra uncertainty — the t-distribution captures this.
Rule of thumb: use the t-distribution whenever $n < 30$ and the data are reasonably close to normal. For skewed distributions, bootstrap (below) is safer than t for small $n$.
For a proportion $p$ (e.g., click-through rate), the natural CI is the Wald interval: $\hat p \pm z\sqrt{\hat p(1-\hat p)/n}$. This works fine for moderate $p$ and large $n$, but it has a known failure mode:
If you observe 0 successes in 20 trials, $\hat p = 0$ and the Wald CI is $[0, 0]$ — a point estimate with zero width. But you have real uncertainty! The true $p$ could be 0.05 and you just got unlucky. The Wald interval is over-confident at the boundaries because it uses $\hat p$ to estimate the variance of $\hat p$, which collapses to zero when $\hat p = 0$.
Wilson interval (the fix): invert the score test rather than the Wald test. The formula:
In practice: for $\hat p \in [0.1, 0.9]$ and $n \geq 30$, Wald is fine. Use Wilson whenever $n\hat p < 5$ or $n(1-\hat p) < 5$ — i.e., you have few successes or few failures. In A/B testing on low-volume pages (e.g., 2% CTR, $n=100$), always use Wilson.
The CLT-based z-interval requires knowing (or approximating) the distribution of $\hat\theta$. For the mean, the CLT does this. For a median, a ratio, or an AUC, the sampling distribution is complicated. The bootstrap sidesteps this by treating your sample as a proxy for the population and resampling from it.
Core idea: you have one sample of size $n$. You cannot afford another experiment. But you can simulate the experiment by drawing $n$ observations with replacement from your existing sample. Each bootstrap resample is a plausible dataset you might have gotten; computing $\hat\theta$ on each gives you a simulated sampling distribution of $\hat\theta$.
- Compute $\hat\theta$ on original data of size $n$.
- For $b = 1, \dots, B$ (typically $B=1000$–$10000$):
- Draw $n$ samples with replacement from the original data → bootstrap sample $X^{(b)}$.
- Compute $\hat\theta^{(b)}$ on $X^{(b)}$.
- The bootstrap distribution of $\hat\theta^{(b)}$ approximates the sampling distribution of $\hat\theta$.
- 95% CI: take the 2.5th and 97.5th percentiles of $\{\hat\theta^{(b)}\}$ (percentile bootstrap), or use $\hat\theta \pm 1.96 \cdot \text{std}(\{\hat\theta^{(b)}\})$ (bootstrap standard error).
Never: resample without replacement — that just gives you the original sample permuted, not a new plausible dataset.
import numpy as np
def bootstrap_ci(data, stat_fn, B=5000, alpha=0.05):
n = len(data)
boot_stats = []
for _ in range(B):
resample = np.random.choice(data, size=n, replace=True)
boot_stats.append(stat_fn(resample))
lower = np.percentile(boot_stats, 100 * alpha / 2)
upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
return lower, upper
# Example: CI for the median
data = np.array([2.3, 1.1, 5.7, 3.2, 4.8, 2.9, 6.1, 1.8])
lo, hi = bootstrap_ci(data, np.median)
print(f"95% bootstrap CI for median: ({lo:.2f}, {hi:.2f})")
| Situation | Bootstrap works? | Why |
|---|---|---|
| Median, quantile, AUC | Yes | No analytic sampling distribution; bootstrap approximates it well. |
| Ratio metrics (revenue/user) | Yes | CLT for ratios requires delta method; bootstrap is easier and nearly as accurate. |
| Weird statistics (Gini, correlation) | Yes | Same reason. |
| Maximum (or minimum) | No | Extremes depend on the tail, which the empirical distribution misrepresents. |
| Very small n (< 20) | Caution | Empirical distribution is a poor proxy for the population; parametric bootstrap (assume a model) is better. |
| Dependent data (time series) | Block only | i.i.d. resampling breaks dependence structure; use block bootstrap. |
"Explain what a 95% CI means to a non-statistician." — "If we ran this study 100 times with new samples each time, about 95 of those intervals would contain the true value. This specific interval either contains it or doesn't — we just don't know which."
"Why is the bootstrap useful?" — It approximates the sampling distribution of any statistic without needing a closed-form formula. Especially valuable for medians, AUC, ratios, or any custom metric where analytic variance formulas are unknown.
"When would you use a t-interval vs a z-interval?" — Use t when $n < 30$ and data are approximately normal (the t accounts for uncertainty in estimating $\sigma$). For large $n$, both give essentially the same answer. For non-normal small samples, bootstrap is safer than either.
- 95% CI = the procedure covers truth 95% of repetitions. The specific interval's $\mu$ is either in or out — no probability statement about it.
- z-interval: $\bar x \pm 1.96\,\hat\sigma/\sqrt{n}$. To halve width, quadruple $n$.
- Use t instead of z when $n < 30$ and data near-normal. Use Wilson instead of Wald for proportions near 0 or 1.
- Bootstrap: resample with replacement $B$ times, compute statistic each time, take 2.5th/97.5th percentiles. Fails for extremes and tiny samples.
A 95% CI is a statement about the procedure, not the particular interval: 95% of such intervals cover the truth. Build z-intervals from the CLT ($\bar x \pm 1.96\,\hat\sigma/\sqrt{n}$); use t for small $n$; use Wilson for proportions near 0 or 1. When the statistic is complex (median, AUC, ratio), bootstrap — resample with replacement, compute the stat $B$ times, take percentiles — is the universal fallback, except for extremes and tiny samples.
Q1. Your A/B dashboard shows "CTR = 5.2% ± 0.3% (95% CI)". A PM says "there's a 95% chance the true CTR is between 4.9% and 5.5%." Correct them precisely.
Q2. Build the 95% CI for a mean from n=100, x̄=42, s=10 — and justify z vs t.
Q3. Why does the Wald interval for a proportion fail near p = 0 or 1, and what do you use instead?
Q4. Bootstrap the 95% CI for a MEDIAN in pseudocode, and say why the formula-based approach is hard.
Q5. When does the bootstrap FAIL, concretely?
Q6. Your metric is a ratio (revenue per click). Why is the naive per-row CI wrong, and what are the two fixes?
Q7. n=400, CI width is 2 points; the PM wants width 0.5. How many samples, and what's the general law?
Q8. Two 95% CIs overlap. The difference isn't significant, right?
Q9. The "learn more" trap: you compute a CI from the same data you used to select the metric that looked best. What's wrong?
Q10. Interviewer: "Why ±1.96 — where does it come from, in one breath?"
Chapters 10–11 built estimators and intervals; this chapter answers the binary question "is this effect real or noise?" You'll get the actual logic of testing (not the ritual), a precise p-value definition with the five classic misreadings, the Type I/II error trade and what moves power, the four workhorse tests each with a worked number, and the multiple-testing trap that quietly invalidates most dashboard "wins." Chapter 13 then puts all of it into production as A/B testing.
Every hypothesis test runs the same four-step program. (1) Assume a boring world — the null hypothesis $H_0$ ("the coin is fair", "the new model is no better"). (2) Pick a test statistic that measures discrepancy from that boring world. (3) Compute how surprising your observed statistic would be if the boring world were true — that probability of surprise is the p-value. (4) If the data is too surprising (p below a pre-chosen threshold $\alpha$), reject $H_0$ in favor of the alternative $H_1$. It's proof by contradiction with probability instead of certainty: we never prove $H_1$; we show $H_0$ strains to explain the data.
The court-trial analogy maps every piece:
The asymmetry is deliberate: $H_0$ gets the benefit of the doubt because false discoveries are expensive. That's also why "fail to reject" is the correct verdict language — a not-guilty verdict doesn't mean the defendant was proven innocent.
Tiny example: you suspect a coin favors heads, flip it 100 times, and see 60 heads. Under $H_0$ (fair coin), heads count is roughly $\mathcal{N}(50, 25)$, so $z = (60-50)/5 = 2.0$. One-sided p-value: $P(Z \ge 2.0) \approx 0.023$. Reading: if the coin were fair, only about 2.3% of experiments would show 60 or more heads. At $\alpha = 0.05$ we reject fairness. Note what we did not compute: the probability that the coin is fair.
- "p is the probability $H_0$ is true." No. p is $P(\text{data this extreme} \mid H_0)$, not $P(H_0 \mid \text{data})$ — flipping that conditional needs a prior and Bayes' rule (ch14). A p of 0.04 can coexist with $H_0$ being probably true if true effects are rare.
- "p = 0.04 means a 4% chance this finding is a fluke." No. The false-discovery rate among your rejections depends on power and on how often nulls are true — it can easily exceed 30% even when every test uses $\alpha = 0.05$.
- "p > 0.05 means no effect / $H_0$ is true." No. Absence of evidence is not evidence of absence — an underpowered test misses real effects routinely. p = 0.4 means "this data can't distinguish the hypotheses," nothing more.
- "Smaller p means bigger effect." No. p conflates effect size with sample size: with n in the millions, a 0.01% lift yields p $< 10^{-6}$. Always report the effect size and CI alongside p.
- "p is the probability the result replicates." No. Replication probability depends on the true effect and the replication's power; p = 0.05 findings replicate far less than 95% of the time.
Two ways to be wrong: convict the innocent (Type I, probability α — reject a true H0) and acquit the guilty (Type II, probability β — miss a real effect). Power = 1 − β is the probability you detect an effect that's really there.
| Knob | Turn it… | Power | The price |
|---|---|---|---|
| Sample size n | up | ↑ (SE shrinks as 1/√n) | Time and traffic |
| Effect size δ | bigger real effects | ↑ | Not yours to choose — but you CAN target bolder changes |
| α | up (e.g., .05→.10) | ↑ | More false positives |
| Outcome variance σ² | down (CUPED, better metric) | ↑ | Engineering effort |
Underpowered studies are double poison: they miss real effects, AND the effects they do "find" are exaggerated (only lucky overestimates cross the bar — the winner's curse).
Run 20 independent tests at α = .05 on true nulls: P(at least one false positive) = 1 − 0.95²⁰ ≈ 64%, and you EXPECT one "discovery." Segments × metrics × weekly looks multiply into dozens of tests without anyone deciding to test anything.
Rule: FWER (Bonferroni) when a single false claim is expensive; FDR (BH) when you're screening many leads for follow-up.
- Name H0 and the test that matches the data type (means → t; counts → chi-square; paired → paired t).
- Check the test's load-bearing assumptions out loud (independence! — clustered data needs clustered methods).
- Report three numbers, not one: effect size, its 95% CI, and p. Then judge PRACTICAL significance: a p=0.001 win of +0.01% CTR may be worthless.
- Disclose the testing context: how many metrics/segments/looks — and correct accordingly.
Never: say just "p < 0.05 so it's real." That phrase, alone, fails the round.
Hypothesis testing is a controlled-error decision procedure: assume H0, measure how surprising the data is (p), and reject only if it's surprising enough — accepting an α rate of false alarms and a β rate of misses you chose via the power knobs. The p-value is NOT the probability H0 is true, significance is NOT importance, and 20 uncorrected tests manufacture a discovery from nothing. Effect size + CI + p, assumptions checked, corrections disclosed — that trio is the literate answer.
Q1. Define the p-value exactly, then give the two most damaging misreadings.
Q2. Why does an underpowered study EXAGGERATE the effects it finds?
Q3. n=49, x̄=106ms, s=14ms against the claim μ=100ms. Run the test and state the conclusion in plain words.
Q4. When is a paired design dramatically better than two-sample, and what's the ML instance?
Q5. Your arm counts are 50.6%/49.4% on 100k users. Coincidence?
Q6. One-sided or two-sided test for an A/B — and the abuse to watch for?
Q7. Why does independence failure (clustering) wreck a t-test, and what's the magnitude?
Q8. BH vs Bonferroni on 100 segment tests with 5 true effects — what happens under each?
Q9. "Statistically significant but practically meaningless" — construct the example and the decision rule.
Q10. Interviewer hands you p = 0.04 and asks "so there's a 96% chance the feature works?" Give the 30-second correction with a Bayesian bridge.
This chapter walks one experiment end to end — metric choice, randomization unit, a real sample-size calculation, the readout, the decision — then catalogs the deadly sins that invalidate tests in practice (peeking, metric fishing, Simpson's paradox, novelty effects) and the production-grade upgrades: CUPED variance reduction, the delta method for ratio metrics, and cluster randomization for interference. This is the most interview-loaded chapter in Part III: every product data-science and ML loop asks some version of "design an A/B test for X."
The OEC (Overall Evaluation Criterion) is the single metric the test is allowed to move and the one your ship/no-ship decision keys on. Choosing it after seeing results is the original sin of experimentation — with 20 metrics on a dashboard, one will be "significant" at α=0.05 by luck alone (Chapter 12's multiple-testing arithmetic, now with a shipping decision attached).
A good OEC is (a) sensitive enough to move in a 2-week window, (b) causally close to the change, and (c) aligned with long-term value. Revenue is aligned but noisy and slow; click-through rate is sensitive but gameable (clickbait raises CTR while destroying trust). Running example for this chapter: a new ranking model for a feed, OEC = session click-through rate proxy: fraction of users who click at least one recommended item, baseline 5.0%.
Guardrails are metrics that must NOT regress: latency (p95), crash rate, unsubscribe rate, ad revenue. They get tested too — usually as non-inferiority checks ("rule out a drop bigger than 1%") rather than "detect any improvement." A test that wins the OEC but trips a guardrail does not ship.
The randomization unit is the thing you flip the coin for. Two common choices:
"Randomization unit must be ≥ analysis unit." If you randomize by user but report a per-session metric, sessions within a user are correlated and the naive SE is wrong (too optimistic). Fix with the delta method (below) or bootstrap over users. Interviewers love this trap: "we randomized users but measured per-impression CTR — what's wrong with the t-test?"
The deeper assumption behind any unit choice is SUTVA — no interference: my treatment assignment doesn't change your outcome. Feeds with social sharing, marketplaces, and ride-sharing all violate it (treatment users buying up inventory makes control look worse, inflating the measured effect). We return to this at the end with cluster randomization.
Plain words first: the smaller the effect you want to detect, the more users you need, because you must shrink the noise (standard error ∝ $1/\sqrt{n}$) until the effect stands out from it. The MDE (minimum detectable effect) is the smallest lift you'd care about shipping — a product decision, not a statistics one. Ours: baseline conversion $p_1 = 5.0\%$, MDE = 0.2 percentage points (so $p_2 = 5.2\%$, a 4% relative lift), two-sided $\alpha = 0.05$, power $1-\beta = 80\%$.
Plug in, step by step:
- $(z_{\alpha/2}+z_{\beta})^2 = (1.96+0.84)^2 = 2.8^2 = 7.84$
- $p_1(1-p_1) = 0.05 \times 0.95 = 0.0475$; $p_2(1-p_2) = 0.052\times 0.948 = 0.0493$; sum $= 0.0968$
- $(p_2-p_1)^2 = (0.002)^2 = 4\times 10^{-6}$
- $n = \dfrac{7.84 \times 0.0968}{4\times 10^{-6}} \approx \dfrac{0.7589}{0.000004} \approx 189{,}700$ per arm → round up: ≈190,000 per arm, ≈380,000 users total.
Sanity-check the size: detecting a 0.2pp move on a 5% base rate is hunting a 4% relative change in a coin that comes up heads 1 time in 20 — of course it takes ~400k flips. If the site randomizes ~40,000 eligible users/day at full allocation, that's ~10 days of traffic; run 14 days anyway to cover two full weekly cycles (weekend users behave differently than weekday users).
"What happens to $n$ if the MDE were 0.1pp instead?" Effect halves → denominator shrinks 4× → $n \approx 760{,}000$ per arm. Interviewers check you know $n \propto 1/\text{MDE}^2$, and that you treat MDE as the smallest effect worth shipping, not a guess at the true effect.
Run: hash user IDs (with an experiment-specific salt — reusing one hash across experiments correlates assignments) into 50/50 arms; before unblinding the OEC, run an SRM check (sample-ratio mismatch): a chi-square test that the realized split is actually 50/50. Getting 189,000 vs 191,000 when you expected equal arms is a red flag for broken assignment or differential logging loss — the most common silent killer of real experiments. SRM fails → throw the test away, no exceptions.
Analyze: after 14 days, control: 9,500 conversions / 190,000 ($\hat p_C = 5.00\%$); treatment: 9,956 / 190,000 ($\hat p_T = 5.24\%$). Two-proportion z-test (Chapter 12):
Decide: report effect size with uncertainty, never p alone: lift = +0.24pp, 95% CI [+0.10pp, +0.38pp] ($0.0024 \pm 1.96\times 0.000715$), p ≈ 0.0008. The CI sits entirely above zero and its lower end (+0.10pp) — check it against the MDE bar and launch cost. Guardrails clean, SRM clean → ship. Note the honest caveat: the CI is wide relative to the point estimate; the true lift could plausibly be half or 1.5× what we measured.
The α=0.05 guarantee is for one look at the data. Peek daily and decide "stop, it's significant!" whenever $|z|>1.96$, and you are running many correlated tests, taking the maximum excursion of the z-statistic over time. Under $H_0$ the cumulative z wanders like a random walk around 0 — give it 10 chances to touch ±1.96 and it will, far more than 5% of the time. Simulation: checking 10 times during a null experiment and stopping at the first "significant" result yields a false-positive rate of ≈19%, not 5%. Peek continuously forever and the random walk crosses any fixed bar with probability → 1: every flat null "wins" eventually.
The asymmetry that makes peeking poisonous: a true effect crossed-then-uncrossed still ends significant at day 14, but a null that grazes the line gets frozen as a win the moment you stop. Stopping rules harvest noise. Fixes, in order of practicality:
- Don't peek — fix the horizon at design time, look once. Simple, valid, what most teams should do.
- Alpha-spending / group-sequential tests (O'Brien–Fleming, Pocock): pre-plan K looks and spend the 5% budget across them with wider early bars (e.g., first look needs |z| > 4, final look ≈ 2.0). Early stopping for huge wins/harms stays legal.
- Always-valid inference (mSPRT / confidence sequences, the "anytime-valid" machinery behind Optimizely-style dashboards): intervals that hold simultaneously over all times, so you may stop whenever you like — at the price of wider intervals.
HARKing on 20 metrics. The flat-OEC test where "engagement among Android users aged 25–34 is up, p=0.03!" With 20 metrics × a handful of segments, dozens of tests run implicitly; at α=0.05 you expect spurious wins (Chapter 12). Rule: one pre-registered OEC decides; secondary metrics and segments generate hypotheses for the next test, or must survive an FDR correction.
Simpson's paradox. Segment readouts can tell the exact opposite story from the aggregate when arm composition differs across segments. Concrete table — treatment loses in every segment yet wins overall:
| Segment | Control | CTR | Treatment | CTR |
|---|---|---|---|---|
| Mobile | 90 / 1,000 | 9.0% | 765 / 9,000 | 8.5% |
| Desktop | 360 / 9,000 | 4.0% | 38 / 1,000 | 3.8% |
| Overall | 450 / 10,000 | 4.5% | 803 / 10,000 | 8.03% |
Treatment is worse on mobile (8.5% < 9.0%) and worse on desktop (3.8% < 4.0%), yet "wins" overall (8.03% > 4.5%) — purely because the treatment arm is 90% mobile (the high-CTR segment) while control is 90% desktop. Nothing here is a math error; the aggregate is confounded by segment mix. In a properly randomized test both arms get the same mix, so this table is itself diagnostic: composition this lopsided means broken randomization (and the SRM check per segment would have caught it — e.g., a staged rollout that ramped treatment on mobile first).
Novelty and primacy effects. A shiny new button gets clicked because it's new (novelty: effect decays); a redesigned flow looks bad because users must relearn it (primacy: effect grows). Both make week-1 results lie about steady state. Diagnose by plotting the treatment effect by day or by user-exposure cohort: a real effect is flat, novelty slopes down, primacy slopes up. Mitigations: run ≥2 weeks, weight later days, or analyze only users past their first few exposures.
- Hypothesis & OEC: one primary metric, declared before launch.
- Guardrails: latency, integrity, revenue — must-not-regress bounds.
- Randomization unit: user (default); cluster if interference; consistent hashing for stable assignment.
- Power & duration: n from baseline rate, MDE, α=.05, power=.8 — and run ≥2 full weeks regardless, for weekly cycles and novelty decay.
- Variance reduction: CUPED with pre-period data; trigger-based analysis if only some traffic is affected.
- Run discipline: no peeking (or pre-chosen sequential boundaries); SRM check (sample-ratio mismatch = broken assignment, invalidates everything).
- Analysis: effect + CI (not just p), guardrail readout, key segments (with multiple-testing care), days-curve for novelty.
- Decision: pre-agreed ship/iterate/abandon criteria; long-term holdback if the change is big.
Never: answer with only "split traffic 50/50 and t-test" — steps 2, 3, 6 are where real experiments die.
A/B testing is the CLT industrialized — but the statistics are the easy half. The hard half is discipline: pick the OEC and guardrails first, randomize at the unit where interference lives, size from the MDE (5%→5.2% CTR needs ~100k/arm, not 1k), never peek without sequential corrections, check SRM before believing anything, and read segments as hypotheses, not conclusions (Simpson's paradox is real and routine). CUPED is the free lunch; interference is the silent killer; the 8-step recital covers both.
Q1. Size this test: baseline CTR 5%, MDE 0.2pp (5.0→5.2%), α=0.05, power 80%. Ballpark n per arm.
Q2. Why does peeking inflate false positives — give the intuition and the legal alternatives.
Q3. Construct a Simpson's-paradox readout: treatment wins overall but loses every segment. Numbers, please.
Q4. What is SRM and why does it invalidate the whole experiment rather than just adding noise?
Q5. CUPED with ρ = 0.6 between pre-period and experiment metric: what's the variance reduction and the sample-size implication?
Q6. Marketplace test: treated buyers get better search. Why is user-randomization wrong, and what's the design?
Q7. The treatment effect is +4% in week 1, +1.5% in week 2, +0.3% in week 3. Read it.
Q8. You ran 20 metrics; "time in app for Android tablet users" hit p=0.03. Ship the story?
Q9. Your metric is revenue-per-user, which is 95% zeros and wildly skewed. Does the t-test even work, and what's better?
Q10. PM: "The test is at p=0.06 after the planned run. Can we extend it a week to get it over the line?"
This chapter closes the statistics part by grounding Bayesian reasoning in the methods that actually run in production: Beta-Bernoulli conjugacy, credible intervals, Bayesian A/B testing, Thompson sampling, and empirical-Bayes shrinkage. Every technique here solves a concrete problem that frequentist tools handle awkwardly or not at all — small-sample multi-arm decisions, continuous peeking, and pooling estimates across segments.
The debate often generates more heat than light. Here is what actually differs:
| Question | Frequentist | Bayesian |
|---|---|---|
| What is "probability"? | Long-run frequency of repeatable events | Degree of belief, updated by evidence |
| What is random? | The data (θ is fixed, unknown) | Both data and θ (θ has a distribution) |
| What does a 95% interval mean? | The procedure covers the true θ in 95% of repeated experiments | There is 95% posterior probability that θ lies in this interval |
| How is prior knowledge used? | Not formally — only via study design | Explicitly as a prior distribution |
| Result format | Point estimate + CI + p-value | Full posterior distribution |
| When it shines | Large-n, well-designed experiments, regulatory settings | Small n, sequential decisions, many parameters, need for uncertainty quantification |
| When it struggles | Sequential peeking; small-n many-group problems | Prior choice matters; computation can be expensive |
"Bayesian is subjective, frequentist is objective" — Both require choices. Frequentist analysis requires choosing the test statistic, stopping rule, and significance threshold. Bayesian analysis requires choosing the prior. Neither is assumption-free. The real question is: which assumptions are easiest to make explicit and defend for this problem?
A conjugate prior is a prior whose functional form is preserved under the likelihood — the posterior belongs to the same family. This turns inference into simple parameter arithmetic.
Setup. We are estimating the bias θ of a coin. We choose a Beta prior because it lives on [0,1] and has a natural pseudo-count interpretation:
Choosing the prior. We start with Beta(2,2) — weakly centered on 0.5, expressing mild belief the coin is fair. The mean is 2/(2+2) = 0.5. This is equivalent to having seen 1 imaginary head and 1 imaginary tail (the shape α−1 = 1, β−1 = 1).
Observing data. We flip the coin 10 times and see 7 heads, 3 tails. The likelihood of this data given θ is Binomial:
Updating (Bayes' rule). Multiply prior by likelihood:
What changed? The posterior mean is 9/(9+5) = 0.643. The prior mean was 0.5; the MLE would be 7/10 = 0.7. The posterior sits between them, pulled toward the MLE by data but tempered by the prior. With only 10 flips, the prior still matters.
The conjugate update rule says: posterior parameters = prior parameters + data counts. This gives a clean mental model called pseudo-counts:
This is why regularization and Bayesian priors are the same thing in different clothes: L2 regularization on logistic regression is exactly MAP inference under a Gaussian prior; a Beta(α,β) prior on a Bernoulli rate is exactly adding α+β−2 pseudo-observations.
These sound identical but mean entirely different things.
- Frequentist 95% confidence interval: A procedure that, if repeated on many datasets, would contain the true θ in 95% of cases. For this specific interval, θ is either in it or not — no probability statement about this interval is valid. The randomness is in the procedure, not in θ.
- Bayesian 95% credible interval (highest-density interval, HDI): Given this data and this prior, there is a 95% posterior probability that θ lies in this interval. The randomness is in θ (it has a posterior distribution). This is the statement people intuitively want to make.
Numeric example. Beta(9,5) posterior for our coin: the 95% equal-tailed credible interval is approximately [0.38, 0.86]. Correct interpretation: "Given our prior and 7/10 heads, we believe θ ∈ [0.38, 0.86] with 95% probability." Wrong interpretation of a frequentist CI: "There's a 95% chance the true θ is in [0.42, 0.98]."
In practice: with flat (non-informative) priors and large n, credible and confidence intervals are numerically close. The difference matters most in small-n settings and when the prior is informative.
Classical A/B testing answers: "Is the difference statistically significant?" Bayesian A/B testing answers: "What is the probability that B is better than A, and by how much?"
Setup. Variant A has prior Beta(αA, βA) and observes nA conversions from NA trials. Variant B similarly. After observing data:
Concrete example. Start both arms at Beta(1,1). After the experiment, A has 40 conversions / 200 trials (posterior Beta(41,161)), B has 52 / 200 (posterior Beta(53,149)). Sample 100k draws from each posterior; compute the fraction of times θB > θA. If that fraction is 0.97, you say: "There is a 97% posterior probability that B's true rate exceeds A's."
Choosing between arms (ads, headlines, ranking tweaks): sample each arm's conversion rate from its posterior, play the arm whose SAMPLE wins, update. Uncertain arms get explored exactly in proportion to their probability of being best — exploration falls out of Bayes for free.
« 8 lines, the whole algorithm »
alpha = [1, 1]; beta = [1, 1] # Beta(1,1) priors per arm
for t in range(100000):
theta = [random.betavariate(alpha[k], beta[k]) for k in (0, 1)]
k = 0 if theta[0] > theta[1] else 1 # play the sampled best
reward = pull(k) # 0 or 1
alpha[k] += reward
beta[k] += 1 - reward
Production cameos: ad/creative selection, artwork choice, explore slots in recommenders. Strengths: no peeking problem, automatically tapers exploration. Caveats: delayed/batched rewards need care, and nonstationary rates need discounting or sliding windows.
Early-season: one player bats 9/20 (.450), another 60/200 (.300). Best guess at true skill? NOT the raw averages — small samples scatter wildly around true skill (chapter 8's law of total variance). Fit a prior from ALL players (league skill ≈ Beta centered ~.27), then each player's posterior mean = weighted blend of their data and the league prior, weights ∝ sample size. The 9/20 hitter shrinks hard toward .27; the 60/200 hitter barely moves. Stein's lesson: shrinking EVERYONE toward the center beats trusting each raw average — uniformly.
Recsys cash-out: item CTR = (clicks + α)/(impressions + α + β) with α, β fit from the catalog — exactly this machinery; "smoothed CTR" is empirical Bayes wearing a production name.
- Lots of data, repeated decisions, error-rate guarantees wanted → frequentist machinery is simpler and audited (A/B platforms).
- Small data, strong prior knowledge, or sequential decisions → Bayesian (cold-start priors, bandits, hierarchical estimates).
- Stakeholder asks "probability this is better?" → that's a posterior; answer it with Bayes or translate the question.
- Always state the prior and show sensitivity — a Bayesian answer whose prior is hidden is just opinion with integrals.
Bayes turns priors plus data into posteriors; conjugacy (Beta-Binomial) makes the update arithmetic — pseudo-counts in, posterior out, MAP→MLE as data swamps the prior. Credible intervals say what everyone wishes confidence intervals said. Thompson sampling turns posteriors into an explore/exploit policy with no peeking problem, and empirical Bayes shrinkage is the production workhorse: estimate the prior from the population, shrink every small-sample estimate toward it, win uniformly.
Q1. Prior Beta(2,2), observe 7 heads in 10 flips. Posterior, posterior mean, and the pseudo-count story?
Q2. Credible vs confidence interval — one sentence each, and when do they numerically agree?
Q3. Why does Bayesian A/B testing have "no peeking problem," and what's the honest caveat?
Q4. L2 regularization = which prior? L1? Derive the L2 case in two lines.
Q5. Implement the smoothed-CTR formula and pick α, β for a catalog whose global CTR ≈ 2% with effective strength 100 impressions.
Q6. Thompson sampling vs ε-greedy vs UCB in one breath each — and which ships most in industry?
Q7. Your Bayesian readout says P(B>A) = 96%. The frequentist test says p = 0.11. How can both be true, and what do you tell the PM?
Q8. When does empirical-Bayes shrinkage actively mislead?
Interviewers recycle the same dozen puzzles dressed in different costumes. This chapter strips the costumes off: you will learn the seven patterns that generate almost every classic, together with the rule for recognizing each pattern in the first ten seconds and the 30-second answer you deliver before diving deeper. Two or three fully worked classics per pattern anchor the rules in memory.
Memorising 30 individual puzzles is fragile — a slight rewordering breaks recall. Internalising 7 patterns is robust: a new costume on a familiar skeleton is immediately recognisable. The workflow is always the same:
- Classify — which pattern does this smell like?
- State the 30-second answer — headline number or formula, out loud, immediately.
- Show the reasoning — derive it cleanly; the interviewer is watching your process.
- Sanity-check — plug in extreme values; does the answer make sense at the limits?
When the problem has a uniform exchangeability structure — every seat, card, or person is interchangeable — you can skip counting and appeal directly to symmetry. The probability that some specific element lands in a particular slot equals 1/(number of slots).
Trigger: "N people/cards/items are arranged randomly — what's the probability that [one specific item] ends up in [a specific position or relationship]?"
- Argue symmetry: all arrangements are equally likely, so any particular item occupies any particular slot with probability 1/N.
- If the event is a relationship (e.g., "A is before B"), reduce: exactly half of all arrangements have A before B, so P = 1/2.
- State the answer immediately; then justify with a two-line counting argument if pushed.
Never: enumerate arrangements for large N. Symmetry is exact and instantaneous.
30-second answer: The last passenger sits in their own seat with probability 1/2.
Setup. 100 passengers, 100 seats. Passenger 1 picks a random seat (theirs or someone else's). Each subsequent passenger sits in their assigned seat if it is free, otherwise picks randomly among the remaining seats. What is the probability passenger 100 ends up in seat 100?
Deeper reasoning. The key insight: the process ends the first time either seat 1 or seat 100 is chosen at random. By symmetry, these two outcomes are equally likely whenever a random choice occurs, regardless of how many other passengers have sat down. Therefore P(passenger 100 gets seat 100) = 1/2.
Formal proof by induction (give if pushed). Let $p_n$ = probability the last passenger gets their seat with $n$ passengers. For $n=2$: passenger 1 picks randomly, P = 1/2. For $n>2$: passenger 1 picks seat 1 (prob 1/n, last gets their seat), seat n (prob 1/n, last does NOT), or seat $k$ for $2 \le k \le n-1$ (prob $(n-2)/n$, reduces to subproblem of size $n-k+1$, which by induction also gives 1/2). So $p_n = 1/n + 0 + \tfrac{n-2}{n}\cdot\tfrac{1}{2} = \tfrac{1}{2}$ for all $n \ge 2$.
Sanity check. $n=2$: passenger 1 picks randomly, so P = 1/2. ✓
30-second answer: The expected number of cards you flip before seeing the first ace is 48/5 = 9.6.
Setup. A standard 52-card deck is shuffled uniformly at random. You flip cards one by one. How many non-ace cards appear before the first ace?
Symmetry argument. Think of the 52 cards as 4 aces and 48 non-aces. Consider the 5 "gaps" formed by the 4 aces in a random arrangement: the gap before the first ace, the gaps between consecutive aces, and the gap after the last ace. By symmetry, each of the 48 non-aces is equally likely to fall in any of the 5 gaps. So the expected number before the first ace is 48/5.
Alternative — indicator method. For each non-ace card $i$, let $I_i = 1$ if card $i$ appears before all 4 aces. Among card $i$ and the 4 aces (5 cards total), by symmetry $P(I_i=1)=1/5$. By linearity:
When the process is sequential and random, condition on what happens at time 1 and write a recursion. This turns a hard global question into a tractable local one. Most "how many flips/steps until…" questions live here.
Trigger: "Expected number of flips/rolls/steps until [pattern/event]" — especially when the target is a sequence like HH or HTH.
- Define states that capture all the memory you need (which prefix of the target pattern you have seen so far).
- Write E[steps from state s] = 1 + (weighted average of next states).
- Solve the system of equations — usually 2–4 equations.
- State the final number and explain the key asymmetry (overlap vs no overlap).
Never: try to enumerate all possible flip sequences. The recursion is the way.
30-second answer: HH takes 6 flips; HT takes 4 flips. They differ because HH has an "overlap" that resets progress.
Why they differ — the overlap insight. After seeing H, a T in the HT case completes the target. A T in the HH case doesn't waste the H — wait, it does: you need HH so a T resets to the start. But a second H completes. For HH: after seeing H and then H, you're done; but after seeing H and then T, you're fully back to start (no overlap). For HT: after seeing H and then H, you're still in state "last flip was H" — the new H is potentially the start of HT. So HT has an overlap that HH lacks, making HT easier to achieve.
HT — 3-state solution. States: $S_0$ = start, $S_H$ = last flip was H, $S_{HT}$ = done.
- From $S_0$: flip H (prob 1/2) → $S_H$; flip T (prob 1/2) → $S_0$. So $E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0$.
- From $S_H$: flip T (prob 1/2) → done; flip H (prob 1/2) → $S_H$. So $E_H = 1 + \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}E_H$.
Solving: $E_H = 2$; $E_0 = 1 + \tfrac{1}{2}\cdot 2 + \tfrac{1}{2}E_0 \Rightarrow E_0 = 4$.
HH — 3-state solution. States: $S_0$ = start, $S_H$ = last flip was H, $S_{HH}$ = done.
- From $S_0$: flip H → $S_H$; flip T → $S_0$. So $E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0$.
- From $S_H$: flip H (prob 1/2) → done; flip T (prob 1/2) → back to $S_0$ (no overlap!). So $E_H = 1 + \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}E_0$.
Solving: $E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0$; $E_H = 1 + \tfrac{1}{2}E_0$. Substituting: $E_H = 1 + \tfrac{1}{2}(1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0) = \ldots$ → $E_0 = 6$.
30-second answer: Starting with $k$ dollars, playing against a casino with $N$ dollars total (fair coin), probability of ruin is $(N-k)/N$ and expected duration is $k(N-k)$.
Setup. A gambler starts with $\$k$. At each step: win $\$1$ (prob $p$) or lose $\$1$ (prob $q=1-p$). Game ends when fortune hits 0 (ruin) or $N$ (win). What is the probability of ruin?
Condition on first step. Let $r_k = P(\text{ruin} | \text{start at } k)$. Then $r_k = p \cdot r_{k+1} + q \cdot r_{k-1}$ with $r_0=1$, $r_N=0$. This is a second-order linear recurrence.
Fair game ($p = q = 1/2$). The recurrence $r_k = \tfrac{1}{2}r_{k+1} + \tfrac{1}{2}r_{k-1}$ has solutions $r_k = A + Bk$. Boundary conditions: $r_0=1 \Rightarrow A=1$; $r_N=0 \Rightarrow B = -1/N$. So:
Intuition. In a fair game, the gambler's fortune is a martingale — its expected value never changes. So $E[\text{final fortune}] = k$. With probability $r_k$ the final fortune is 0 and with probability $1-r_k$ it is $N$: so $k = 0 \cdot r_k + N(1-r_k) \Rightarrow r_k = (N-k)/N$. The martingale optional stopping theorem makes this rigorous.
Linearity of expectation holds even under dependence. Any "expected count" question can be written as a sum of indicator random variables, each with a simple probability. The indicators may be wildly dependent — it does not matter.
Trigger: "Expected number of [events/matches/collisions/runs]" where counting directly seems hard.
- Define $I_j = \mathbf{1}[\text{event } j \text{ occurs}]$ for each possible event.
- Compute $P(I_j = 1)$ — usually a simple symmetry or independence argument.
- Apply $E[\sum_j I_j] = \sum_j P(I_j=1)$ — no dependence check needed.
- Give the final sum; note that you leveraged linearity, which works even under dependence.
Never: try to compute the full joint distribution of all indicators. You don't need it.
30-second answer: The expected number of people who get their own hat back is 1, for any $n \ge 1$.
Setup. $n$ people check their hats; a careless attendant returns them in a uniformly random order. What is the expected number of people who receive their own hat?
Indicator argument. Let $I_i = 1$ if person $i$ gets their own hat. Then $P(I_i = 1) = 1/n$ by symmetry (person $i$'s hat is equally likely to be in any of the $n$ positions). By linearity:
Bonus — the derangement connection. The probability that nobody gets their hat ($P(\text{derangement})$) converges to $1/e \approx 0.368$ as $n \to \infty$. The expected count is 1 even though most of the time the actual count is 0 or 1 — a striking illustration of how expectation and typical value can diverge.
30-second answer: In a sequence of $n$ fair coin flips, the expected number of "runs" (maximal blocks of identical values) is $\frac{n+1}{2}$.
Setup. A run is a maximal consecutive block of the same symbol. HHTTHH has 3 runs. What is the expected number of runs in $n$ i.i.d. fair flips?
Indicator argument. The first flip always starts a run. For each subsequent flip $i = 2, \ldots, n$: $I_i = 1$ if flip $i$ differs from flip $i-1$, i.e., it starts a new run. $P(I_i=1) = 1/2$. So:
Sanity check. $n=1$: 1 run. $(1+1)/2 = 1$. ✓ $n=2$: either HH, TT (1 run, prob 1/2) or HT, TH (2 runs, prob 1/2), so E = 3/2 = (2+1)/2. ✓
30-second answer: To collect all $n$ distinct coupons (one per box, uniform), expected boxes needed is $n H_n = n(1 + 1/2 + 1/3 + \cdots + 1/n) \approx n \ln n$.
Indicator argument (waiting times). Let $T_k$ = additional boxes needed to go from $k-1$ distinct coupons to $k$ distinct. When you have $k-1$ coupons, probability the next box is new is $(n-k+1)/n$. So $T_k \sim \text{Geometric}((n-k+1)/n)$ and $E[T_k] = n/(n-k+1)$.
ML cameo. Coupon collector gives the expected number of samples before you see every class at least once — relevant for class-balanced sampling and estimating coverage in large categorical spaces.
$P(\text{at least one}) = 1 - P(\text{none})$. The complement is almost always easier to compute when events are independent (none = product of individual probabilities). The birthday problem is the canonical example.
Trigger: "What's the probability that at least one [collision/match/event] occurs?" — especially with independent trials.
- Compute P(no event at all) — usually a clean product over independent trials.
- Subtract from 1.
- If events are not independent, fall back to inclusion-exclusion (but complement of intersection works for many cases).
Never: enumerate all the ways at least one event can happen. The complement is the direct route.
30-second answer: With 23 people, P(shared birthday) exceeds 50%. With 57 people it exceeds 99%.
Why the number surprises. It's not about finding someone who shares YOUR birthday (that needs ~183 people). It's about any pair, and pairs grow as $\binom{n}{2}$.
Exact calculation. Assume 365 equally likely birthdays. P(no shared birthday among $n$ people):
At $n=23$: this product $\approx 0.493$, so P(collision) $\approx 0.507 > 50\%$.
Quick approximation. Using $\ln(1-x) \approx -x$ for small $x$:
Interview follow-up. "How many people until P > 99%?" Same formula: $n \approx \sqrt{2 \cdot 365 \cdot \ln 100} \approx 57$.
Some puzzles are designed to trick your intuition by hiding information. The cure is mechanical: write out the sample space explicitly, apply Bayes' rule, and be precise about exactly what information was revealed and how it was revealed. Intuition is optional; the calculation is not.
Trigger: "Given that [surprising information], what's the probability of [outcome]?" — Monty Hall, boy-or-girl, envelope variants.
- Write the full prior sample space (list all scenarios).
- Apply the observation as a filter: cross out scenarios inconsistent with what was revealed.
- Re-normalise over the remaining scenarios.
- Ask: was the information selection-biased? Monty always opens a goat door — that selection changes the posterior. A random stranger who happens to have a son — different from "I have at least one son."
Never: reason from pure intuition on paradox problems. Always enumerate and apply Bayes.
30-second answer: Always switch. Switching wins with probability 2/3; staying wins with probability 1/3.
Setup. Three doors: one car, two goats. You pick door 1. Monty (who knows) opens a goat door (say door 3). Should you switch to door 2?
Posterior calculation.
Monty opened door 3 in two of the three scenarios (with equal probability conditional on each). Given Monty opened door 3: P(car at door 2 | door 3 opened) = (1/3)/(1/2) = 2/3.
The 1000-door amplifier. Still unconvinced? Imagine 1000 doors: you pick door 1 (1/1000 chance of winning). Monty opens 998 goat doors, leaving door 473. Now switch? Of course — door 473 has prob 999/1000. Monty's knowledge concentrates all the "non-1" probability into the one remaining door.
Why intuition fails. The key is that Monty's choice is not random — he has information and acts on it. This selection-bias is exactly what Bayes' rule captures.
30-second answer: The answer depends critically on how you learned the information. Two phrasings, two answers.
Version 1: "I have two children. One is a boy. What's P(both boys)?"
Sample space of two children (equally likely): BB, BG, GB, GG. Ruling out GG: {BB, BG, GB}. P(BB | at least one boy) = 1/3.
Version 2: "I have two children. The older is a boy. What's P(both boys)?"
Sample space conditioning on the older child: (older=B, younger=B) or (older=B, younger=G). P(BB | older is boy) = 1/2.
The lesson. "At least one boy" is weaker information than "the older is a boy." The first reveals a property of the pair; the second reveals a property of a specific child. Specify exactly what is revealed and how — then calculate mechanically.
Common error: treating "I have at least one boy" and "I randomly selected one child and they're a boy" as the same. They're not. If you randomly sample one child and reveal they are a boy, the posterior is 1/2 (same as Version 2). The selection mechanism changes the posterior.
30-second answer: There is no switching strategy that improves your expected payoff — the apparent "gain by switching" argument contains a hidden error.
The puzzle. Two envelopes: one has $\$X$, the other $\$2X$. You open one and see amount $m$. The other has either $2m$ or $m/2$. Seemingly: E[switch] = $\tfrac{1}{2}(2m) + \tfrac{1}{2}(m/2) = 1.25m > m$. Switch!
The error. The two cases cannot both be equally likely for every value of $m$. If $X$ has a prior distribution, P(other = 2m | saw m) ≠ 1/2 in general — it depends on the prior over $X$. For a proper (integrable) prior on $X$, the gain from switching integrates to zero. The paradox exists because you're applying an improper uniform prior over all positive reals, which isn't a valid probability distribution.
When the sample space is a geometric region (line segment, square, circle), probability = (favorable area)/(total area). Draw the region, shade the favorable set, compute the ratio.
Trigger: "Two points/times chosen uniformly on [0,1] (or [0,L])…" — meeting problems, stick-breaking, triangle inequalities.
- Draw the unit square (or appropriate region) representing all $(x, y)$ pairs.
- Shade the favorable region: express the constraint as an inequality in $x$ and $y$.
- Compute shaded area / total area.
Never: try to compute this with discrete approximations or integrals if geometry gives the answer directly.
30-second answer: Break a stick at two random points; probability the three pieces form a triangle is 1/4.
Setup. Choose two break points $X, Y$ uniformly on $[0,1]$. WLOG assume $X < Y$; the three pieces have lengths $X$, $Y-X$, $1-Y$.
Triangle inequality conditions. Three lengths form a triangle iff each is less than 1/2 (since the longest side must be less than the sum of the other two, which equals 1 minus the longest). So all three conditions reduce to: $X < 1/2$, $Y - X < 1/2$, $1 - Y < 1/2$.
Area calculation. In the full unit square (not just $X < Y$), by symmetry: P(triangle) = 1/4. The favorable region is a triangle with vertices at $(0, 1/2)$, $(1/2, 1/2)$, $(1/2, 1)$, which has area $= \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 1/8$ of the upper triangle (where $Y > X$), doubling for the full square gives 1/4.
30-second answer: Two people each arrive uniformly in $[0, 60]$ minutes and wait 15 minutes. P(they meet) = 7/16.
Setup. Let $X$ and $Y$ be arrival times, uniform on $[0, 60]$. They meet if $|X - Y| \le 15$.
Geometric solution. Total area = $60^2 = 3600$. The non-meeting region consists of two triangles where $X - Y > 15$ or $Y - X > 15$, each with legs $60 - 15 = 45$. Non-meeting area = $2 \cdot \tfrac{1}{2} \cdot 45^2 = 45^2 = 2025$. P(meet) = $(3600 - 2025)/3600 = 1575/3600 = 7/16$.
A martingale is a process whose expected future value, given the present, equals the present value. This is an enormously powerful tool: if you can identify a martingale, optional stopping gives you the expected stopping time or probability for free — without solving a recursion explicitly.
The drunkard's walk. A drunkard starts at position $k$ on a number line, steps right (+1) or left (−1) with equal probability. Absorbing barriers at 0 and $N$. The position itself is a martingale.
Trigger: "Expected steps/time until absorption" or "probability of reaching A before B" in a symmetric random walk.
- Identify the martingale: for a fair walk, $X_t$ (position) is a martingale. For "expected steps," $X_t^2 - t$ is also a martingale.
- Apply optional stopping: $E[X_\tau] = E[X_0]$ where $\tau$ is the stopping time.
- Express $E[X_\tau]$ in terms of boundary probabilities (P(hit $N$) vs P(hit $0$)) and solve.
- For expected steps, use the second martingale $X_t^2 - t$.
Never: try to compute the distribution of $\tau$ directly for the expected value — the martingale trick bypasses it entirely.
30-second answer: Starting at $k$, P(reach $N$ before 0) = $k/N$. Expected steps to absorption = $k(N-k)$.
Probability via martingale. $X_t$ (position) is a martingale (fair walk). By optional stopping: $E[X_\tau] = X_0 = k$. At $\tau$, $X_\tau = N$ (prob $p$) or $X_\tau = 0$ (prob $1-p$). So $Np + 0 = k \Rightarrow p = k/N$.
Expected steps via the second martingale. $M_t = X_t^2 - t$ is a martingale (each step: $E[X_{t+1}^2] = \tfrac{1}{2}(X_t+1)^2 + \tfrac{1}{2}(X_t-1)^2 = X_t^2 + 1$, so $E[M_{t+1}] = M_t$). At stopping time $\tau$: $E[X_\tau^2 - \tau] = X_0^2 - 0 = k^2$. But $E[X_\tau^2] = N^2 \cdot (k/N) + 0 \cdot (1-k/N) = kN$. Therefore $E[\tau] = kN - k^2 = k(N-k)$.
Sanity check. $k = N/2$ (starting in the middle): P(win) = 1/2 and E[$\tau$] = $(N/2)^2 = N^2/4$ — maximum expected duration at the symmetric starting point. ✓
30-second answer: Biased walk with $p \ne 1/2$; P(ruin from $k$) = $[(q/p)^k - (q/p)^N]/[1 - (q/p)^N]$ where $q = 1-p$.
Martingale for biased walk. For $p \ne 1/2$, position $X_t$ is no longer a martingale (it has drift). Instead, $M_t = (q/p)^{X_t}$ is a martingale (verify: $E[M_{t+1}|X_t] = p(q/p)^{X_t+1} + q(q/p)^{X_t-1} = (q/p)^{X_t}[p(q/p) + q(p/q)] = (q/p)^{X_t}$). By optional stopping:
Solving with $P(\text{reach }N) + P(\text{ruin}) = 1$ gives P(ruin from $k$) = $\frac{(q/p)^k - (q/p)^N}{1 - (q/p)^N}$. When $p < 1/2$ and $N \to \infty$: P(ruin) → 1 (house always wins).
| Pattern | Signal phrase | Tool | Key insight |
|---|---|---|---|
| 1 · Symmetry | "randomly arranged," "uniform" | P = 1/N or 1/2 | All positions interchangeable |
| 2 · First step | "expected flips until," "first time" | State machine + recurrence | Overlap = longer wait |
| 3 · Indicators | "expected number of" | $\sum P(I_j=1)$ | Works under dependence |
| 4 · Complement | "at least one," "any pair" | 1 – P(none) | P(none) = clean product |
| 5 · Conditioning paradox | "given that," "Monty," "boy-girl" | Enumerate + Bayes | Ask: how was info selected? |
| 6 · Geometric | "uniform on [0,1]," "arrive randomly" | Favorable area / total area | Draw the square, shade, compute |
| 7 · Martingale | "random walk," "absorption," "ruin" | Optional stopping theorem | Find the right martingale |
- Name the pattern in the first 10 seconds; the answer structure follows from the pattern.
- Linearity of expectation holds even under dependence — the indicator method always works for "expected count" questions.
- HH takes 6 flips; HT takes 4 flips — the overlap asymmetry is the favourite follow-up question.
- Monty Hall: the host's knowledge and non-random selection is the entire mechanism — model it with Bayes.
- Geometric probability: draw the square, shade the favorable region, divide areas.
Seven patterns cover essentially all classic probability puzzles. Symmetry and indicators are the fastest weapons; first-step recursion handles sequential processes; complement sidesteps "at-least-one" counting; Bayes is the antidote to conditioning paradoxes; area ratios solve continuous uniform problems; martingales give absorption probabilities and expected durations without heavy computation. Classify, state the 30-second answer, then derive cleanly.
Q1. A fair coin is flipped until HH appears. A second coin is flipped until HT appears. Which finishes first on average, and by how much?
Q2. You pick one of three doors. Monty opens a door you didn't pick that has a goat. What if Monty doesn't know where the car is and opens a random door — should you still switch?
Q3. 100 people in a circle; each person simultaneously picks a random neighbor. Expected number of pairs that pick each other?
Q4. A deck of 52 cards is shuffled. What is the expected number of cards you flip before seeing the first spade?
Q5. Two points are chosen uniformly on a circle of circumference 1. What is the expected arc length of the shorter arc?
Q6. You roll a fair die repeatedly. What is the expected number of rolls to see all 6 faces?
Q7. A drunkard starts at position 3 on a line with absorbers at 0 and 10. What is the expected number of steps until absorption?
Q8. From a shuffled deck, you draw cards one at a time. What is the probability that the 4th card drawn is the first ace?
Q9. n people sit in a row at random. Two specific people, Alice and Bob, are in the group. What is the probability that they sit adjacent?
Q10. You repeatedly flip a fair coin. What is the probability you see HHH before you see THH?
Q11. The two-child problem: I have two children and at least one is a boy born on Tuesday. What is P(both boys)?
Q12. You and a friend each choose a random integer from 1 to N. What is the expected number of rounds until you choose the same number?
This capstone chapter connects every statistical concept from the page to real ML practice, then stress-tests your recall with 25 rapid-fire Q→A pairs. We trace the loss ↔ likelihood correspondence, explain regularization as Bayesian priors, tie bias-variance to estimator theory, and close with the sampling pathologies that silently corrupt models. Finish here and you can speak fluent statistics in any ML design interview.
Every supervised loss function is the negative log-likelihood of some noise model. Minimising loss = maximising likelihood. The table below is the Rosetta Stone; memorise it.
| Loss function | Noise / likelihood model | Canonical ML task | Key property |
|---|---|---|---|
| Squared error $\tfrac{1}{2}(y-\hat y)^2$ | Gaussian $y\mid x \sim \mathcal{N}(\hat y,\sigma^2)$ | Linear/neural regression | Penalises large errors quadratically; non-robust to outliers |
| Binary cross-entropy $-[y\log\hat p+(1-y)\log(1-\hat p)]$ | Bernoulli $y\mid x \sim \text{Bern}(\hat p)$ | Logistic regression, binary classif. | Log-scale gradient; never saturates on wrong side with sigmoid |
| Categorical cross-entropy $-\sum_k y_k \log\hat p_k$ | Categorical $\mathbf{y}\mid x \sim \text{Cat}(\hat{\mathbf{p}})$ | Softmax classifiers, language models | Equivalent to KL divergence from one-hot to predicted distribution |
| MAE / L1 loss $|y-\hat y|$ | Laplace $y\mid x \sim \text{Laplace}(\hat y,b)$ | Robust regression | Heavier tails → robust to outliers; non-smooth at 0 (subgradients needed) |
| Huber loss (quadratic near 0, linear far) | Huber / pseudo-Huber distribution | Object detection, robust regression | Best of both: smooth + outlier-resistant; threshold $\delta$ is a hyperparameter |
| KL divergence $D_{\text{KL}}(p\|q)$ | Exponential family likelihood | VAE ELBO, knowledge distillation | Asymmetric; $D_{\text{KL}}(p\|q)\neq D_{\text{KL}}(q\|p)$; zero-forcing vs mass-covering |
MAE is not differentiable at zero, so gradient-based optimisers need subgradients and converge more slowly. More importantly, the Laplace assumption has heavier tails than Gaussian, so outlier gradients are clipped linearly, which is good for robustness but bad for precision on well-behaved data. Choose based on the noise model, not just robustness desire.
Adding a regulariser to the loss is identical to adding a prior and doing MAP estimation instead of MLE. The prior encodes your belief about parameter scale before seeing data.
| Regularizer | Equivalent prior on $\theta$ | Effect on weights | Sparsity? |
|---|---|---|---|
| L2 / Ridge $\lambda\|\theta\|_2^2$ | Gaussian $\theta_j \sim \mathcal{N}(0,\,1/2\lambda)$ | Shrinks weights toward 0; all stay non-zero | No |
| L1 / Lasso $\lambda\|\theta\|_1$ | Laplace $\theta_j \sim \text{Laplace}(0,\,1/\lambda)$ | Shrinks and zeroes out weights | Yes — exact zeros |
| ElasticNet $\lambda_1\|\theta\|_1+\lambda_2\|\theta\|_2^2$ | Mixture of Laplace + Gaussian | Sparse but handles correlated features better than pure L1 | Yes — partial |
| Dropout (rate $p$) | Approximate Bayesian posterior (Gal & Ghahramani 2016) | Ensemble-like; acts as adaptive L2 | No |
Worked intuition. Ridge: the Gaussian prior says "weights near zero are exponentially more likely than large weights." So maximising log-prior = $-\lambda\|\theta\|^2$, which is the same as subtracting L2 penalty from the log-likelihood. As $n\to\infty$, the likelihood dominates and MAP → MLE; as $n\to 0$, MAP collapses to the prior mean (zero).
"You switch from L2 to L1 regularisation and training loss barely changes but the number of non-zero features drops by 80%. What happened and is that good?" — Answer: L1 MAP under a Laplace prior has a non-smooth penalty that corners at zero, creating sparse exact-zero solutions. Whether it's good depends on whether interpretability/inference cost matters; L1 can be unstable with correlated features.
The bias-variance tradeoff in ML is exactly the bias-variance decomposition of a statistical estimator. The MSE of any estimator $\hat\theta$ decomposes as:
More data reduces variance (estimators concentrate), but does not reduce bias. If your model is misspecified (wrong functional form), no amount of data fixes systematic error. "More data cures variance; better model cures bias."