FOUNDATIONS · PROBABILITY & STATISTICS

Probability & statistics — the working engineer’s course

From counting to the CLT to A/B testing — the full probability-and-statistics stack behind ML interviews, in textbook order (Blitzstein/Hwang flavor) but ruthlessly practical. Every concept arrives the same way: a tiny example with real numbers first, then the formal version with every symbol glossed, then where it shows up in ML and interviews. Part IV is pure interview craft: the puzzle patterns and the rapid-fire bank.

4 parts 16 chapters 5 reflexes drilled throughout Puzzle patterns not memorized answers
01
PART I · PROBABILITY

How to use this + the map

🎯Probability asks "given a model, what data do I expect?" — statistics runs the arrow in reverse: "given data, what model fits?" — and every interview question is one of these two problems.

This chapter orients you before the deep dives begin. You'll see the fundamental split between probability and statistics, learn exactly which four skill-clusters interviewers probe, and internalize five tactical reflexes that will fire automatically when you're under pressure. Everything on this page gets exercised repeatedly in later chapters.

The two directions: probability vs statistics

Think of a coin with bias p. If someone hands you p = 0.7 and asks "what's the chance of 8+ heads in 10 flips?" — that is a probability question. You know the model; you predict data.

If instead someone hands you the sequence H H T H H H T H H H and asks "is this coin fair?" — that is a statistics question. You have data; you infer the model.

Probability
Model → Data. Given a generative story, compute likelihoods, expectations, distributions of outcomes.
Statistics
Data → Model. Given observations, estimate parameters, test hypotheses, build intervals.
Why both matter in ML
Training loss is statistics (fit a model to data). Generating predictions is probability (run the fitted model forward). A/B testing is statistics. Bayesian inference is the loop: start with a model, observe data, update the model.
The two-arrow diagram: a generative model on the left, observed data on the right; probability arrow points right, statistics arrow points left.
What interviews actually test

After surveying hundreds of ML interview reports, the questions cluster into four archetypes. Know the archetype and you know the playbook.

1 · Probability puzzles
Compute P(event) in a combinatorial or conditional setup. Classic examples: birthday problem, Monty Hall, "at least one six in 4 rolls." The examiner is checking whether you can set up a sample space, not whether you've memorized answers.
2 · Estimator reasoning
Derive or critique an estimator. "What's the MLE for a Gaussian?" "Is the sample variance unbiased?" "When does MAP reduce to MLE?" These are the most common Staff+ questions.
3 · Hypothesis testing & A/B
Design or critique an experiment. "How do you size the sample?" "What does a p-value of 0.03 actually mean?" "What's wrong with peeking?" Interviewers penalize cargo-cult answers ("p < 0.05 means it's real") hard.
4 · Statistical thinking probes
Open-ended: "Our model's precision went up but recall went down — is that better?" "You see a 20% lift in the experiment — what else do you want to check?" No formula required; judgment, skepticism, and knowledge of failure modes are the test.
📐 When the question is ambiguous — the meta-rule

Trigger: The interviewer poses a probability or statistics problem without specifying the setup.

  1. State out loud which archetype it is: "This feels like a conditional-probability puzzle — let me define the sample space first."
  2. Clarify any ambiguous quantities (replacement? ordered? per-user or per-session?). Asking is a signal of rigor, not ignorance.
  3. Use the 5-reflex protocol (below) to move systematically.

Never: jump to a formula before defining what probability space you're working in.

The 5 reflexes — your always-on toolkit

These five moves cover the vast majority of probability puzzles and estimation questions. Internalize them as a mental checklist, not as a linear recipe — sometimes you'll use 1, 3, and 5; sometimes just 4.

Reflex 1 · Define the sample space
Before any arithmetic, name every possible outcome explicitly for small examples, or describe the space precisely for large ones. For 2 dice: "Ω = {(1,1),(1,2),…,(6,6)}, 36 equally likely outcomes." This single step eliminates the majority of puzzle errors.
Reflex 2 · Name the distribution
Ask: "does this situation have a name?" Counting successes in n Bernoulli trials → Binomial. Waiting for the first success → Geometric. Rare events per unit time → Poisson. Named distributions come with free formulas for mean, variance, and tail bounds.
Reflex 3 · Try indicators + linearity
If the question asks for an expected count ("expected number of …"), write the count as a sum of 0/1 indicator random variables and use E[sum] = sum of E[indicators]. This sidesteps dependence entirely — linearity of expectation holds even for dependent indicators.
Reflex 4 · Condition on the first step
For sequential or recursive problems ("expected flips to HH," "gambler's ruin"), condition on what happens at step 1, write a recurrence, solve. This converts a hard global problem into an easy local one.
Reflex 5 · Sanity-check with extreme cases
Plug in n=1, p=0, p=1, or a huge n and ask: does the answer match what common sense demands? A probability that doesn't lie in [0,1] is instantly wrong. An expected value that's negative for a count is wrong. This catches algebra errors before you commit.
Roadmap of this page

The chapters are ordered so each one builds on the previous. Here is where you are headed and why the sequence is the way it is.

Ch 2 · Counting
The arithmetic engine that turns sample spaces from "36 outcomes" to "how many 5-card hands?" Needed before conditional probability.
Ch 3 · Axioms, conditioning, independence
The formal foundation: what probability is, what conditioning means, what independence really requires. Contains the chain rule and the law of total probability — two tools used in almost every later chapter.
Ch 4 · Bayes' rule
The hinge between probability and statistics: how you update beliefs given evidence. Foundations for MLE, MAP, Naive Bayes, and Bayesian A/B.
Ch 5–6 · Random variables & the distribution zoo
Attach numbers to outcomes; build a vocabulary of the 10 distributions you must know by heart.
Ch 7–8 · Expectation, variance, joints
The quantitative workhorse: expected value, variance, covariance, the indicator superpower, tower rule. This is where most estimator-reasoning interview questions live.
Ch 9 · Inequalities, LLN, CLT
Why sample means converge and how fast; the Gaussian approximation; how confidence intervals are born.
Ch 10–11 · Estimators & CIs
MLE derivation from scratch, MAP, bias-variance tradeoff, and how to build and interpret confidence intervals.
Ch 12–13 · Hypothesis testing & A/B
p-values done right, the multiple-testing problem, and an end-to-end A/B test design from metric choice to ship decision.
Ch 14 · Bayesian statistics in practice
Conjugacy, credible intervals, Thompson sampling, and empirical Bayes — the tools behind modern experimentation platforms.
Ch 15–16 · Interview craft
Classic puzzle patterns with 30-second answer templates, and a rapid-fire Q&A over the whole page.
⚠ Clears up: "probability" vs "likelihood"

In everyday English these words are synonyms. In statistics they are NOT. Probability is a function of data given a fixed model: P(data | θ) treated as a random-variable statement. Likelihood is the same expression viewed as a function of θ given fixed data: L(θ) = P(data | θ). Likelihood is NOT a probability distribution over θ — it does not integrate to 1. Confusing the two is a common trap on MLE questions.

⚠ Clears up: correlation ≠ causation (and what to say when asked)

Nearly every "statistical thinking" probe is really asking whether you know the confounders. The safe template: "I see an association. Before concluding causation, I want to check (1) temporal order — does X precede Y? (2) alternative common causes — is there a Z driving both? (3) whether randomization or a natural experiment can help." Saying this template out loud signals strong statistical maturity.

✓ Remember
  • Probability: model → data. Statistics: data → model. Every ML system does both.
  • Four interview archetypes: puzzles, estimator reasoning, testing/A/B, statistical thinking.
  • Five reflexes: sample space → distribution → indicators → condition on first step → extreme cases.
  • "Likelihood" is P(data|θ) viewed as a function of θ; it is NOT a probability over θ.
TL;DR

Probability and statistics are inverse problems. Interviews probe four skill clusters: puzzles, estimator reasoning, A/B/testing, and open-ended statistical judgment. Arm yourself with five reflexes — define the sample space, name the distribution, use indicators, condition on the first step, sanity-check extremes — and you'll have a systematic attack for any question this page throws at you.

Tricky interview questions — chapter 01
Q1. What is the difference between probability and statistics? Give a concrete ML example of each.
Probability takes a generative model as given and computes what data you'd expect — e.g., given that a click model has CTR = 0.05, what's P(zero clicks in 20 impressions)? Statistics runs the arrow the other direction: given 20 impressions with 1 click, estimate CTR and build a confidence interval. In ML, forward passes of a trained model are probability; fitting the model to training data is statistics. The distinction matters because probability questions have exact answers while statistics answers always carry uncertainty that must be quantified.
Q2. What does a p-value of 0.03 mean? (Give the precise definition, not the common shortcut.)
A p-value of 0.03 means: IF the null hypothesis were true, the probability of observing a test statistic at least as extreme as the one we observed is 3%. It is a statement about data under H₀, not about whether H₀ is true. Common wrong answers: "there is a 3% chance the null is true" (wrong — p-values don't put probability on hypotheses in frequentist statistics), or "there is a 97% chance the result is real" (wrong — that would require specifying a prior). The rigorous answer signals you won't cargo-cult the analysis.
Q3. An interviewer asks: "Is this a probability problem or a statistics problem?" You're given a trained Gaussian mixture model and asked to compute P(new point belongs to cluster 1). Which is it?
It is a probability question — specifically, a posterior probability computation under the fitted model. The model is treated as given (parameters fixed), and you compute P(cluster 1 | x, θ) using Bayes' rule over the mixture components. This equals π₁ · N(x; μ₁, Σ₁) divided by the sum over all components. Had the question instead asked "given these 500 points, fit the best-k GMM," that would be statistics (estimation). The arrow direction — model-to-data or data-to-model — is the test.
Q4. Which of the 5 reflexes would you fire first on: "What is the expected number of cards you must draw from a shuffled deck before seeing the first ace?"
Reflex 3 (indicators + linearity) combined with Reflex 1 (define sample space). The count "number of non-aces before the first ace" can be written as a sum of 48 indicator variables — one per non-ace card — where I_i = 1 if the i-th non-ace appears before all 4 aces. By symmetry, each non-ace is equally likely to be in any of the 5 "slots" relative to the 4 aces, so P(I_i = 1) = 4/5. Thus E[cards before first ace] = 48 · (4/5) = 38.4, and including the ace itself: 39.4. Reflex 3 avoids any complicated geometric-series derivation.
Q5. "Likelihood" and "probability" — when do they differ and why does the distinction matter for MLE?
P(data | θ) is a probability statement when θ is fixed and data is the random quantity — it integrates (or sums) to 1 over data. The likelihood L(θ) = P(data | θ) is the same expression with the data fixed and θ varying — it does NOT integrate to 1 over θ. MLE maximizes L(θ) over θ, which is valid because we're treating it as a function of θ, not as a probability distribution. Confusing them leads to the error of interpreting "the likelihood is highest at θ̂" as "θ̂ is the most probable parameter," which would additionally require a prior (that's MAP, not MLE).
Q6. You see that a model's F1 score increased from 0.72 to 0.75 on the test set. Is this improvement real? What do you want to know?
This is a "statistical thinking" probe. I want to know: (1) test set size — with n=50 this difference could easily be noise; with n=50,000 it's almost certainly real. (2) Are confidence intervals on both F1 scores overlapping? Bootstrap the test set to get CIs. (3) Is the test set representative — same distribution as production? (4) Are there subgroup disparities hidden in the aggregate? (5) Was this the only metric checked, or one of many — multiple comparisons inflate false positives. A 0.03 absolute improvement matters operationally only if we also quantify practical significance (business impact at the margin).
Q7. What is Reflex 4 (condition on first step) and when does it help that other methods don't?
Conditioning on the first step means: let E be the quantity you want, condition on all possible outcomes of step 1, and write E as a weighted average of "E given step-1 outcome" — producing a recurrence equation. This shines on sequential/recursive problems where no closed-form counting formula exists. Example: expected flips to see "HH". Let E = expected flips. After one flip: with prob 1/2 we see H (now need E_H more), with prob 1/2 we see T (reset, need E more). After HH we need HH from H state: with 1/2 we're done, with 1/2 we see T and reset. Setting up these two equations and solving gives E = 6, not 4 — a non-obvious result that linearity alone can't easily reach.
Q8. You are told that in a production ML system, the training loss went to zero but the test loss is high. Is this a probability problem, a statistics problem, or something else? Explain.
It is primarily a statistics (estimation/generalization) problem, but the diagnosis uses both directions. Statistics perspective: the estimator (the trained model) has high variance — it fits the training data distribution but fails to generalize, a bias-variance tradeoff failure. Probability perspective: the model is implicitly assuming a data-generating process (e.g., IID draws from some distribution) that is violated in practice, or the test distribution differs from the training distribution (covariate shift). The fix involves regularization (MAP with a prior), more data, or better matching of train and test distributions. Framing it correctly as "the estimator overfit" rather than "the model is bad" shows statistical fluency.
Q9. Explain in one sentence each what the 5 extremes-check reflex would catch in these scenarios: (a) you compute P(at least one head in 10 flips) = 1.024, (b) you compute E[number of 6s in one die roll] = 0.
(a) Probability exceeds 1 — impossible. The complement approach gives P = 1 − (1/2)^10 ≈ 0.999, never greater than 1; something in your arithmetic has a factor-of-something error. (b) Expected number of 6s in one roll = P(6) = 1/6 ≈ 0.167 — not zero. Getting zero would mean you confused "the probability that the roll equals 6" with "the expected count of 6s" after somehow zeroing the indicator. Extreme-case check: if you roll a die 6 million times, you'd expect about 1 million sixes — the expected count per roll must be positive.
Q10. A colleague says: "We saw a statistically significant lift in our A/B test, so we shipped the feature." What questions would you ask before agreeing this was the right call?
Statistical significance is necessary but not sufficient. I'd ask: (1) What was the effect size and CI — is a 0.001pp lift worth shipping operationally? (2) What was the OEC (overall evaluation criterion) — was it pre-registered, or the best of 20 metrics checked? (3) Were guardrail metrics (latency, error rates, downstream engagement) also checked? (4) Was there any peeking — repeated looks at the data mid-experiment? (5) Was the test run long enough to cover novelty effects and day-of-week patterns? (6) Are there subgroup effects — does it harm any user segment? Ship decisions require all of these, not just p < 0.05.
02
PART I · PROBABILITY

Counting without tears

🎯Every counting question reduces to one of four cells in a 2×2 grid: ordered or not, replacement or not — place it in the grid first, then the formula writes itself.

Counting is the arithmetic engine underneath probability: once you can enumerate a sample space, you can divide to get probabilities. This chapter builds from the simplest rule (multiply) to permutations, combinations, stars-and-bars, and inclusion-exclusion — always example first, formula second. Classic interview traps (committees vs arrangements, "at least one" vs direct counting) are called out explicitly.

The multiplication rule — where everything starts

Concrete first. You pick an outfit: 3 shirts and 4 pants. How many outfits? Draw a tree: for each of the 3 shirts you have 4 pant choices, giving 3 × 4 = 12 outfits. You never need to list them all.

Now generalize. If task 1 has n₁ outcomes and, regardless of what happened in task 1, task 2 has n₂ outcomes, then the two-step process has n₁ · n₂ outcomes. The "regardless" is key — outcomes of different steps must be independent of each other for simple multiplication to work.

$$n_1 \times n_2 \times \cdots \times n_k$$
Multiply the number of choices at each independent step. Works only when the count at each step does not depend on prior choices.

Three books on a shelf — your running example. Call the books A, B, C. How many ways to arrange all three? You have 3 choices for position 1, then 2 remaining for position 2, then 1 for position 3: 3 × 2 × 1 = 6. The six arrangements: ABC, ACB, BAC, BCA, CAB, CBA. We'll revisit this throughout.

Permutations — ordered arrangements, no replacement

Concrete first. From the 3 books A, B, C, choose 2 and arrange them in a row. Your choices: AB, BA, AC, CA, BC, CB — that's 6 ordered pairs. Using the multiplication rule: 3 choices for spot 1, 2 remaining choices for spot 2 → 3 × 2 = 6.

General formula. Picking k items in order from n distinct items without replacement:

$$P(n,k) = \frac{n!}{(n-k)!} = n \cdot (n-1) \cdots (n-k+1)$$
P(n,k): "permutations of n things taken k at a time." The numerator n! would arrange all n items; dividing by (n−k)! cancels the items we don't use. The right-hand product has exactly k factors.

For our example: P(3,2) = 3!/1! = 6. Full arrangement of all 3: P(3,3) = 3! = 6 (confirming the earlier count).

⚠ Clears up: n! vs P(n,k)

n! = P(n,n) = arrangements of ALL n items. P(n,k) is arrangements of a SUBSET of size k. Forgetting the (n−k)! denominator when only choosing k ≤ n items is the most common permutation error.

Combinations — unordered selections, no replacement

Concrete first. From books A, B, C, choose 2 to put in a bag (order doesn't matter). The choices: {A,B}, {A,C}, {B,C} — just 3. Compare to the 6 ordered pairs above: we've divided by 2! = 2 because each unordered pair was counted twice (AB and BA are the same bag).

General formula. Choosing k items from n without replacement, order not mattering:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{P(n,k)}{k!}$$
Read as "n choose k." Numerator n! arranges all n items; k! cancels the internal orderings within the chosen group (which don't count); (n−k)! cancels the unchosen items. Always an integer.

For our example: C(3,2) = 3!/(2!·1!) = 3. Confirm: {A,B}, {A,C}, {B,C}. ✓

Key symmetry: C(n,k) = C(n, n−k). Choosing which k to include is the same as choosing which n−k to exclude. C(10,3) = C(10,7) = 120.

Poker hands warm-up. A 52-card deck, 5-card hand, order doesn't matter: C(52,5) = 52!/(5!·47!) = 2,598,960 possible hands. That's the denominator for every poker probability.

📐 Permutation vs combination — the one-question test

Trigger: You're deciding which formula to use.

  1. Ask: "Does the order of selection matter here?" If picking a president, VP, and secretary from 10 people → YES (different people in different roles) → permutation P(10,3).
  2. If picking a 3-person committee from 10 people → NO (same 3 people in any order = same committee) → combination C(10,3).

Never: use C(n,k) when the problem specifies roles or positions — that's the single most common counting error.

The 4-case grid — place every problem here first

Counting problems come in exactly four flavors based on two binary questions: Is selection ordered? Is replacement allowed? The grid below maps each case to its formula and a canonical example.

With replacementWithout replacement
Ordered nk
Pick k items from n, each time putting it back; sequence matters.
Example: 4-digit PIN from digits 0–9: 104 = 10,000.
P(n,k) = n!/(n−k)!
Pick k items from n, no repeats; sequence matters.
Example: Gold/Silver/Bronze from 8 athletes: P(8,3) = 336.
Unordered C(n+k−1, k) ("stars and bars")
Pick k items from n types, repeats ok, order irrelevant.
Example: 3 donuts from 4 flavors: C(6,3) = 20.
C(n,k) = n!/(k!(n−k)!)
Pick k items from n, no repeats, order irrelevant.
Example: 5-card poker hand: C(52,5) = 2,598,960.
📐 Every counting question → 4-case grid first

Trigger: Any "how many ways" or "what is P(event)" combinatorics question.

  1. Ask: "Ordered or unordered?" (roles → ordered; groups/committees/sets → unordered)
  2. Ask: "With or without replacement?" (dice, coins → with; cards from a deck, people → without)
  3. Read off the formula from the matching cell.
  4. Sanity-check: for small n and k, can you enumerate by hand and confirm?

Never: reach for a formula before placing the problem in the grid.

Stars and bars — ordered with replacement, made visual

The donut problem. You walk into a shop with 4 flavors (glazed, chocolate, jelly, plain). You want to pick 3 donuts; you can pick any flavor more than once; the bag is unordered. How many distinct bags?

Represent each selection as 3 stars placed among 4 bins separated by 3 bars. Example: ★ | ★★ | | means 1 glazed, 2 chocolate, 0 jelly, 0 plain. The number of ways to arrange 3 stars and 3 bars among 6 positions is:

$$\binom{n+k-1}{k} = \binom{4+3-1}{3} = \binom{6}{3} = 20$$
n = number of distinct item types; k = number of items to pick. The formula counts sequences of k stars and n−1 separating bars, i.e., positions of the bars in a string of length n+k−1.

You can verify the small case: with 2 flavors and 2 donuts, C(3,2) = 3 bags: {GG, GC, CC}. ✓

⚠ Clears up: stars-and-bars vs multinomial

Stars-and-bars counts unordered multisets with replacement. The multinomial coefficient C(n; n₁,n₂,…,nk) = n!/(n₁!n₂!…nk!) counts ordered sequences (arrangements) with a fixed number of each type — e.g., arrangements of the letters in MISSISSIPPI. Different grid cell, different formula.

Inclusion-exclusion — counting with overlapping groups

The problem with "or." How many integers from 1 to 30 are divisible by 2 or by 3? Naively: 15 divisible by 2, 10 divisible by 3. Adding gives 25 — but that double-counts the 5 numbers divisible by 6. Correct answer: 15 + 10 − 5 = 20.

This is inclusion-exclusion for two sets: |A ∪ B| = |A| + |B| − |A ∩ B|. For three sets:

$$|A \cup B \cup C| = |A| + |B| + |C| - |A\cap B| - |A\cap C| - |B\cap C| + |A\cap B\cap C|$$
Add single sets, subtract all pairwise intersections (they were double-counted), add back triple intersections (they were over-subtracted). Pattern: alternate signs, odd-size collections get +, even-size get −.

Derangements — the classic application. A derangement is a permutation where no item is in its original position (think: a secret Santa where no one draws themselves). Let D(n) be the count. For n=3 books: total permutations = 6; the derangements of [A,B,C] are [B,C,A] and [C,A,B] — only 2.

Using inclusion-exclusion, the general formula is:

$$D(n) = n!\sum_{k=0}^{n}\frac{(-1)^k}{k!} \approx \frac{n!}{e}$$
Sum alternates signs; (−1)^k/k! are the inclusion-exclusion correction terms. As n grows, D(n)/n! → 1/e ≈ 0.368 — roughly 37% of random permutations are derangements regardless of n.

For n=3: D(3) = 3!(1 − 1 + 1/2 − 1/6) = 6 · (1/3) = 2. ✓

Classic traps — these appear directly in interviews
Trap 1: committee vs arrangement
"Choose a 3-person committee from 10" → C(10,3) = 120. "Elect a president, VP, and treasurer from 10" → P(10,3) = 720. A 6× difference. The signal: if titles or positions exist, it's ordered.
Trap 2: "at least one" → complement
P(at least one) = 1 − P(none). Never count "at least one" directly across cases. Example: P(at least one head in 5 flips) = 1 − (1/2)^5 = 31/32. Direct counting would require summing 5 cases (exactly 1, 2, 3, 4, 5 heads) — error-prone. Complement is one subtraction.
Trap 3: overcounting by symmetry
Circular arrangements: n objects in a circle have (n−1)! distinct arrangements (not n!), because rotations of the same arrangement are identical. Fix one object's position, then arrange the rest: (n−1)! ways.
Trap 4: birthday problem setup
P(all n people have distinct birthdays) = 365/365 · 364/365 · 363/365 · … · (365−n+1)/365. Each factor is "next person avoids all previous birthdays." P(at least one shared) = 1 − this product. At n=23, this exceeds 0.5 — surprising because people expect n ≈ 183.
⚠ The birthday problem intuition

The surprise is that you're counting pairs, not people. With 23 people there are C(23,2) = 253 pairs, each with a 1/365 chance of matching. 253/365 ≈ 0.69 — a rough indicator that a match is likely. (The exact calculation gives 0.507 at n=23.) Any interview question involving "how many people until probability of collision exceeds 50%" uses this logic — and it generalizes to hash tables, birthday attacks in cryptography, and duplicate-detection in data pipelines.

⚠ Clears up: "choose" notation pitfalls

C(n,0) = 1 (one way to choose nothing). C(n,n) = 1 (one way to choose everything). C(n,1) = n (n ways to choose one item). These three anchors let you sanity-check any combination formula: if your formula gives something other than these values at the extremes, you have an error.

Also: C(n,k) = 0 whenever k > n or k < 0. You can't choose more items than exist.

⚠ Clears up: sampling with vs without replacement in probability calculations

When items are drawn without replacement, the denominator of your probability calculation should also be "without replacement" — i.e., the number of ordered or unordered subsets, not the number of sequences with repeats. A common error: computing P(royal flush) using 52^5 in the denominator (with replacement) instead of C(52,5) = 2,598,960 (without replacement). The card game uses without replacement; always match numerator and denominator counting method.

✓ Remember
  • Four-cell grid: ordered/unordered × with/without replacement. Map the problem before touching a formula.
  • Permutations P(n,k) = n!/(n−k)!: ordered, no replacement. Combinations C(n,k) = n!/(k!(n−k)!): unordered, no replacement.
  • Stars-and-bars C(n+k−1,k): unordered, with replacement (multisets).
  • Inclusion-exclusion: |A∪B| = |A|+|B|−|A∩B|. "At least one" → complement.
  • D(n) ≈ n!/e — about 37% of permutations are derangements.
TL;DR

Counting is multiplication applied systematically. The 4-case grid (ordered vs unordered, replacement vs not) is the decision tree — land the problem in the right cell and the formula follows. Permutations count ordered subsets; combinations count unordered subsets; stars-and-bars handles unordered multisets. Inclusion-exclusion fixes overcounting in overlapping groups. The complement trick ("at least one" = 1 − P(none)) is almost always simpler than direct counting.

Tricky interview questions — chapter 02
Q1. In how many ways can 5 people sit in a row? In a circle?
Row: 5! = 120. Every person is in a distinct labeled position, so all orderings differ. Circle: fix one person to remove rotational symmetry, then arrange the remaining 4: (5−1)! = 4! = 24. The key insight is that in a circle, rotating everyone one seat produces the same physical arrangement; fixing one person's seat eliminates this overcounting. Grid placement: ordered (positions matter), without replacement → P(5,5) for row, (n−1)! for circle.
Q2. A password must be exactly 8 characters, each a lowercase letter or digit (36 characters total), with at least one digit. How many valid passwords are there?
Use the complement. Total 8-character strings (no restriction): 36^8. Strings with no digit (all letters, 26 choices each): 26^8. Valid passwords = 36^8 − 26^8 = 2,821,109,907,456 − 208,827,064,576 = 2,612,282,842,880. This is the classic "at least one of type X → complement" pattern. Direct counting would require summing over "exactly 1 digit," "exactly 2 digits," …, "exactly 8 digits" — 8 binomial terms with exponential factors, very error-prone.
Q3. How many 5-card poker hands contain exactly two pairs (and nothing better)?
Choose ranks for the two pairs: C(13,2) = 78 ways. Choose 2 suits for first pair: C(4,2) = 6. Choose 2 suits for second pair: C(4,2) = 6. Choose rank for the kicker (not matching either pair): 13−2 = 11 ranks. Choose suit for kicker: C(4,1) = 4. Total: 78 × 6 × 6 × 11 × 4 = 123,552. As a probability: 123,552/2,598,960 ≈ 4.75%. The subtlety is choosing the two pair-ranks with C(13,2) not P(13,2) — the two pairs are an unordered set of ranks.
Q4. How many arrangements of the letters in "MISSISSIPPI" are there?
Total letters: 11. Letter counts: M×1, I×4, S×4, P×2. This is a multinomial arrangement: divide total arrangements by factorials of repeated letters to avoid counting swaps of identical letters as new arrangements. Count = 11! / (1! × 4! × 4! × 2!) = 39,916,800 / (1 × 24 × 24 × 2) = 39,916,800 / 1,152 = 34,650. This is the "ordered with fixed counts" cell — a multinomial coefficient, not a standard combination.
Q5. You have 10 different books. In how many ways can you choose 3 for the left shelf and 3 for the right shelf (the two shelves are labeled; order on each shelf doesn't matter)?
Choose 3 books from 10 for the left shelf: C(10,3) = 120. Then choose 3 from the remaining 7 for the right shelf: C(7,3) = 35. Total: 120 × 35 = 4,200. Alternatively, this equals C(10,3) × C(7,3) = 10! / (3! × 3! × 4!) = 4,200. The key: the selections are sequential and the shelves are distinguishable (labeled), so we use two combinations multiplied together, not one. This pattern — sequential selections from a shrinking pool — appears in dealing cards to multiple players.
Q6. What is the birthday problem, and why does the answer surprise people?
With 23 people in a room, P(at least two share a birthday) > 50%. Most people expect this threshold to be around 183 (half of 365). The surprise comes from counting pairs: C(23,2) = 253 pairs, each independently having about a 1/365 chance of matching. The exact calculation: P(all distinct) = 365/365 × 364/365 × ... × 343/365 ≈ 0.493, so P(at least one match) ≈ 0.507. The lesson for ML interviews: collision probability grows with the square of the population (√365 ≈ 19), not linearly. This same "birthday paradox" governs hash collision rates and approximate nearest neighbor performance.
Q7. A bag has 5 red and 3 blue marbles. You draw 3 without replacement. What is P(exactly 2 red)?
Choose 2 red from 5: C(5,2) = 10. Choose 1 blue from 3: C(3,1) = 3. Total favorable: 10 × 3 = 30. Total ways to choose 3 from 8: C(8,3) = 56. P(exactly 2 red) = 30/56 = 15/28 ≈ 0.536. Grid placement: numerator = unordered, without replacement (combinations for each color), denominator = unordered, without replacement (total combinations). Consistency in the counting method is the key — both numerator and denominator must use the same convention.
Q8. You roll two dice. In how many outcomes does the sum equal 7? What is the probability?
Sample space: 6 × 6 = 36 equally likely ordered pairs (the dice are distinguishable). Pairs summing to 7: (1,6),(2,5),(3,4),(4,3),(5,2),(6,1) — exactly 6 outcomes. P(sum=7) = 6/36 = 1/6 ≈ 0.167. This is the most probable single sum for two dice; any other sum has fewer favorable outcomes. Note the grid: ordered (die 1 and die 2 are distinct positions), with replacement (each die independently shows 1–6) → n^k = 6^2 = 36 total. The ordering matters because (1,6) and (6,1) are distinct physical outcomes.
Q9. What is a derangement, and what fraction of all permutations are derangements as n → ∞?
A derangement of n items is a permutation where no item is in its original position — e.g., a secret Santa where nobody draws themselves. D(n) = n! × Σ(−1)^k / k! for k=0..n, which converges to n!/e as n → ∞. So the fraction D(n)/n! → 1/e ≈ 0.3679. This means about 37% of random permutations are derangements regardless of n for large n. Interview relevance: derangements appear in collision analysis, matching problems, and as a canonical inclusion-exclusion example. The 1/e limit also appears in the secretary problem (optimal stopping), making it doubly worth remembering.
Q10. A student randomly guesses on a 5-question multiple-choice test (4 options each). What is the probability of getting at least 3 correct?
Each question: P(correct) = 1/4, P(wrong) = 3/4. X ~ Binomial(n=5, p=1/4). P(X=k) = C(5,k) × (1/4)^k × (3/4)^(5−k). P(X=3) = C(5,3) × (1/4)^3 × (3/4)^2 = 10 × (1/64) × (9/16) = 90/1024. P(X=4) = C(5,4) × (1/4)^4 × (3/4)^1 = 5 × (1/256) × (3/4) = 15/1024. P(X=5) = C(5,5) × (1/4)^5 = 1/1024. Total: (90+15+1)/1024 = 106/1024 ≈ 10.35%. This is a Binomial application — Reflex 2 (name the distribution) converts a counting question into a formula evaluation.
Q11. How many ways can you distribute 7 identical candies among 3 children so that each child gets at least 1?
Give each child 1 candy first (to satisfy the "at least 1" constraint) — this uses 3 candies. Now distribute the remaining 4 identical candies among 3 children with no restriction (0 or more each). This is stars-and-bars: C(n+k−1, k) with n=3 types (children), k=4 candies: C(4+3−1, 4) = C(6,4) = 15 ways. The "give one to each first, then distribute remainder" trick converts "at least 1 per group" into an unconstrained stars-and-bars problem. Without the "at least 1" constraint, the answer would be C(7+3−1, 7) = C(9,7) = 36.
03
PART I · PROBABILITY

Axioms, conditioning, independence

🎯Conditional probability doesn't change the rules — it shrinks the universe.

This chapter builds the three pillars every probability calculation rests on: the axioms that make probability well-defined, the conditioning operation that lets you reason with partial information, and the independence property that makes large models tractable. Every later chapter in this page depends on these ideas — get them cold.

The sample space and events — with dice

Fix a running example we'll use throughout: roll two fair six-sided dice. Call them die 1 and die 2.

The sample space Ω is the set of all outcomes the experiment can produce. Here:

$$\Omega = \{(i,j) : i,j \in \{1,2,3,4,5,6\}\}$$
Every ordered pair — die 1 shows i, die 2 shows j. There are 6 × 6 = 36 equally likely outcomes.

An event is any subset of Ω — a collection of outcomes you want to lump together and ask "what is the chance of this?"

Event A
Sum equals 7 — the pairs (1,6),(2,5),(3,4),(4,3),(5,2),(6,1). Six outcomes.
Event B
Die 1 shows an even number — pairs where i ∈ {2,4,6}. Eighteen outcomes.
Event C
Both dice show the same face (doubles). Six outcomes: (1,1),…,(6,6).
Complement Ac
Sum is NOT 7. The other 30 outcomes.
The three axioms — probability in plain words

Probability is a function P that assigns a number to each event. Kolmogorov (1933) gave three rules that pin down every valid probability measure. Memorise the plain-words version; the formal version will follow naturally.

Axiom 1 — Non-negativity
No event has negative probability: P(A) ≥ 0 for every event A.
Axiom 2 — Normalisation
Something must happen: P(Ω) = 1. The total probability across all outcomes is 1.
Axiom 3 — Countable additivity
Disjoint events add up. If A and B share no outcomes (A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B). Extend to any countable collection of pairwise-disjoint events.

From just these three axioms you can derive everything: the complement rule, inclusion-exclusion, the union bound, conditional probability. They are not arbitrary — they are the minimal rules that make probability coherent.

$$P(A^c) = 1 - P(A)$$
Derived from Axiom 2 and Axiom 3: Ω = A ∪ Ac and they're disjoint, so 1 = P(A) + P(Ac).

Two dice — quick sanity checks:

  • P(sum = 7) = 6/36 = 1/6. Six favourable outcomes, 36 equally likely total.
  • P(sum ≠ 7) = 1 − 1/6 = 5/6. Complement rule.
  • P(A ∪ C) where A = {sum=7} and C = {doubles}: A ∩ C = ∅ (no double sums to 7), so P(A ∪ C) = 6/36 + 6/36 = 12/36 = 1/3.
Conditional probability — shrink the universe

You roll the two dice behind a screen. A friend peeks and tells you "die 1 is a 3." Knowing this, what is the probability the sum is 7?

Before the hint, 36 outcomes were possible. After the hint, only 6 outcomes survive — those where die 1 = 3: (3,1),(3,2),(3,3),(3,4),(3,5),(3,6). Of these six, exactly one has sum 7: (3,4). So the answer is 1/6.

That reasoning is the entire mechanism of conditional probability. The hint shrinks the universe from 36 outcomes to 6, and you re-do your calculation inside the smaller universe.

Left: full 6×6 grid of two-dice outcomes; right: the same grid with all rows except "die 1 = 3" greyed out, showing the 6 surviving outcomes and the one that satisfies sum = 7.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
P(A | B) = probability of A given B has occurred. Numerator: outcomes where both A and B happen. Denominator: outcomes where B happens. The ratio re-scales so B has probability 1 in the new, shrunken universe.

Dice example formally: A = {sum = 7}, B = {die 1 = 3}.

  • P(A ∩ B) = P({(3,4)}) = 1/36
  • P(B) = 6/36 = 1/6
  • P(A | B) = (1/36) / (1/6) = 1/6. Matches the counting argument.
📐 Conditional probability — the rule

Trigger: any question of the form "given X, what is the probability of Y?"

  1. Write down P(Y | X) = P(Y ∩ X) / P(X).
  2. Find both probabilities on the original (unconditional) sample space.
  3. Divide. Never redefine the sample space — just reweight it.

Never: confuse P(Y | X) with P(X | Y). These are generally not equal — that swap is the prosecutor's fallacy and costs people interviews.

The chain rule and why it matters

Rearranging the conditional probability formula gives the chain rule (also called the multiplication rule):

$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
The probability that both A and B occur equals the probability that B occurs times the conditional probability that A occurs given B. (Or the symmetric version with A and B swapped.)

This extends to any number of events:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2)\,\cdots$$
Decompose a joint probability by conditioning on successively more events. The order of conditioning is arbitrary — pick the order that makes each factor easy to compute.

Dice example: P(die 1 = 3 AND sum = 7) = P(sum = 7 | die 1 = 3) · P(die 1 = 3) = (1/6)(1/6) = 1/36. ✓

The chain rule is the backbone of probabilistic graphical models and language model factorisation. GPT-style models generate text by computing P(tokent | token1, …, tokent-1) for each position — that is literally the chain rule applied left-to-right.

Independence — when knowledge is useless

After two dice are rolled, does knowing die 1 = 3 tell you anything about whether die 2 = 5? No — the dice don't talk to each other. Formally:

$$P(A \cap B) = P(A)\,P(B)$$
A and B are independent if and only if their joint probability equals the product of their marginal probabilities. Equivalently: P(A | B) = P(A) — knowing B gives zero information about A.

Check with dice: A = {die 1 = 3}, B = {die 2 = 5}.

  • P(A) = 1/6, P(B) = 1/6.
  • P(A ∩ B) = P({(3,5)}) = 1/36 = (1/6)(1/6). ✓ Independent.

Contrast: A = {sum = 7}, B = {die 1 = 3}.

  • P(A) = 1/6, P(B) = 1/6.
  • P(A ∩ B) = 1/36 = (1/6)(1/6). Equal — so sum = 7 and die 1 = 3 are independent!
  • Sanity check: P(sum = 7 | die 1 = 3) = 1/6 = P(sum = 7). Knowing die 1 doesn't change the probability of sum 7? Yes — because die 2 is equally likely to be any of {1,…,6}, and exactly one of those completes a sum of 7.

Non-independent events: A = {sum = 7}, D = {die 1 = 1}.

  • P(D) = 1/6, P(A) = 1/6, P(A ∩ D) = P({(1,6)}) = 1/36.
  • P(A | D) = (1/36)/(1/6) = 1/6 = P(A). Still independent! (Every specific value of die 1 leaves the same 1-in-6 chance of sum 7.)

Try: A = {sum = 7}, E = {sum = 8}. P(A) = 6/36, P(E) = 5/36, P(A ∩ E) = 0 (a roll can't sum to both 7 and 8). Product = 6/36 · 5/36 ≠ 0. Not independent. They are, however, mutually exclusive — which leads directly to the big confusion.

⚠ Clears up: independence vs mutual exclusivity

This is the most commonly confused pair in basic probability. They sound similar but mean almost opposite things.

PropertyMutual exclusivityIndependence
MeaningA and B cannot both happen: A ∩ B = ∅A and B carry no information about each other: P(A|B) = P(A)
Joint probabilityP(A ∩ B) = 0P(A ∩ B) = P(A)·P(B)
Dice example{sum=7} and {sum=8} — can't both occur{die 1 = 3} and {die 2 = 5} — no information flows
Can they co-exist?Only if one has probability 0Yes, generically
Intuition failurePeople think "separate events" = independentPeople think independent = can't happen together

The killer insight: if P(A) > 0 and P(B) > 0, then mutually exclusive events are never independent. Knowing A occurred (P(A) > 0 worth of probability happened) guarantees B did not — so B's probability given A is 0, not P(B). Information was conveyed. They are maximally dependent.

One-liner to remember: "Mutually exclusive events are the most dependent events there are — if one happens, the other definitely didn't."

Conditional independence — a subtler but critical notion

A and B are conditionally independent given C when:

$$P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$$
Once you know C, knowing A gives no additional information about B. The conditioning event C "explains away" any dependence between A and B.

Naive Bayes classifiers assume that all features are conditionally independent given the class label. This is almost certainly false in reality, yet the model often works well in practice — understanding why requires conditional independence.

Dice micro-example: Let A = {die 1 is odd}, B = {die 2 is odd}, C = {sum is even}. Given that the sum is even, die 1 and die 2 must both be odd or both be even — knowing die 1 is odd forces die 2 to be odd. So A and B are not conditionally independent given C, even though they are marginally independent.

Law of total probability — case splitting

Suppose you want P(some event A) but it's hard to compute directly. The trick: find a partition of Ω — a set of mutually exclusive, exhaustive events B1, B2, …, Bk that cover everything — then compute P(A | Bi) separately for each case and average.

$$P(A) = \sum_{i=1}^{k} P(A \mid B_i)\,P(B_i)$$
Total probability of A = weighted average of its conditional probabilities across the partition, where the weights are the unconditional probabilities of each Bi. This works because the Bi cover the whole space with no overlap.
📐 P(weird event) — the case-split rule

Trigger: you want P(A) and direct counting is messy.

  1. Ask: "What would I wish I knew before computing P(A)?" Name that hidden variable B.
  2. List the possible values of B: B = b1, b2, …
  3. Compute P(A | B = bi) for each value — these should be easy.
  4. Weight by P(B = bi) and sum: P(A) = Σ P(A|B=bi)·P(B=bi).

Never: skip the partition check. If the Bi are not exhaustive, you'll undercount probability mass.

Dice worked example: A = {sum is prime}. Primes in range 2–12: {2, 3, 5, 7, 11}. Partition on die 1 value (Bi = {die 1 = i}):

  • Die 1 = 1: need die 2 ∈ {1,2,4,6} (sum ∈ {2,3,5,7}). Count: 4. P(A|die1=1) = 4/6.
  • Die 1 = 2: need die 2 ∈ {1,3,5} (sum ∈ {3,5,7}). Count: 3. P(A|die1=2) = 3/6.
  • Die 1 = 3: need die 2 ∈ {2,4} (sum ∈ {5,7}). Count: 2. P(A|die1=3) = 2/6.
  • Die 1 = 4: need die 2 ∈ {1,3} (sum ∈ {5,7}). Count: 2. P(A|die1=4) = 2/6.
  • Die 1 = 5: need die 2 ∈ {2,6} (sum ∈ {7,11}). Count: 2. P(A|die1=5) = 2/6.
  • Die 1 = 6: need die 2 ∈ {1,5} (sum ∈ {7,11}). Count: 2. P(A|die1=6) = 2/6.

P(A) = (1/6)(4+3+2+2+2+2)/6 = 15/36 = 5/12. (We can verify by counting directly: 15 prime-sum pairs out of 36.)

Worked: at least one six in 4 rolls — and the Chevalier de Méré

Direct counting of "at least one six" is error-prone — you'd have to subtract overlaps. The complement approach is the standard tool.

Complement rule: P(at least one 6 in 4 rolls) = 1 − P(no 6 in 4 rolls).

Each roll is independent. P(no 6 on one roll) = 5/6. So:

$$P(\text{at least one 6 in 4 rolls}) = 1 - \left(\frac{5}{6}\right)^4 = 1 - \frac{625}{1296} \approx 0.518$$
Slightly better than even odds. The four rolls are independent so the probability of no six in all four multiplies as (5/6) four times.

The Chevalier de Méré story (3 sentences): In 1654, the French gambler Antoine Gombaud (Chevalier de Méré) bet that he'd see at least one double-six in 24 rolls of two dice, reasoning it should be as likely as one six in four rolls of one die. He was losing money and asked Blaise Pascal why. Pascal showed him that P(at least one double-six in 24 rolls) = 1 − (35/36)²⁴ ≈ 0.491 — just under 50% — while the analogous single-die game had probability ≈ 0.518, and the two events are not equivalent because the probability of a double-six (1/36) differs from the probability of a six (1/6) in a way that doesn't scale linearly with the number of rolls.

$$P(\text{at least one } (6,6) \text{ in 24 rolls}) = 1 - \left(\frac{35}{36}\right)^{24} \approx 0.491$$
35/36 is the probability of NOT getting double-six on one roll of two dice. Twenty-four independent rolls, each with (35/36) probability of failure.

This correspondence between Pascal and Fermat founded modern probability theory. The lesson for interviews: when you see "at least one," flip to the complement. The complement is almost always one clean product.

📌 "At least one" — always use the complement

P(at least one event happens in n trials) = 1 − P(event fails on all n trials) = 1 − (1−p)^n when trials are independent with per-trial probability p. Never try to add up the direct cases — the overlaps will kill you.

✓ Remember
  • Sample space = all possible outcomes; event = subset of sample space. Three axioms: non-negativity, normalisation, countable additivity.
  • P(A | B) = P(A ∩ B) / P(B) — shrink the universe to outcomes where B occurred.
  • Independence: P(A ∩ B) = P(A)P(B). Mutual exclusivity: P(A ∩ B) = 0. Two positive-probability events cannot be both. Mutually exclusive events are maximally dependent.
  • Law of total probability: partition on what you wish you knew, average the conditionals. "At least one" → complement.
◆ Interview probe

"Are independent events mutually exclusive?" — the trap. Answer: no, if both have positive probability. State the definition of each, give the two-dice example, and note that mutual exclusivity implies dependence (not independence).

"What is conditional probability geometrically?" — describe the shrink-the-universe picture: you restrict the sample space to outcomes where the condition holds, then re-normalise.

Tricky interview questions — chapter 03
Q1. State the three axioms of probability and derive the complement rule from them.
Axiom 1: P(A) ≥ 0 for all events A. Axiom 2: P(Ω) = 1. Axiom 3: if A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B). Derivation: write Ω = A ∪ Ac. These two sets are disjoint and their union is Ω. By Axiom 3, P(Ω) = P(A) + P(Ac). By Axiom 2, P(Ω) = 1. Therefore P(Ac) = 1 − P(A). The proof is two lines; the skill is knowing which axioms to invoke.
Q2. You roll two fair dice. What is P(sum = 8 | die 1 is even)?
Let A = {sum = 8}, B = {die 1 is even}. P(B) = 18/36 = 1/2. The pairs (i, j) with die 1 even and sum 8 are: (2,6),(4,4),(6,2) — three pairs. P(A ∩ B) = 3/36 = 1/12. P(A | B) = (1/12) / (1/2) = 1/6. The direct approach: given die 1 ∈ {2,4,6} (18 outcomes), count those with sum 8: (2,6),(4,4),(6,2) = 3 outcomes. 3/18 = 1/6. Both methods agree. The key step is clearly identifying what outcomes survive the conditioning event.
Q3. Are {sum = 7} and {die 1 = 4} independent? Prove it.
A = {sum = 7}: outcomes (1,6),(2,5),(3,4),(4,3),(5,2),(6,1), so P(A) = 6/36 = 1/6. B = {die 1 = 4}: six outcomes, P(B) = 6/36 = 1/6. A ∩ B = {(4,3)}, P(A ∩ B) = 1/36. Product P(A)·P(B) = (1/6)(1/6) = 1/36. Equal — so yes, independent. Intuition: conditioning on die 1 tells you die 2 must be 7 − die 1 to complete a sum of 7. Whatever die 1 is, there's exactly one die 2 value that works, so P(sum=7 | die 1 = k) = 1/6 for any k. This is why die 1 gives no information about whether the sum is 7.
Q4. If A and B are mutually exclusive and both have positive probability, are they independent? Why or why not?
No — they are maximally dependent, not independent. Independence requires P(A ∩ B) = P(A)·P(B). But mutual exclusivity means P(A ∩ B) = 0. If P(A) > 0 and P(B) > 0, then P(A)·P(B) > 0 ≠ 0, so the equality fails. Intuitively: knowing A occurred tells you immediately that B did not — that's a massive information transfer, the opposite of independence. "Mutually exclusive events with positive probability are always dependent."
Q5. Use the law of total probability to find P(sum of two dice is odd).
Partition on die 1. If die 1 is odd (probability 1/2), sum is odd iff die 2 is even (probability 1/2): contribution = (1/2)(1/2) = 1/4. If die 1 is even (probability 1/2), sum is odd iff die 2 is odd (probability 1/2): contribution = (1/2)(1/2) = 1/4. Total = 1/4 + 1/4 = 1/2. Alternatively by direct count: 18 of 36 outcomes have an odd sum (odd+even or even+odd). The law-of-total-probability route is worth showing because it generalises to cases where direct counting is infeasible.
Q6. What is P(at least one 6 in 6 rolls of a fair die)? Is this greater than, equal to, or less than 2/3?
P(at least one 6 in 6 rolls) = 1 − (5/6)^6 = 1 − 15625/46656 ≈ 1 − 0.335 = 0.665. This is slightly less than 2/3 ≈ 0.667. The common (wrong) intuition says "six rolls, one-sixth chance each, so probability 1" — that double-counts. The correct answer uses the complement. The value ≈ 0.665 is just below 2/3, a fact that surprised many historical gamblers who incorrectly thought 6 rolls guaranteed a six.
Q7. A bag has 3 red and 2 blue marbles. You draw two without replacement. What is P(both red)?
Use the chain rule: P(both red) = P(first red) · P(second red | first red) = (3/5) · (2/4) = (3/5)(1/2) = 3/10. After drawing one red marble, 4 marbles remain: 2 red, 2 blue, so P(second red | first red) = 2/4 = 1/2. The chain rule is the right framework here because the draws are dependent — the composition changes after each draw. Alternatively: C(3,2)/C(5,2) = 3/10 using counting. Both methods give 3/10.
Q8. Define conditional independence. Give an example where A and B are marginally dependent but conditionally independent given C.
A and B are conditionally independent given C if P(A ∩ B | C) = P(A|C)·P(B|C), or equivalently P(A|B,C) = P(A|C). Classic example: let C = whether a student studies hard (yes/no), A = student passes the exam, B = student gets an A on the homework. Marginally, A and B are positively correlated (students who do well on homework tend to pass exams). But given that the student studies hard (C = yes), A and B may be roughly independent — studying explains both outcomes, removing the correlation. In Naive Bayes: features (words in an email) are assumed conditionally independent given the class label (spam vs. not spam). The class label "explains" the correlations between words.
Q9. Suppose P(A) = 0.4, P(B) = 0.5, P(A | B) = 0.6. Find P(A ∩ B), P(B | A), P(A ∪ B).
P(A ∩ B) = P(A|B)·P(B) = 0.6 × 0.5 = 0.30. P(B|A) = P(A ∩ B)/P(A) = 0.30/0.40 = 0.75. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.4 + 0.5 − 0.3 = 0.6. Check independence: P(A)·P(B) = 0.2 ≠ 0.3 = P(A ∩ B), so A and B are not independent — knowing B makes A more likely (0.6 vs 0.4 base rate). These three derived quantities — joint, reverse conditional, union — appear constantly in interview setups.
Q10. The Chevalier de Méré believed P(≥1 six in 4 rolls) = P(≥1 double-six in 24 rolls). Why is this wrong, and what are the actual values?
P(≥1 six in 4 rolls) = 1 − (5/6)^4 ≈ 0.518. P(≥1 double-six in 24 rolls) = 1 − (35/36)^24 ≈ 0.491. The de Méré reasoning was: "one die has 1/6 chance of a six; two dice have 1/36 chance of double-six, which is 6× smaller; so play 6× as many rolls — 24 — to compensate." This is wrong because the complement (5/6 vs 35/36) doesn't scale symmetrically. Specifically, 4×(1/6) = 4/6 but 1−(5/6)^4 ≠ 4/6, and 24×(1/36) = 24/36 but 1−(35/36)^24 ≠ 24/36. The expected number of successes scales linearly, but the probability of at least one does not. This correspondence between Pascal and Fermat in 1654 effectively founded probability theory.
Q11. In a machine learning context, where does the chain rule of probability appear?
The chain rule P(x₁,…,xₙ) = P(x₁)·P(x₂|x₁)·P(x₃|x₁,x₂)·… is the mathematical foundation of autoregressive language models (GPT, etc.). Each token is generated by computing P(token_t | all previous tokens), and the joint probability of a sequence is the product of these conditional probabilities. Bayesian networks factorise a joint distribution using the chain rule, with the graph structure encoding which variables we condition on. Naive Bayes classifiers use the chain rule P(y,x₁,…,xₙ) = P(y)·∏P(xᵢ|y) — the chain rule plus conditional independence. Knowing the chain rule cold lets you speak fluently about all of these.
Q12. P(A) = 0.3 and P(B) = 0.4. What is the range of possible values for P(A ∪ B)?
P(A ∪ B) = P(A) + P(B) − P(A ∩ B). We need the range of P(A ∩ B). Minimum: P(A ∩ B) ≥ max(0, P(A)+P(B)−1) = max(0, 0.7−1) = 0 — occurs when A and B are disjoint (mutually exclusive). Maximum: P(A ∩ B) ≤ min(P(A), P(B)) = 0.3 — occurs when A ⊆ B. So P(A ∪ B) ranges from 0.3+0.4−0.3 = 0.4 (when A ⊆ B) to 0.3+0.4−0 = 0.7 (when A and B are disjoint). The range is [0.4, 0.7]. This Fréchet bound question tests whether candidates understand that the joint probability has constraints even without knowing the dependence structure.
TL;DR

Probability is defined by three axioms; everything else is derived. Conditioning shrinks the universe — P(A|B) = P(A∩B)/P(B). Independence means information doesn't flow — P(A∩B) = P(A)P(B). Mutual exclusivity (P(A∩B)=0) is the opposite of independence when both events have positive probability. The law of total probability says: partition on what you wish you knew, then average. "At least one" problems flip to the complement.

04
PART I · PROBABILITY

Bayes' rule, done properly

🎯Bayes' rule is just two conditional probabilities set equal — the surprise is how much it fixes your intuition about test results.

Bayes' rule converts "what is the probability of a positive test given the disease?" into "what is the probability of the disease given a positive test?" — the question you actually care about. This chapter derives the rule from first principles, builds the 10,000-person counting table that makes base-rate neglect impossible, introduces the odds form for fast mental arithmetic, and connects the whole machinery to Naive Bayes, MAP estimation, and Bayesian A/B testing.

Derivation — two lines from conditional probability

From Chapter 3, we know that for any two events A and B:

$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
The joint probability P(A and B) can be written two ways using the chain rule — either condition on B first or on A first. Both expressions equal the same thing: P(A ∩ B).

Set the two right-hand sides equal and divide both sides by P(B):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This is Bayes' rule. P(A) is the prior — your belief before seeing B. P(B|A) is the likelihood — how probable the evidence B is if A were true. P(B) is the marginal likelihood — how probable B is overall (often expanded via total probability). P(A|B) is the posterior — your belief after seeing B.

There is no magic here — it is literally just rearranging the chain rule. The insight is in what you choose to call A and B.

Prior P(A)
Your belief about A before observing any evidence. In medical testing: the prevalence of the disease in the population.
Likelihood P(B | A)
How probable is the observed evidence B, assuming A is true? In medical testing: the true positive rate (sensitivity) of the test.
Marginal likelihood P(B)
The unconditional probability of evidence B, averaging over all possible states of A. Computed via the law of total probability: P(B) = P(B|A)P(A) + P(B|Ac)P(Ac).
Posterior P(A | B)
Your updated belief about A after observing B. This is the answer you wanted.
The medical test — why your intuition is wrong

A disease affects 1% of the population. A test for this disease has:

  • Sensitivity (true positive rate): 99% — if you have the disease, the test is positive 99% of the time.
  • Specificity (true negative rate): 95% — if you don't have the disease, the test is negative 95% of the time. Equivalently, the false positive rate is 5%.

You test positive. What is the probability you have the disease?

Most people, including many physicians in studies, say "about 99%" — because the test is "99% accurate." The actual answer is about 17%. Here is why.

The 10,000-person counting table — no algebra required

Instead of manipulating fractions, imagine 10,000 people drawn randomly from the population and trace them through the test.

Test positive (+)Test negative (−)Total
Have disease 99
true positives
1
false negatives
100 (1% of 10,000)
No disease 495
false positives
9,405
true negatives
9,900 (99% of 10,000)
Total 594 9,406 10,000

How to read the table:

  • 100 people have the disease. The test catches 99% of them → 99 true positives, 1 false negative.
  • 9,900 people don't have the disease. The test wrongly flags 5% of them → 495 false positives, 9,405 true negatives.
  • Total positive tests: 99 + 495 = 594.
  • Of these 594 positive tests, only 99 are real disease cases.
$$P(\text{disease} \mid +) = \frac{99}{594} \approx 0.167$$
Of everyone who tests positive, count how many actually have the disease. 99 true positives out of 594 total positives. About 1 in 6. Not 99 in 100.

The base-rate punchline: The 495 false positives swamp the 99 true positives because the disease is rare. A positive test shifts your probability from 1% (the prior) to ~17% (the posterior) — a 17× increase, but still well short of certainty. If the disease had 10% prevalence, the posterior would be ~69%. The prior matters enormously.

The 10,000-person 2×2 table, annotated to show how prior prevalence and false-positive rate jointly determine the posterior probability of disease given a positive test result.
⚠ Clears up: base-rate neglect

The error: focusing only on the sensitivity (99%) and ignoring that the disease is rare (1%). People mentally replace "P(disease | +)" with "P(+ | disease)" — an instance of the prosecutor's fallacy (confusing the probability of evidence given guilt with the probability of guilt given evidence).

Why it happens: the likelihood P(+ | disease) = 0.99 is a vivid number. The prior P(disease) = 0.01 is abstract. Human minds anchor on the vivid number and fail to multiply by the base rate.

The fix: always build the 2×2 table with a large round number (10,000 works well). Fill in the four cells using the prior and the two error rates. Read off the conditional probabilities directly from the cell counts. No formula, no algebra, no errors.

ML relevance: this exact phenomenon appears in anomaly detection, fraud detection, and rare-event classifiers. When the positive class is rare (0.1% fraud rate), even a 99% accurate model produces mostly false positives among its flagged predictions. Always check precision, not just recall or accuracy.

📐 "Given a positive test" — the 10,000-person rule

Trigger: any question giving you a base rate, a sensitivity (true positive rate), and a false positive rate, asking for P(condition | evidence).

  1. Write down N = 10,000 (or any large round number).
  2. Fill the disease row: N × prevalence people have the disease. Of these, sensitivity × (N × prevalence) test positive.
  3. Fill the no-disease row: N × (1 − prevalence) people don't. Of these, false positive rate × N × (1 − prevalence) test positive.
  4. Add the two positive cells to get total positives.
  5. Divide true positives by total positives. Done.

Never: try to do this in your head with fractions and give up. The table is foolproof and fast to sketch on a whiteboard.

The odds form — Bayes as a multiplier

There is a cleaner version of Bayes' rule that makes mental updating instant. Define the odds of an event A as:

$$\text{Odds}(A) = \frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)}$$
Odds converts a probability into a ratio. P = 1/4 → odds = 1:3. P = 1/2 → odds = 1:1. P = 2/3 → odds = 2:1. Odds greater than 1 means more likely than not. The inverse is P(A) = odds/(1+odds).

Now write Bayes' rule for A vs Ac both conditioned on evidence B:

$$\underbrace{\frac{P(A \mid B)}{P(A^c \mid B)}}_{\text{posterior odds}} = \underbrace{\frac{P(B \mid A)}{P(B \mid A^c)}}_{\text{likelihood ratio}} \times \underbrace{\frac{P(A)}{P(A^c)}}_{\text{prior odds}}$$
Posterior odds = likelihood ratio × prior odds. The P(B) denominator cancels. The likelihood ratio (also called the Bayes factor) measures how much more probable the evidence is under A than under Ac. Multiply the prior odds by this ratio to get the posterior odds.

Medical test in odds form:

  • Prior odds = P(disease)/P(no disease) = 0.01/0.99 ≈ 1:99.
  • Likelihood ratio = P(+|disease)/P(+|no disease) = 0.99/0.05 = 19.8.
  • Posterior odds = 19.8 × (1/99) ≈ 0.2 = 1:5.
  • Posterior probability = 1/(1+5) ≈ 0.167. ✓ Matches the table.

The odds form is the fast mental tool: the likelihood ratio is a single number that tells you how strongly the evidence shifts your belief. A likelihood ratio of 1 means the evidence is neutral. Greater than 1 means evidence supports A. Less than 1 means evidence supports Ac.

LR = 1
Evidence is neutral — posterior odds = prior odds.
LR = 10
Evidence is strongly in favour of A — posterior odds are 10× the prior.
LR = 0.1
Evidence is strongly against A — posterior odds are 1/10 the prior.
LR = 19.8 (our test)
A positive result is 19.8× more likely if you have the disease than if you don't. Strong evidence — but the disease is so rare that the posterior is still only 17%.
Sequential updating — multiply likelihood ratios

Suppose you run a second, independent test on the same patient (a different biomarker with LR = 25 for a second positive). You don't need to redo the full Bayes calculation from scratch. The posterior odds after the first test become the new prior odds, and you multiply by the second likelihood ratio.

$$\text{Posterior odds}_{2} = LR_2 \times LR_1 \times \text{Prior odds}$$
Each independent piece of evidence multiplies the current odds by its likelihood ratio. The order doesn't matter. This is the multiplicative version of "update your beliefs as you see data."

After test 1 (LR₁ = 19.8): odds ≈ 1:5, probability ≈ 17%.

After test 2 (LR₂ = 25): odds = 25 × (1/5) = 5:1, probability ≈ 5/6 ≈ 83%.

Two positive tests on a rare disease go from 17% to 83% — still not certainty, but now you'd treat.

Spam filter as sequential Bayes: A Naive Bayes spam classifier does exactly this. Each word in an email is a "test." The prior P(spam) is the base rate of spam in your inbox. Each word w updates the odds by a likelihood ratio P(w|spam)/P(w|not spam). Words like "Viagra" have very high LR (much more common in spam); words like "meeting" have LR ≈ 1 (equally common in both). The model multiplies all these likelihood ratios together, which in log space is a sum — and that's why Naive Bayes classifiers produce a linear function of log word counts.

Where this becomes ML: Naive Bayes, MAP estimation, Bayesian A/B

Bayes' rule is not just a medical-testing curiosity — it is the backbone of three major ML concepts that appear in nearly every interview at a top lab.

Naive Bayes classifiers. Given a document with words w1, …, wn, the classifier computes:

$$P(\text{class} \mid w_1,\ldots,w_n) \propto P(\text{class}) \prod_{i=1}^{n} P(w_i \mid \text{class})$$
The prior P(class) is the base rate of that class. Each factor P(wᵢ | class) is a likelihood ratio contribution. The "naive" assumption is that words are conditionally independent given the class — wrong in reality but surprisingly effective in practice. The product of likelihoods is exactly the sequential Bayes update from above, applied to all n words at once.

Training: count word frequencies per class. Inference: multiply the prior by each word's likelihood. Add-one (Laplace) smoothing prevents zero probabilities for unseen words. In log space the product becomes a sum — making Naive Bayes a linear classifier in log word counts.

MAP estimation. Maximum A Posteriori estimation picks the parameter θ that maximises the posterior:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \; P(\theta \mid \text{data}) = \arg\max_\theta \; \bigl[P(\text{data} \mid \theta) + \log P(\theta)\bigr]$$
The log posterior = log likelihood + log prior. Maximising it is identical to maximising the likelihood but with an extra penalty term from the prior. A Gaussian prior on θ gives the log prior = −‖θ‖²/(2σ²), which is exactly L2 regularisation (ridge). A Laplace prior gives −|θ|/b, which is L1 regularisation (LASSO). Regularisation IS Bayesian MAP estimation with a specific prior.

Concrete example: logistic regression with L2 regularisation is MAP estimation under a Gaussian prior on the weights. The strength λ of regularisation equals 1/σ² where σ² is the prior variance. Stronger prior (smaller σ) → higher λ → more regularisation → weights shrunk harder toward zero.

Bayesian A/B testing. Instead of a p-value, compute P(variant B is better than variant A | observed data) directly. Start with a prior over conversion rates (e.g., Beta(1,1) for ignorance), observe conversions, update to a Beta posterior, and compute the probability that the posterior draw from B exceeds the draw from A. There is no fixed sample size, no peeking problem in the classical sense — you just read off posterior probabilities. Chapter 14 covers this in full; the Bayes machinery here is the foundation.

⚠ Clears up: P(A|B) vs P(B|A) — the prosecutor's fallacy

The error: swapping a conditional probability with its reverse. "The test is 99% accurate" (P(+|disease) = 0.99) is not the same as "99% chance you have the disease given a positive test" (P(disease|+)). These two quantities can be wildly different when the prior is lopsided.

In court: "The probability that an innocent person matches the DNA evidence is 1 in a million" does NOT mean "the probability this defendant is innocent is 1 in a million." If there are 10 million people in the city, you'd expect 10 innocent matches — the prior of innocence (nearly 1) has to be multiplied through.

In ML: precision ≠ recall. A model with 95% recall (P(predict positive | actually positive)) can still have low precision (P(actually positive | predict positive)) when the positive class is rare. Always check both, and draw the 2×2 confusion matrix.

◆ Interview probe

"Walk me through Bayes' rule on a concrete example." — Don't just state the formula. Build the 2×2 table with 10,000 people, read off the answer, then derive it algebraically to show you understand both representations. State what the prior, likelihood, and posterior are in your specific problem.

"Why is regularisation equivalent to MAP?" — State that the log posterior = log likelihood + log prior. Show that L2 regularisation emerges from a Gaussian prior and L1 from a Laplace prior. Give λ = 1/σ² explicitly. This is a Staff-level question that separates people who memorised regularisation from those who understand it.

"Why does Naive Bayes work despite the independence assumption being false?" — The classification boundary only requires that the posterior ranking of classes is correct, not that the posterior probabilities are calibrated. Even with wrong probabilities, the argmax often picks the right class. Naive Bayes gives poorly calibrated probabilities but surprisingly good decisions.

✓ Remember
  • Bayes' rule = two forms of the joint probability set equal; posterior ∝ likelihood × prior.
  • Base-rate neglect: low prevalence + non-zero false positive rate → most positives are false. Always build the table.
  • Odds form: posterior odds = likelihood ratio × prior odds. LR is the key number — how much more probable is the evidence under H than under ¬H.
  • Sequential updating: multiply likelihood ratios. Log odds form turns this into a sum — that's Naive Bayes and the intuition behind log-loss.
  • MAP = MLE + log prior = regularised MLE. Gaussian prior → L2; Laplace prior → L1.
📐 Any "given a positive test / given the evidence" question — the rule

Trigger: you see a base rate (prevalence, prior), a sensitivity or true positive rate, and a false positive rate; you must find P(condition | observed evidence).

  1. Write N = 10,000.
  2. Diseased row: N × prevalence people; of those, sensitivity × that many test positive.
  3. Healthy row: N × (1 − prevalence) people; of those, false positive rate × that many test positive.
  4. Posterior = (true positives) / (true positives + false positives).
  5. State the answer and note: "the low posterior is driven by the low prior, not test quality."

Bonus: convert to odds form to verify: LR = sensitivity / false positive rate; posterior odds = LR × prior odds.

Never: quote the sensitivity as the answer (P(+|disease) ≠ P(disease|+)); never forget to state the prior.

TL;DR

Bayes' rule is rearranged conditional probability: posterior ∝ likelihood × prior. The 10,000-person table makes base-rate neglect impossible — build it whenever prevalence appears. The odds form (posterior odds = LR × prior odds) is the fast mental tool; sequential Bayes multiplies likelihood ratios and underlies Naive Bayes classifiers. MAP estimation IS regularised MLE: Gaussian prior → L2, Laplace → L1. Every "given a positive test" question, every spam filter, and every regularised model in ML is running Bayes' rule under the hood.

Tricky interview questions — chapter 04
Q1. A disease has 0.5% prevalence. A test has 90% sensitivity and 8% false positive rate. You test positive. What is P(disease | +)?
Use the 10,000-person table. Diseased: 50 people; 90% test positive → 45 true positives, 5 false negatives. Healthy: 9,950 people; 8% test positive → 796 false positives. Total positives: 45 + 796 = 841. Posterior = 45/841 ≈ 5.3%. Despite 90% sensitivity, the posterior is just 5% because prevalence is low and the false positive rate is high relative to the base rate. The likelihood ratio is 0.9/0.08 = 11.25; prior odds = 0.005/0.995 ≈ 1/199; posterior odds ≈ 11.25/199 ≈ 0.0565; posterior ≈ 5.3%. Both methods agree.
Q2. What is the difference between P(A | B) and P(B | A)? Give an example where confusing them leads to a wrong answer.
P(A|B) is the probability of A given B is observed; P(B|A) is the reverse. They can be radically different. Classic example: P(positive test | disease) = 0.99 (sensitivity), but P(disease | positive test) = 0.17 (as computed above) when prevalence is 1%. The confusion is called the prosecutor's fallacy or the base-rate fallacy. In court: "the chance an innocent person matches the DNA is 1 in a million" is P(match | innocent) — it says nothing directly about P(innocent | match) without knowing how many potential suspects exist. In ML: high recall P(predict 1 | actually 1) is not the same as high precision P(actually 1 | predict 1).
Q3. Explain the odds form of Bayes' rule and why it is useful.
The odds form states: posterior odds = likelihood ratio × prior odds, where the likelihood ratio (LR) is P(evidence | H) / P(evidence | ¬H). The P(evidence) denominator cancels. The LR is the key number — it tells you how much the evidence shifts the odds, independent of the prior. Why useful: (1) mental arithmetic — multiply a single number by current odds; (2) sequential updates — multiply LRs from independent tests; (3) the sign tells direction immediately (LR > 1 supports H, LR < 1 opposes H). For the medical test: LR = 0.99/0.05 = 19.8, meaning a positive result makes disease ~20× more odds-likely than before. Still only 17% posterior because the prior was so low (1%).
Q4. You run two independent tests on a patient. Test 1: LR = 15 (positive). Test 2: LR = 0.2 (negative). Prior prevalence is 2%. What is the posterior probability of disease?
Prior odds = 0.02/0.98 ≈ 0.0204. After test 1 (positive, LR = 15): posterior odds = 15 × 0.0204 = 0.306. After test 2 (negative, LR = 0.2): posterior odds = 0.2 × 0.306 = 0.0612. Posterior probability = 0.0612 / (1 + 0.0612) ≈ 5.8%. Key insight: a negative result on test 2 (LR = 0.2) means the evidence is 5× less consistent with disease than with no disease — it partially undoes the boost from test 1. The order of updates does not matter; the final odds are the same either way.
Q5. How does MAP estimation relate to regularised maximum likelihood? Be concrete about L1 and L2.
MAP finds the mode of the posterior: θ̂_MAP = argmax P(θ|data) = argmax [log P(data|θ) + log P(θ)]. The first term is the log likelihood (what MLE maximises). The second term is the log prior, which acts as a regulariser. A Gaussian prior P(θ) ∝ exp(−‖θ‖²/(2σ²)) gives log P(θ) = −‖θ‖²/(2σ²) + const, which is exactly L2 regularisation with λ = 1/σ². A Laplace prior P(θ) ∝ exp(−|θ|/b) gives log P(θ) = −|θ|/b + const, which is L1 regularisation (LASSO) with λ = 1/b. Therefore: (1) L2 regularisation = MAP with Gaussian prior (shrinks weights toward zero, never exactly zero); (2) L1 regularisation = MAP with Laplace prior (promotes sparsity — the Laplace prior has a sharp peak at zero that actually forces some weights to exactly zero). Stronger regularisation = stronger prior = less data-driven = more shrinkage.
Q6. Why is Naive Bayes called "naive," and why does it work anyway despite the violated assumption?
It is called naive because it assumes features (words, pixels, attributes) are conditionally independent given the class: P(w₁,…,wₙ|class) = ∏ P(wᵢ|class). This is almost always false — "machine" and "learning" co-occur far more than independence would predict. Why it works: the classifier only needs to correctly rank the posterior probabilities across classes (pick the argmax), not to get the posteriors right numerically. The decision boundary can still be correct even when the individual probabilities are badly off. This is why Naive Bayes is a strong baseline for text classification despite the wrong assumption. However, it produces badly calibrated probability estimates — do not use raw Naive Bayes probabilities as confidence scores without post-hoc calibration (e.g., Platt scaling).
Q7. A spam filter sees the word "Viagra" in an email. The prior P(spam) = 0.3. P("Viagra"|spam) = 0.15, P("Viagra"|not spam) = 0.001. What is P(spam | "Viagra" appears)?
Prior odds = 0.3/0.7 ≈ 0.4286. LR = P("Viagra"|spam) / P("Viagra"|not spam) = 0.15/0.001 = 150. Posterior odds = 150 × 0.4286 ≈ 64.3. Posterior probability = 64.3/(1 + 64.3) ≈ 98.5%. A single word with LR = 150 is enough to almost certainly classify the email as spam, even though the prior spam rate was just 30%. This illustrates the power of high-likelihood-ratio features: they dominate the posterior even against a mild prior. In log space: log posterior odds = log(150) + log(0.4286) ≈ 5.01 + (−0.847) ≈ 4.16, giving sigmoid(4.16) ≈ 98.5%.
Q8. What happens to the posterior P(disease | +) as the false positive rate approaches 0? As it approaches 1?
As FPR → 0: the denominator (true positives + false positives) approaches the true positive count alone, so P(disease|+) → 1. A perfect specificity test rules in the disease with certainty on a positive — no healthy person ever tests positive, so any positive is genuine. As FPR → 1: every healthy person also tests positive, so the test gives no information about disease status. The positive pool is dominated by healthy people (since they are the vast majority when prevalence is low), so P(disease|+) → the prevalence itself — the test told you nothing. This is a useful sanity check: a worthless test (FPR = sensitivity, i.e., LR = 1) leaves the posterior equal to the prior.
Q9. An interviewer asks: "You detect an anomaly in our fraud system. The fraud rate is 0.05%, the detector has 95% sensitivity, and 1% FPR. Should you investigate this flag?" Walk through the reasoning.
Build the table for N = 100,000 transactions. Fraudulent: 50 (0.05% of 100k); 95% caught → 47.5 ≈ 48 true positives. Legitimate: 99,950; 1% flagged → 999.5 ≈ 1000 false positives. Precision = 48 / (48 + 1000) ≈ 4.4%. So about 1 in 22 flags is a real fraud. Whether to investigate depends on the cost structure: if investigating costs \$5 and catching fraud saves \$500, the expected value of investigating is 0.044 × \$500 − \$1 × \$5 = \$22 − \$5 = \$17 positive — investigate. If investigating costs \$50, EV = \$22 − \$50 = −\$28 — don't. The Bayes machinery gives P(fraud|flag) = 4.4%; the decision then folds in costs. Always present the precision number AND the cost argument. Reporting "95% accurate" to the business is dangerously misleading.
Q10. Derive why, in the odds form of Bayes' rule, the marginal likelihood P(B) cancels out.
Write Bayes' rule for A and for Aᶜ, both conditioned on evidence B: P(A|B) = P(B|A)P(A)/P(B) and P(Aᶜ|B) = P(B|Aᶜ)P(Aᶜ)/P(B). Divide the first equation by the second: P(A|B)/P(Aᶜ|B) = [P(B|A)P(A)] / [P(B|Aᶜ)P(Aᶜ)]. The P(B) denominators cancel (they are the same denominator). The result is posterior odds = LR × prior odds. This cancellation is why the odds form is so convenient: you never need to compute the normalising constant P(B), which requires summing over all possible states of A — sometimes an expensive integral in continuous models.
Q11. (Hard) A Bayesian A/B test observes 80 conversions out of 200 visitors for variant A, and 90 out of 200 for variant B. Priors are Beta(1,1) for both. What are the posteriors, and how do you compute P(B beats A)?
Beta-Binomial conjugacy: posterior after observing k successes out of n with Beta(α,β) prior is Beta(α+k, β+n−k). Variant A: prior Beta(1,1) + 80 successes, 120 failures → posterior Beta(81,121). Variant B: prior Beta(1,1) + 90 successes, 110 failures → posterior Beta(91,111). To compute P(θ_B > θ_A): draw many samples (e.g., 100,000) from each posterior distribution and count the fraction where the B draw exceeds the A draw. Analytically, this is an integral with no closed form for general Beta parameters, but numerically trivial. Mean of A's posterior: 81/(81+121) = 81/202 ≈ 40.1%. Mean of B's posterior: 91/202 ≈ 45.0%. The posteriors overlap substantially, so P(B>A) will be well below 100% — probably around 80-85%, meaning the evidence is suggestive but not definitive. The key advantage over frequentist testing: you get a direct probability statement about which variant is better, not just a binary reject/fail-to-reject decision, and you can inspect the distribution at any time without inflating false positive rates by a fixed-horizon design.
Q12. (Brutal) How does the likelihood ratio relate to log-loss in binary classification? Derive the connection.
In binary classification, the log-loss (cross-entropy) for a single example with true label y ∈ {0,1} and model probability p = P(y=1|x) is: L = −y log p − (1−y) log(1−p). The model computes log p/(1−p) = log(posterior odds of positive class). By the odds form of Bayes: log posterior odds = log LR + log prior odds. The logistic function σ(a) = 1/(1+e^{−a}) converts log odds to probability. So training to minimise log-loss is equivalent to training the model's log posterior odds to match the true log posterior odds — which requires learning the correct likelihood ratios for each feature (or combination of features). In a logistic regression, the weight w_j on feature x_j approximates log P(x_j | y=1) / P(x_j | y=0), i.e., the log LR of that feature — this is exactly the Naive Bayes connection. Logistic regression learns the discriminative LRs directly from data without the conditional independence assumption, which is why it tends to outperform Naive Bayes when those assumptions are violated.
05
PART II · RANDOM VARIABLES

Random variables, PMF / PDF / CDF

🎯A random variable is just a number attached to each outcome — the mathematical hook that lets us do arithmetic on chance.

Probability is about events; random variables let us assign numbers to those events so we can compute means, variances, and expectations. This chapter builds the three universal descriptors — PMF (discrete), PDF (continuous), and CDF (both) — and shows how to simulate any distribution in four lines of Python. These objects appear in every ML model: loss functions are expectations, activation outputs are random variables, sampling from a neural network posterior uses inverse-CDF.

What is a random variable?

Roll a fair six-sided die. The sample space is Ω = {1,2,3,4,5,6}. Define X(ω) = ω. That's it — X is a function from outcomes to real numbers. Nothing mysterious.

Now define Y = 1 if X is even, 0 otherwise. Y is a different random variable on the same experiment. One experiment, infinitely many possible RVs.

Discrete RV
Takes values in a countable set {0,1,2,…}. Described by a PMF.
Continuous RV
Takes values in an interval. Described by a PDF. P(X = any single point) = 0.
Mixed RV
Part discrete, part continuous. Rare in basics, common in survival analysis.
PMF — probability mass function (discrete)

Toy example first. Let Z be the net gain from a game: you win \$2 with prob 1/4, win \$0 with prob 1/2, lose \$1 with prob 1/4.

z−102
P(Z=z)1/41/21/4

That table is the PMF. Two rules it must satisfy:

$$p(x) \ge 0 \quad \text{for all } x, \qquad \sum_{x} p(x) = 1$$
Each mass is non-negative; all masses sum to exactly 1.

Check: 1/4 + 1/2 + 1/4 = 1. ✓

PDF — probability density function (continuous)

For a continuous RV X, we cannot list P(X=x) — it is always zero. Instead, probability lives in intervals:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
f(x) is the density; probability = area under the curve between a and b.

Two rules for a valid PDF:

$$f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$
Non-negative everywhere; total area under curve = 1.
⚠ Clears up: "density can exceed 1 — that's not a probability!"

Consider Uniform(0, 0.5): f(x) = 2 for x ∈ [0, 0.5]. The density is 2, which is > 1. That's fine — density is not probability. Probability is area: ∫₀^0.5 2 dx = 1. ✓

The density at a point tells you how concentrated probability is near that point, not the probability itself.

⚠ Clears up: "P(X = any exact value) = 0 for continuous RVs"

P(X = 0.5) = ∫₀.₅^0.5 f(x) dx = 0. This does NOT mean X=0.5 is impossible — it just means the probability of hitting exactly 0.5 (out of infinitely many real numbers) is zero. What matters is probability over intervals.

CDF — the universal object

Both discrete and continuous RVs have a CDF:

$$F(x) = P(X \le x)$$
F(x) is the probability that X takes a value at most x. Defined for every real x.

Properties every CDF must have:

  • Non-decreasing: if a ≤ b then F(a) ≤ F(b).
  • Right-continuous: F(x) = lim_{t→x⁺} F(t).
  • Limits: F(−∞) = 0, F(+∞) = 1.

Discrete CDF example. For the game above (Z ∈ {−1, 0, 2}):

xx < −1−1 ≤ x < 00 ≤ x < 2x ≥ 2
F(x)01/43/41

The CDF jumps at each mass point; the jump size equals the PMF value there.

Going back: PDF = dF/dx (where it exists); PMF = jump size at each point.

Quantiles — reading the CDF backwards

The p-th quantile (or p-th percentile) of X is the smallest value q such that F(q) ≥ p:

$$Q(p) = F^{-1}(p) = \inf\{x : F(x) \ge p\}$$
Q(p) is the inverse CDF — the value below which p fraction of the distribution lies.

Key quantiles to know: Q(0.5) = median; Q(0.25) = first quartile; Q(0.75) = third quartile. For a standard Normal, Q(0.975) ≈ 1.96 — the famous "1.96σ" CI boundary comes straight from this.

Transformations of random variables

Given X with known distribution, define Y = g(X). What is Y's distribution?

CDF method (always works): F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y). Invert g, find the x-region, integrate f_X.

Jacobian formula (differentiable, monotone g):

$$f_Y(y) = f_X\!\left(g^{-1}(y)\right) \cdot \left|\frac{d}{dy}g^{-1}(y)\right|$$
The density of Y equals the density of X at the corresponding x-point, scaled by how much the transformation stretches or compresses the axis at that point.

Tiny example. X ~ Uniform(0,1); Y = −log X. Then g⁻¹(y) = e⁻ʸ and |d/dy e⁻ʸ| = e⁻ʸ. So f_Y(y) = 1 · e⁻ʸ = e⁻ʸ for y > 0 — that is the Exponential(1) density. This is how you transform a Uniform into an Exponential.

Simulation via inverse-CDF sampling

The inverse-transform method: if U ~ Uniform(0,1) and F is any CDF, then X = F⁻¹(U) has distribution F. This is the foundation of all basic random-number generation.

import numpy as np

# Sample from Exponential(lambda=2) using inverse-CDF
rng = np.random.default_rng(42)
U = rng.uniform(0, 1, size=10_000)   # Step 1: uniform samples
lam = 2.0
X = -np.log(U) / lam                 # Step 2: apply F_inv(u) = -log(u)/lambda
# X now has distribution Exponential(2)
print(X.mean(), X.var())             # Should be ~0.5, ~0.25

Why it works: P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x) since U is Uniform(0,1). The CDF of F⁻¹(U) is exactly F. ✓

📐 If asked to "sample from distribution X" — the rule

Trigger: "How would you draw samples from [some custom distribution]?"

  1. Write down the CDF F(x).
  2. Invert it analytically: solve F(x) = u for x.
  3. Draw u ~ Uniform(0,1), return x = F⁻¹(u).
  4. If F⁻¹ has no closed form: mention rejection sampling or MCMC as fallbacks.

Never: skip the inversion step — interviewers want to see you know why this works (the Uniform pushforward argument), not just the recipe.

◆ Interview probe

"Can a PDF value be greater than 1?" — Yes. Density is not probability. Show Uniform(0, 0.1) where f(x) = 10.

"What's the difference between a PMF and a PDF?" — PMF gives actual probabilities at points (sums to 1); PDF gives density (integrates to 1, point probabilities are zero).

"Why is the CDF called the 'universal object'?" — Because it exists for every RV (discrete, continuous, mixed) and you can recover the PMF or PDF from it.

✓ Remember
  • RV = function from Ω to ℝ. PMF sums to 1; PDF integrates to 1.
  • CDF F(x) = P(X ≤ x) works for all RV types; non-decreasing, right-continuous.
  • Density CAN exceed 1 — only areas (integrals) are probabilities.
  • Inverse-CDF sampling: draw U ~ Uniform(0,1), return F⁻¹(U) → has distribution F.
Tricky interview questions — chapter 05
Q1. Define a random variable formally. What distinguishes it from a random outcome?
A random variable X is a measurable function X: Ω → ℝ mapping each outcome ω in the sample space to a real number. A random outcome ω is the raw result of an experiment; a random variable assigns a numerical value to that outcome. One experiment supports multiple random variables simultaneously (e.g., die roll: X=face value, Y=1{even}).
Q2. A PDF f(x) = c·e^{−2x} for x ≥ 0, 0 otherwise. Find c and P(X > 1).
Normalizing: ∫₀^∞ c·e^{−2x} dx = c/2 = 1, so c = 2. This is Exponential(2). P(X > 1) = ∫₁^∞ 2e^{−2x} dx = e^{−2} ≈ 0.135. Key step: recognize the exponential form, use the normalization condition to find c first.
Q3. Explain why P(X = x) = 0 for a continuous RV but individual outcomes still "happen".
Probability for continuous RVs is defined over intervals, not points. The probability of any single point equals the integral over a zero-width interval, which is 0. Yet X does take a specific value in each trial — it's a quirk of measure theory. Intuition: the reals are uncountably infinite, so any single value has "measure zero." Probability is about density near a point, not the point itself.
Q4. What four properties must a CDF satisfy? Why does it have to be right-continuous?
(1) Non-decreasing; (2) F(−∞)=0; (3) F(+∞)=1; (4) right-continuous. Right-continuity is a convention: F(x) = P(X ≤ x) includes the point x (using ≤, not <). At a jump discontinuity, we take the value at the right-hand limit so that F(x) = P(X ≤ x) stays self-consistent.
Q5. You want to sample from a distribution with CDF F(x) = 1 − e^{−x²} for x ≥ 0. Give the algorithm.
Invert: set u = 1 − e^{−x²}, then x = √(−log(1−u)). Since 1−U is also Uniform(0,1) when U is, you can simplify to x = √(−log U). Algorithm: draw U ~ Uniform(0,1); return X = √(−log U). This distribution is Rayleigh-like (specifically Weibull(2,1)).
Q6. What is the 97.5th percentile of the standard Normal, and why does it appear in confidence intervals?
Q(0.975) = Φ⁻¹(0.975) ≈ 1.96. A 95% two-sided CI for a mean is x̄ ± 1.96·σ/√n because we need to exclude 2.5% in each tail. The 1.96 is literally the inverse CDF of the standard Normal evaluated at 0.975 — it's not magic, just a quantile.
Q7. If X has CDF F, what is the distribution of Y = F(X)?
Y ~ Uniform(0,1). Proof: P(Y ≤ u) = P(F(X) ≤ u) = P(X ≤ F⁻¹(u)) = F(F⁻¹(u)) = u. This is the probability integral transform. It's why the inverse-CDF method works: F(X) ~ Uniform, so F⁻¹(Uniform) ~ F.
Q8. What is the Jacobian formula for transforming a density, and when does it fail?
For monotone differentiable g: f_Y(y) = f_X(g⁻¹(y)) · |dg⁻¹/dy|. The |Jacobian| factor accounts for how much g stretches or compresses the real line near that point. It fails when g is not monotone (e.g., Y = X²) — then you must split the domain into monotone pieces and sum. For multivariate transformations, the scalar derivative becomes the absolute determinant of the Jacobian matrix.
Q9. A random variable has PMF p(k) = (1−p)^{k−1}p for k = 1,2,3,… Verify it's valid and compute P(X > 3).
This is the Geometric distribution. Sum: Σₖ₌₁^∞ (1−p)^{k−1}p = p · 1/(1−(1−p)) = 1. ✓ P(X > 3) = 1 − P(X ≤ 3) = 1 − [p + p(1−p) + p(1−p)²] = (1−p)³. The complementary probability of "first success after trial 3" equals the probability of "three consecutive failures" — this is the memoryless property at work.
Q10. In code you see `np.random.exponential(scale=2)`. What is the inverse-CDF formula being used internally?
Exponential(rate λ) has CDF F(x) = 1 − e^{−λx}. Inverting: x = −log(1−u)/λ. With scale = 1/λ = 2 (so λ = 0.5), the formula is X = −log(U) · 2. NumPy draws U ~ Uniform(0,1) and applies this formula. Knowing this lets you implement any exponential-family sampler from scratch, and explains why log(0) must be guarded against when U = 0.
TL;DR

A random variable maps outcomes to numbers. Discrete RVs have PMFs (probability at points); continuous RVs have PDFs (density, not probability — can exceed 1). The CDF F(x) = P(X ≤ x) is the universal descriptor for all types. Quantiles invert the CDF. Simulate any distribution by drawing U ~ Uniform(0,1) and applying F⁻¹(U).

06
PART II · RANDOM VARIABLES

The distribution zoo and how it hangs together

🎯Every named distribution is a specialist tool — learn the one-line story and you'll always reach for the right one.

Named distributions are not a menagerie to memorize — they form a family tree where each member arises from a simple story. This chapter traces the genealogy from Bernoulli to Poisson to Exponential to Gaussian, gives you the one-line story, mean/variance, and the interview tell for each, and closes with a master comparison table tying every distribution to its ML cameo.

Family tree of named distributions: Bernoulli → Binomial → Poisson; Geometric → Negative Binomial; Uniform → Exponential → Gamma; Gamma + Gaussian → the Normal as CLT limit.
The discrete family: Bernoulli → Binomial → Poisson → Geometric → Negative Binomial

Bernoulli(p) — Story: one coin flip with heads probability p.

$$P(X=k) = p^k(1-p)^{1-k}, \quad k \in \{0,1\}$$
k is 1 (success) or 0 (failure); p is the success probability.

Mean = p. Var = p(1−p). ML cameo: the output of a logistic classifier; every binary cross-entropy loss assumes Bernoulli noise.

Binomial(n, p) — Story: count the number of heads in n independent coin flips.

$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0,1,\dots,n$$
Choose k positions for heads out of n flips; each arrangement has probability p^k·(1−p)^{n−k}.

Mean = np. Var = np(1−p). Note: Binomial(1,p) = Bernoulli(p). Sum of independent Bernoullis is Binomial.

Poisson(λ) — Story: count rare events in a fixed time window (λ = expected count).

$$P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0,1,2,\dots$$
λ is both the mean and variance; e^{−λ} is the probability of zero events.

Mean = λ. Var = λ. The hallmark: mean equals variance. ML cameo: click counts, word counts in text (bag-of-words), arrival queues.

📐 Poisson limit theorem — when to use it

Trigger: "We have n trials, each succeeding with small probability p, and we care about the count."

If n is large and p is small so that λ = np stays moderate, Binomial(n,p) ≈ Poisson(λ). Rule of thumb: use Poisson when n ≥ 20 and p ≤ 0.05.

Never: use Poisson when p is not small — Binomial is exact, Poisson is an approximation.

Geometric and Negative Binomial

Geometric(p) — Story: number of trials until the first success (inclusive).

$$P(X=k) = (1-p)^{k-1}p, \quad k = 1,2,3,\dots$$
(1−p)^{k−1} is the probability of k−1 failures; p is the probability of success on trial k.

Mean = 1/p. Var = (1−p)/p². Caution: some textbooks define Geometric as the number of failures before success (starting at 0) — always check the support.

Memoryless property of the Geometric: P(X > m+k | X > m) = P(X > k). Past failures give no information about future success. The Geometric is the only discrete memoryless distribution.

Worked example: bus waiting

A bus arrives with probability p = 0.2 each minute. You just waited 5 minutes — what's the probability you wait at least 3 more minutes?

By memorylessness: P(X > 5+3 | X > 5) = P(X > 3) = (1−0.2)³ = 0.8³ = 0.512.

The 5 minutes already waited are completely irrelevant. The distribution has no memory of them.

Negative Binomial(r, p) — Story: number of trials until the r-th success. Geometric is the special case r=1.

Mean = r/p. Var = r(1−p)/p². ML cameo: over-dispersed count data where Var > Mean (unlike Poisson where they're equal); used for word frequency models in NLP.

The continuous family: Uniform → Exponential → Gamma → Gaussian

Uniform(a, b) — Story: every value in [a,b] equally likely.

$$f(x) = \frac{1}{b-a}, \quad x \in [a,b]$$
Constant density 1/(b−a) over the interval; zero outside.

Mean = (a+b)/2. Var = (b−a)²/12. Foundation of inverse-CDF sampling; Uniform(0,1) is the source of all pseudo-random numbers.

Exponential(λ) — Story: waiting time until first event in a Poisson process with rate λ.

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$
λ is the rate (events per unit time); higher λ = shorter waits.

Mean = 1/λ. Var = 1/λ². The only continuous memoryless distribution. CDF: F(x) = 1 − e^{−λx}. ML cameo: inter-arrival times in queuing models; lifetime distributions in reliability; the base distribution in many Bayesian priors.

Poisson process plain words: Events arrive randomly over time such that (1) the number of arrivals in any interval depends only on its length, not location; (2) arrivals in disjoint intervals are independent; (3) at most one arrival per infinitesimal instant. Then: counts are Poisson(λt); inter-arrival times are Exponential(λ).

Gamma(α, β) — Story: sum of α independent Exponential(β) waiting times.

$$f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x \ge 0$$
α = shape (how many exponentials summed); β = rate; Γ(α) = (α−1)! for integer α.

Mean = α/β. Var = α/β². Special cases: Gamma(1, β) = Exponential(β); Gamma(n/2, 1/2) = Chi-squared(n). ML cameo: conjugate prior for the Poisson rate and Gaussian precision.

The Gaussian (Normal) distribution — the CLT magnet

Normal(μ, σ²) — Story: the limiting distribution of sums of many small independent effects (Central Limit Theorem).

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
μ = mean (center); σ² = variance (spread); the bell curve. 68% of mass within ±1σ; 95% within ±2σ; 99.7% within ±3σ.

Mean = μ. Var = σ². Standard Normal: μ=0, σ=1, written Z ~ N(0,1). Any Normal can be standardized: Z = (X−μ)/σ.

ML cameos everywhere: linear regression assumes Gaussian noise (MLE → OLS); neural network weights often initialized as Normal; CLT justifies using Normal approximations for sums/averages; the Gaussian kernel in SVMs and GPs; the reparameterization trick in VAEs samples X = μ + σ·ε with ε ~ N(0,1).

⚠ Clears up: "Gaussian is not 'the default' — it has assumptions"

The Normal distribution assumes: symmetric, unimodal, exponential tail decay. It's wrong for count data (use Poisson/NegBin), bounded data (use Beta/Dirichlet), heavy-tailed data (use Student-t or Cauchy), and strictly positive data (use LogNormal or Gamma). The CLT gives Gaussians as limits for averages — not for individual observations.

Log-Normal(μ, σ²) — If log X ~ Normal(μ, σ²), then X ~ LogNormal. Story: product of many small positive multiplicative effects. Mean = e^{μ+σ²/2}. ML cameo: income distributions, stock prices, file sizes — any quantity that is bounded below by 0 and right-skewed.

Beta(α, β) — Story: a probability on [0,1] with α−1 pseudo-successes and β−1 pseudo-failures as prior evidence.

$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in [0,1]$$
B(α,β) = Γ(α)Γ(β)/Γ(α+β) is the normalizing constant. Mean = α/(α+β).

ML cameo: conjugate prior for a Bernoulli/Binomial proportion; Bayesian A/B testing; Thompson sampling for bandits.

Master comparison table
DistributionStory (one line)ParamsMeanVarML cameo
Bernoulli(p)One binary trialp ∈ (0,1)pp(1−p)Logistic output, binary CE loss
Binomial(n,p)Count successes in n trialsn,pnpnp(1−p)Proportion estimates, exact binomial CI
Poisson(λ)Count rare events in fixed windowλ>0λλWord counts (NLP), click events
Geometric(p)Trials to first successp ∈ (0,1)1/p(1−p)/p²Length of sequence until event
NegBin(r,p)Trials to r-th successr, pr/pr(1−p)/p²Over-dispersed counts in NLP
Uniform(a,b)All values equally likelya < b(a+b)/2(b−a)²/12Inverse-CDF sampling, random splits
Exponential(λ)Waiting time to first eventλ>01/λ1/λ²Inter-arrival times, survival models
Gamma(α,β)Sum of α exponential waitsα,β>0α/βα/β²Prior for Poisson rate / Gaussian precision
Normal(μ,σ²)CLT limit of many small effectsμ, σ²>0μσ²Linear regression noise, weight init, VAE
LogNormal(μ,σ²)Exp of Gaussian; product of effectsμ, σ²e^{μ+σ²/2}(e^{σ²}−1)e^{2μ+σ²}Income, file sizes, latency distributions
Beta(α,β)Prior probability on [0,1]α,β>0α/(α+β)αβ/((α+β)²(α+β+1))Bayesian A/B, Thompson sampling
Categorical(p)One draw from k classesp₁+…+pₖ=1Softmax output, language model tokens
📐 "When you hear X, reach for Y" — the interview tell

Trigger: Interviewer describes a scenario and asks you to model it.

  1. "Binary outcome, one trial" → Bernoulli
  2. "Count of successes in n fixed trials" → Binomial
  3. "Count of rare events in a fixed interval" → Poisson
  4. "Waiting time / time between events" → Exponential
  5. "Waiting for k-th event" → Gamma
  6. "Many small additive effects, averages, measurement noise" → Gaussian
  7. "A proportion or probability (value between 0 and 1)" → Beta
  8. "Strictly positive, right-skewed, products of factors" → LogNormal
  9. "Counts with Var > Mean (over-dispersed)" → Negative Binomial

Never: default to Gaussian without checking whether the quantity is bounded, discrete, or heavy-tailed. Always state the assumption you're making and why.

◆ Interview probe

"What's the difference between Poisson and Binomial?" — Binomial has fixed n trials; Poisson is the n→∞, p→0 limit where np=λ stays constant. Poisson is used when n is unknown or very large and p is small. The telltale: Poisson has mean = variance; Binomial has variance = mean·(1−p) < mean.

"Prove the Geometric is memoryless." — P(X > m+k | X > m) = P(X > m+k)/P(X > m) = (1−p)^{m+k}/(1−p)^m = (1−p)^k = P(X > k).

✓ Remember
  • Bernoulli → Binomial (sum n), Binomial → Poisson (rare limit), Geometric → NegBin (r successes).
  • Exponential is the continuous memoryless distribution; Geometric is the discrete memoryless distribution.
  • Poisson: mean = variance = λ. This is the diagnostic — over-dispersed data (Var > Mean) needs NegBin.
  • Normal is the CLT limit — for averages of many independent summands, not necessarily individual observations.
Tricky interview questions — chapter 06
Q1. A website gets 120 visits/hour on average. What's the probability of exactly 100 visits in one hour?
Visits ~ Poisson(120). P(X=100) = e^{−120}·120^{100}/100!. In practice use the log: log P = −120 + 100·log(120) − log(100!) ≈ −120 + 460.5 − 363.7 ≈ −23.2, so P ≈ e^{−23.2} ≈ 8×10⁻¹¹. The point here: computing Poisson PMF directly is numerically unstable — always work in log space.
Q2. Why does the Exponential distribution have the memoryless property, and what does it imply for modeling?
P(X > s+t | X > s) = P(X > t). Algebraically: the survival function S(x) = e^{−λx} satisfies S(s+t) = S(s)·S(t), which is the only function with this product property (Cauchy functional equation → exponential). Implication: past waiting time gives zero information about future waiting time. This is ideal for Poisson arrivals but wrong for human behavior (people get more likely to leave a queue the longer they wait → use Weibull/LogNormal).
Q3. Your count data has mean 5 and variance 20. Which distribution do you choose and why?
Negative Binomial, because Var > Mean (over-dispersion). Poisson forces Var = Mean = λ, so it would underestimate variance and produce too-narrow prediction intervals. NegBin has a free parameter (r or equivalently a dispersion parameter) that lets Var exceed Mean. Common in NLP (word counts), genomics (read counts), and web analytics (session lengths).
Q4. What is the relationship between the Gamma and Chi-squared distributions?
Chi-squared(k) = Gamma(k/2, 1/2). Specifically, if Z₁,…,Zₖ ~ N(0,1) independently, then Z₁²+…+Zₖ² ~ χ²(k). This is why Chi-squared appears in hypothesis tests about variances and goodness-of-fit: squared Normal deviations sum to Gamma-shaped distributions. Mean = k; Var = 2k.
Q5. Explain why the LogNormal is appropriate for latency distributions in distributed systems.
Latency is a product of many multiplicative factors (network hops, processing steps, each adding a random percentage overhead). By the multiplicative CLT, the log of a product of many small independent positive factors is approximately Normal, so the product itself is approximately LogNormal. This explains why latency distributions are right-skewed with a long tail — the 99th percentile is much larger than the mean, and log-transforming the data produces something roughly bell-shaped.
Q6. A coin has unknown bias p. You observe 7 heads in 10 flips. Write the likelihood L(p) and find the MLE.
X ~ Binomial(10, p). L(p) = C(10,7)·p⁷·(1−p)³. Log-likelihood: ℓ(p) = 7 log p + 3 log(1−p) + const. dℓ/dp = 7/p − 3/(1−p) = 0 → p̂ = 7/10 = 0.7. The MLE for a Binomial proportion is always the observed fraction. The combinatorial constant cancels in the differentiation — always work with log-likelihood to simplify.
Q7. What is the Poisson process and what are its three defining properties?
(1) Stationarity: the number of arrivals in any interval depends only on its length, not its position in time. (2) Independence: arrivals in disjoint intervals are independent. (3) Orderliness: P(2+ arrivals in tiny interval Δt) = o(Δt) — effectively at most one arrival per instant. Consequences: counts over [0,t] are Poisson(λt); inter-arrival times are Exponential(λ); superposition of two independent Poisson processes is Poisson(λ₁+λ₂).
Q8. The Beta(1,1) distribution — what is it, and why is it the "uninformative prior" for a coin's bias?
Beta(1,1) = Uniform(0,1) — every value of p in [0,1] is equally likely. It corresponds to 0 pseudo-successes and 0 pseudo-failures (α−1=0, β−1=0). When you observe k heads in n flips, the posterior is Beta(1+k, 1+n−k), which is the Laplace smoothed estimate. It's "uninformative" in the sense that it places equal prior probability on all bias values, though modern Bayesian statistics prefers other uninformative priors (Jeffreys prior for Bernoulli is Beta(1/2, 1/2)).
Q9. Two models predict daily sales: Model A uses Poisson(50), Model B uses Normal(50, 50). Compare them — which is more appropriate and when?
Both have mean 50. Poisson is discrete, non-negative, and has Var=Mean=50. Normal is continuous and can go negative. For daily sales counts (integer, ≥0, and roughly matching Var~Mean), Poisson is correct. Normal is an approximation that works well for large counts (by CLT) but can produce nonsense predictions (negative sales). If Var >> Mean (e.g., Var=500), neither Poisson nor Normal is right — use Negative Binomial.
Q10. What is the coefficient of variation (CV) and which distributions have a constant CV?
CV = σ/μ (standard deviation / mean). It measures relative variability. The Exponential distribution has CV = 1 always (σ = 1/λ = μ). The LogNormal has CV = √(e^{σ²}−1), which depends only on σ, not μ — so a family of LogNormals with the same σ but different μ all have the same CV. Gamma(α, β) has CV = 1/√α. CV is more meaningful than raw variance for comparing distributions with different means.
TL;DR

Distributions form a family tree: Bernoulli → Binomial → Poisson (rare-event limit); Geometric → Negative Binomial; Uniform → Exponential (memoryless waiting) → Gamma; all paths lead to Gaussian via CLT. Each has a one-line story; match the story to the data type (discrete/continuous, bounded/unbounded, Var vs Mean) to choose correctly.

07
PART II · RANDOM VARIABLES

Expectation, variance, and the indicator superpower

🎯Expected counts always decompose into a sum of indicators — and linearity of expectation makes the sum trivial even when the indicators are wildly dependent.

Expectation is the probabilistic center of mass, and variance measures how spread out the mass is. This chapter builds both from scratch, then reveals the indicator method — a single trick that dissolves most "expected count of X" problems in seconds. Along the way: LOTUS lets you compute $E[g(X)]$ without finding the distribution of $g(X)$, and Jensen's inequality tells you exactly which direction a nonlinear transformation biases an expectation — a fact that quietly underpins the ELBO in variational inference.

Expectation: long-run average and center of mass

If you run an experiment many times and average the outcomes, the number you converge to is the expectation. Formally, for a discrete random variable $X$:

$$E[X] = \sum_x x \cdot P(X = x)$$
Sum over every possible value $x$; weight each value by its probability; add up the weighted values.

Tiny example. A fair die: values 1–6, each with probability $\tfrac{1}{6}$.

$$E[X] = 1\cdot\tfrac{1}{6} + 2\cdot\tfrac{1}{6} + \cdots + 6\cdot\tfrac{1}{6} = \tfrac{21}{6} = 3.5$$
The expected value need not be a value the die can actually show — it is a theoretical average over infinitely many rolls.

For a continuous RV, the sum becomes an integral: $E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$, where $f$ is the PDF.

Center of mass picture. Imagine the probability mass function as a physical bar chart. The expectation is the balance point. Shifting all mass right shifts $E[X]$ right by the same amount — that's linearity in action.

Linearity of expectation — the superpower that works under dependence

For ANY random variables $X$ and $Y$ (dependent or not) and constants $a, b$:

$$E[aX + bY] = a\,E[X] + b\,E[Y]$$
Scale each term, then add expectations — no independence required, no joint distribution needed.
📐 Why "works under dependence" is the whole point

Trigger: you need $E[\text{sum of many things}]$ and the things are tangled together.

  1. Write the sum: $X = X_1 + X_2 + \cdots + X_n$.
  2. Apply linearity: $E[X] = E[X_1] + E[X_2] + \cdots + E[X_n]$.
  3. Compute each $E[X_i]$ individually — often trivial.

Never: try to find the joint distribution of all $X_i$ to compute the expectation. That's unnecessary and usually intractable. Linearity sidesteps joint distributions entirely.

⚠ Clears up: "But they're not independent!"

Linearity $E[X+Y] = E[X] + E[Y]$ holds for ALL pairs $(X,Y)$. No independence. No uncorrelatedness. Always. This is not obvious — it's a theorem. The proof: $E[X+Y] = \sum_{x,y}(x+y)P(X{=}x,Y{=}y) = \sum_x x P(X{=}x) + \sum_y y P(Y{=}y)$. Done. Dependence enters only for $\text{Var}(X+Y)$, not for $E[X+Y]$.

The indicator method — three worked classics

An indicator random variable $I_A$ for event $A$ is the simplest possible RV: it equals 1 if $A$ occurs, 0 otherwise. Its expectation is just the probability of the event:

$$E[I_A] = 1 \cdot P(A) + 0 \cdot P(A^c) = P(A)$$
The expected value of an indicator equals the probability of the event it tracks.

The strategy: write the count you want as $X = I_1 + I_2 + \cdots + I_n$, then use linearity to get $E[X] = \sum_i P(A_i)$. The $I_i$ are almost always dependent — it doesn't matter.

📐 Rule: expected COUNT of anything → sum of indicators, always

Trigger: "how many X satisfy some property?" or "how many pairs/matches/collisions?"

  1. Define one indicator per possible object: $I_i = 1$ if object $i$ has the property.
  2. Count = $X = \sum_i I_i$.
  3. $E[X] = \sum_i P(I_i = 1)$ by linearity.
  4. Each $P(I_i = 1)$ is usually easy — just the probability for a single object.

Never: enumerate outcomes directly. For $n$ objects this is $O(2^n)$ case work; the indicator method is $O(n)$ thought.

Classic 1: Expected number of fixed points of a random permutation.

Setup. Shuffle $n$ cards labeled 1 through $n$. A fixed point is a card in its original position. What is $E[\text{fixed points}]$?

Without indicators. You would need to enumerate all $n!$ permutations, count fixed points in each, and average. Hopeless for large $n$.

With indicators. Let $I_i = 1$ if card $i$ is in position $i$. Then fixed points $= \sum_{i=1}^n I_i$. By linearity:

$$E\!\left[\sum_{i=1}^n I_i\right] = \sum_{i=1}^n P(\text{card } i \text{ in position } i) = \sum_{i=1}^n \frac{1}{n} = 1$$
Each card is equally likely to land in any of the $n$ positions, so $P(I_i=1)=1/n$. Sum over $n$ cards: answer is exactly 1 regardless of $n$.

The indicators $I_1, \ldots, I_n$ are heavily dependent (if $n-1$ cards are fixed, the last must be too), yet linearity gives us the answer in two lines.

Classic 2: Expected number of coupons to complete the collection.

Setup. There are $n$ distinct coupon types, each draw uniformly at random. How many draws $T$ until you have at least one of each type? (The coupon collector problem.)

Decompose the waiting time. Let $T_k$ = draws to go from $k-1$ distinct types to $k$ distinct types. Then $T = T_1 + T_2 + \cdots + T_n$. At stage $k$ you already have $k-1$ types, so any new draw is a success with probability $\frac{n-(k-1)}{n} = \frac{n-k+1}{n}$. The number of draws at stage $k$ is Geometric with success probability $p_k = \frac{n-k+1}{n}$, so $E[T_k] = \frac{n}{n-k+1}$. By linearity:

$$E[T] = \sum_{k=1}^n \frac{n}{n-k+1} = n\sum_{j=1}^n \frac{1}{j} = n\,H_n$$
$H_n = 1 + \tfrac{1}{2} + \cdots + \tfrac{1}{n}$ is the $n$-th harmonic number; $H_n \approx \ln n + 0.577$. For $n=10$ coupons: $E[T] \approx 10 \times 2.93 = 29.3$ draws.

This is used in ML for estimating how many examples you need before seeing every class at least once — relevant to rare-category sampling and dataset coverage analysis.

Classic 3: Expected number of birthday pairs.

Setup. $n$ people, each born on one of 365 days uniformly. How many pairs share a birthday?

With indicators. There are $\binom{n}{2}$ pairs. For pair $(i,j)$ let $I_{ij} = 1$ if they share a birthday. Then:

$$E[\text{pairs}] = \binom{n}{2} \cdot P(\text{two people share a birthday}) = \binom{n}{2} \cdot \frac{1}{365}$$
Each of the $\binom{n}{2}$ pairs shares a birthday with probability $1/365$ (given person $i$'s birthday, person $j$ must match it exactly). For $n=23$: $\binom{23}{2}/365 = 253/365 \approx 0.69$ expected pairs.

This is consistent with the famous fact that 23 people gives ~50% chance of at least one shared birthday — there's roughly $0.69$ expected pairs, so one pair is quite likely.

Variance and covariance from first principles

Variance measures the average squared deviation from the mean:

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
The second form (computational formula) is often easier: compute $E[X^2]$ and $E[X]$ separately, then subtract the square of the mean.

The standard deviation $\sigma = \sqrt{\text{Var}(X)}$ restores the original units.

Covariance measures how two RVs move together:

$$\text{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$
Positive covariance: when $X$ is above its mean, $Y$ tends to be above its mean too. Zero does NOT imply independence — only that there's no linear trend.

Variance of a sum — the full formula:

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$$
For independent $X, Y$: $\text{Cov}(X,Y)=0$, so variances simply add. For dependent variables, the covariance term can make the sum's variance larger or smaller than the sum of individual variances.

Numeric 2-variable example. Let $X$ be the number shown on a fair coin mapped to $\{0,1\}$ (heads=1, tails=0), and $Y = 1 - X$ (the opposite). Then $E[X]=E[Y]=0.5$, $\text{Var}(X)=\text{Var}(Y)=0.25$.

$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0.25 = -0.25$ (since $XY=0$ always — they can't both be 1).

$$\text{Var}(X+Y) = 0.25 + 0.25 + 2(-0.25) = 0$$
$X+Y=1$ always (heads+tails=1), so of course its variance is zero. The formula captures this perfectly via the $-0.5$ covariance correction.

General formula for $n$ variables:

$$\text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i < j} \text{Cov}(X_i, X_j)$$
There are $\binom{n}{2}$ covariance pairs. If all $X_i$ are uncorrelated (not necessarily independent), the cross terms vanish and variances simply add. Independence implies uncorrelated; uncorrelated does NOT imply independence.
⚠ Clears up: linearity holds for E but NOT for Var

$E[X+Y] = E[X] + E[Y]$ always. But $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ only when $\text{Cov}(X,Y)=0$. Forgetting the covariance term when variables are dependent is a common interview error. The safe habit: always write the full formula and then check whether covariance terms are zero.

LOTUS — compute $E[g(X)]$ without finding the distribution of $g(X)$

The Law of the Unconscious Statistician (LOTUS) says: to find the expected value of a function $g(X)$, you don't need the distribution of $g(X)$ — just plug into the original distribution of $X$.

$$E[g(X)] = \sum_x g(x)\, P(X = x) \quad \text{(discrete)}$$
Weight each value of $g(x)$ by the probability that $X=x$, using $X$'s own PMF. No need to derive the PMF of $g(X)$ first.
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx \quad \text{(continuous)}$$
Same idea with $X$'s PDF. This avoids computing the Jacobian or the CDF of $g(X)$.

Example. $X \sim \text{Uniform}\{1,2,3,4,5,6\}$ (fair die). What is $E[X^2]$?

LOTUS: $E[X^2] = \sum_{x=1}^6 x^2 \cdot \tfrac{1}{6} = \tfrac{1+4+9+16+25+36}{6} = \tfrac{91}{6} \approx 15.17$. Then $\text{Var}(X) = E[X^2] - (E[X])^2 = \tfrac{91}{6} - 3.5^2 = \tfrac{91}{6} - \tfrac{49}{4} = \tfrac{182-147}{12} = \tfrac{35}{12} \approx 2.92$.

LOTUS is used constantly: computing variance (via $E[X^2]$), moments, moment-generating functions, and the KL divergence $E[\log p(X)/q(X)]$ — all applications of LOTUS.

Jensen's inequality — curvature bends expectations

For convex f: E[f(X)] ≥ f(E[X]); for concave f the inequality flips. Picture: a chord above a convex curve — averaging the outputs lands you on the chord, above the function at the average input.

$$\mathbb{E}[\log X] \le \log \mathbb{E}[X]$$
log is concave, so the average of logs sits below the log of the average — with equality only when X is constant.

Where it bites in ML: (1) log-likelihood of an average ≠ average log-likelihood — why ensembling probabilities then logging beats averaging log-probs; (2) the ELBO: log p(x) = log E_q[p(x,z)/q(z)] ≥ E_q[log p(x,z)/q(z)] — variational inference IS one application of Jensen; (3) E[1/X] ≥ 1/E[X] — why averaging rates and averaging times disagree (harmonic-mean traps in throughput math).

TL;DR

Linearity of expectation needs no independence — that's the superpower. Any "expected number of…" question becomes: define an indicator per potential event, compute each P(event), add. Variance is where dependence re-enters (the covariance cross-terms), LOTUS computes E[g(X)] without finding g(X)'s distribution, and Jensen warns you that nonlinear functions and expectations don't commute — with log-likelihoods as the ML cash-out.

Tricky interview questions — chapter 07
Q1. 100 people, random hat returns. Expected number who get their own hat — and why is independence irrelevant?
Indicator per person: P(own hat) = 1/100; E = 100 × 1/100 = 1. The indicators are dependent (if 99 match, the last must), but linearity never asks for independence — that's the entire trick. (Variance DOES need the dependence: it also equals 1 here, a classic follow-up.)
Q2. Coupon collector: expected boxes to collect all 50 coupons?
Stage k (have k−1 coupons): success probability (50−k+1)/50, expected wait 50/(51−k). Sum: 50 × (1/50 + 1/49 + … + 1/1) = 50·H₅₀ ≈ 50 × 4.499 ≈ 225. Decompose into geometric stages, use linearity over stage waits.
Q3. In n=23 people, expected number of PAIRS sharing a birthday?
C(23,2) = 253 pairs, each matching with probability 1/365: E = 253/365 ≈ 0.69. Note this complements the classic birthday probability (~50.7% at 23): expected pairs near 0.7 is exactly when P(at least one) crosses ~1/2 — the Poisson approximation P ≈ 1−e^(−0.69) ≈ 0.50 ties them together.
Q4. Var(X+Y) = 25, Var(X) = 9, Var(Y) = 4. What's Cov(X,Y) and corr?
25 = 9 + 4 + 2Cov → Cov = 6; corr = 6/(3×2) = 1. The variables are perfectly positively correlated — and the interviewer's follow-up is that corr = 1 means Y is an exact linear function of X.
Q5. Why is E[1/X] ≠ 1/E[X], and where does this bite in systems work?
1/x is convex for x>0, so Jensen gives E[1/X] ≥ 1/E[X]. Bite: average latency vs average throughput. If half your requests run at 1 req/s and half at 9 req/s, mean rate is 5, but mean time-per-request is (1 + 1/9)/2 ≈ 0.56s → effective rate 1.8, not 5. Mixing rates and times without care misprices capacity by multiples.
Q6. Expected number of runs in 10 fair coin flips (maximal blocks of same face)?
Runs = 1 + (number of switches). Indicator per adjacent pair switching: P = 1/2, 9 pairs → E[switches] = 4.5 → E[runs] = 5.5. The "1 +" boundary term is the part candidates forget.
Q7. LOTUS in one line: compute E[X²] for a die without finding the distribution of X².
E[X²] = Σ x²·P(X=x) = (1+4+9+16+25+36)/6 = 91/6 ≈ 15.17 — evaluate g at each x and weight by X's own probabilities; you never need to ask "what is the PMF of X²". Then Var = 91/6 − (3.5)² = 35/12.
Q8. A portfolio holds 3 assets each with variance 1 and pairwise correlation ρ. Variance of the equal-weight average — and the diversification limit?
Var(mean) = (1/9)[3·1 + 6ρ] = (1 + 2ρ)/3. As n→∞ with constant ρ: Var → ρ. Idiosyncratic risk diversifies away; the correlated component is the floor — the math behind "diversification can't remove systematic risk," and identically behind why correlated model errors cap ensemble gains.
Q9. Why does maximizing log-likelihood instead of likelihood change nothing — and what does Jensen warn about averaging them?
log is monotone, so the argmax is identical (and sums beat products numerically). Jensen's warning is about AVERAGING: mean log-likelihood ≤ log mean likelihood, so geometric-mean-style scores (perplexity) penalize occasional terrible predictions far more than arithmetic averaging would — one zero-probability event sends log-loss to −∞ while accuracy barely notices.
Q10. E[XY] = E[X]E[Y] — does this imply independence? Prove your answer with a 3-point example.
No — it's exactly uncorrelatedness. X uniform on {−1, 0, 1}, Y = X²: E[XY] = E[X³] = 0 = E[X]E[Y] (since E[X]=0), yet Y is determined by X. Independence requires factorization of the whole joint distribution, not one moment identity.
08
PART II · RANDOM VARIABLES

Joints, conditionals, and the tower rule

🎯Every regression, every mixture model, every conditional probability boils down to one idea: the joint distribution is the full story, and everything else is just a projection or a slice.

Two random variables together contain far more information than their individual distributions alone. This chapter builds the machinery of joint distributions from scratch—computing marginals and conditionals by hand from a 4-cell table, unpacking the subtle differences among covariance, correlation, and independence, and then proving why conditional expectation is the mathematically optimal predictor. The tower rule and law of total variance let you decompose hard expectations into manageable pieces; the bivariate Gaussian shows where all of this lands geometrically. This machinery underlies linear regression, mixture models, and latent-variable methods throughout ML.

Joint distributions: the full story in one table

A joint distribution over two random variables X and Y assigns a probability to every pair of values (x, y). For discrete RVs the joint PMF is:

$$p_{X,Y}(x,y) = P(X = x,\; Y = y)$$
p_{X,Y}(x,y): probability that X takes value x AND Y takes value y simultaneously.

All joint probabilities must sum to 1 over every pair of values. From the joint you can always recover the individual (marginal) distributions by summing out the other variable:

$$P(X = x) = \sum_{y} P(X = x,\, Y = y)$$
Sum over all values of Y; the y's "disappear," leaving the marginal distribution of X alone.
Worked example: 4-cell joint table

Let X = "first coin flip" and Y = "second coin flip," both fair. Each takes values in {0, 1} (0 = tails, 1 = heads). The flips are independent, so the joint probability of every pair is 1/4.

Y = 0 Y = 1 P(X = x) marginal
X = 0 1/4 1/4 1/2
X = 1 1/4 1/4 1/2
P(Y = y) marginal 1/2 1/2 1 ✓

Now make it more interesting. Suppose X ∈ {0, 1} with P(X=0) = 0.6, P(X=1) = 0.4, and given X the conditional distribution of Y is:

  • P(Y=1 | X=0) = 0.3 (so P(Y=0 | X=0) = 0.7)
  • P(Y=1 | X=1) = 0.8 (so P(Y=0 | X=1) = 0.2)

The joint table (each cell = P(X=x) × P(Y=y|X=x), verified to sum to 1):

Y = 0 Y = 1 Marginal P(X=x)
X = 0 0.6 × 0.7 = 0.42 0.6 × 0.3 = 0.18 0.60
X = 1 0.4 × 0.2 = 0.08 0.4 × 0.8 = 0.32 0.40
Marginal P(Y=y) 0.50 0.50 1.00 ✓

Arithmetic check: 0.42 + 0.18 + 0.08 + 0.32 = 1.00 ✓. Marginal P(Y=1) = 0.18 + 0.32 = 0.50 ✓.

Now recover a conditional from the joint. What is P(X=1 | Y=1)?

$$P(X=1 \mid Y=1) = \frac{P(X=1,\, Y=1)}{P(Y=1)} = \frac{0.32}{0.50} = 0.64$$
Numerator: the joint cell. Denominator: the marginal column sum for Y=1. This is just Bayes' rule in action.

Interpretation: even though only 40% of the population has X=1, among those with Y=1 the fraction rises to 64%—because X=1 makes Y=1 much more likely.

Covariance, correlation, and independence

Covariance measures how two RVs move together. Positive = they tend to rise together; negative = one rises as the other falls; zero = no linear tendency.

$$\text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]\,E[Y]$$
μ_X = E[X], μ_Y = E[Y]. The computational form on the right avoids centering: just compute E[XY] and subtract the product of means. Units: product of the units of X and Y.

Correlation is covariance scaled to [−1, 1] by dividing by both standard deviations:

$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \,\sigma_Y}$$
σ_X, σ_Y: standard deviations of X and Y respectively. ρ = ±1 only when Y is an exact linear function of X. Dimensionless.
⚠ Clears up: Corr = 0 does NOT mean independent

Correlation measures linear association only. The classic counterexample: let X ~ Uniform(−1, 1) and Y = X².

  • E[X] = 0 by symmetry of the Uniform.
  • E[XY] = E[X · X²] = E[X³] = 0 by symmetry (odd function integrated over symmetric interval).
  • Therefore Cov(X, Y) = E[XY] − E[X]E[Y] = 0 − 0·E[Y] = 0.
  • Correlation ρ(X, X²) = 0.
  • Yet Y = X² is a deterministic function of X — as far from independent as possible.

Independence requires all moments and all events to be uncoupled, not just the first-order linear relationship. Formally: X ⊥ Y iff P(X ∈ A, Y ∈ B) = P(X ∈ A)·P(Y ∈ B) for every measurable A, B.

The hierarchy: Independence ⟹ Uncorrelated. Uncorrelated does NOT ⟹ Independent. (Exception: jointly Gaussian RVs, where uncorrelated does imply independent.)

Independence
Joint = product of marginals for all events. The strongest condition. Knowing Y tells you nothing about X.
Uncorrelated
Cov(X,Y) = 0. Strictly weaker. Only rules out linear prediction from one to the other.
Cov(X,Y) > 0
X and Y tend to be above their means simultaneously. E.g., height and weight.
Cov(X,Y) < 0
When X is large, Y tends to be small. E.g., study hours and exam mistakes.
📐 If you get this question — the rule

Trigger: "Are X and Y independent if their correlation is zero?"

  1. Answer: Not necessarily. Correlation only captures linear dependence.
  2. Give the X, X² counterexample immediately: Uniform(−1,1), Y = X², Cov = 0, but Y is determined by X.
  3. State the exception: for jointly Gaussian RVs, ρ = 0 does imply independence.

Never: Confuse "uncorrelated" with "independent" — this is one of the most common mistakes in ML interviews.

TL;DR

Joint distributions are where probability becomes multivariate reality: marginals come from summing rows, conditionals from shrinking to a row and renormalizing, covariance measures linear co-movement (and only linear — corr 0 ≠ independent, the X, X² example), conditional expectation E[Y|X] is the best predictor of Y from X (regression's birthplace), and the tower rule E[E[Y|X]] = E[Y] plus the law of total variance let you compute by conditioning on what you wish you knew. For jointly Gaussian variables everything linearizes — which is exactly why Gaussians are the workhorse.

Tricky interview questions — chapter 08
Q1. Give a concrete pair of dependent but uncorrelated random variables, and explain in one line why correlation misses it.
X ~ N(0,1), Y = X². Cov(X,Y) = E[X³] − E[X]E[X²] = 0 − 0 = 0, yet Y is a deterministic function of X. Correlation only detects LINEAR association; Y's dependence on X is purely nonlinear (symmetric), so the linear probe reads zero.
Q2. Two coins: fair (p=0.5) and biased (p=0.9), picked with probability ½ each, flipped 10 times. Use the tower rule and total variance for the count's mean and variance.
E[N] = E[E[N|coin]] = ½(5) + ½(9) = 7. Var = E[Var(N|coin)] + Var(E[N|coin]) = ½(10·.5·.5) + ½(10·.9·.1) + [½(5−7)² + ½(9−7)²] = (1.25 + 0.45) + 4 = 5.7. The decomposition's story: 1.7 of the variance is coin-flip noise, 4.0 is not-knowing-which-coin — "within" plus "between."
Q3. Why is E[Y|X] called the best predictor — best in what sense, and what's the ML cameo?
Among all functions g(X), E[(Y − g(X))²] is minimized by g(X) = E[Y|X] — the MMSE property (proof: add and subtract E[Y|X], cross term vanishes by the tower rule). Every regression model is an attempt to approximate this conditional expectation; squared loss is precisely the choice that makes the conditional mean the target.
Q4. For a bivariate Gaussian with correlation ρ, what is E[Y | X = x], and what does it explain about "regression to the mean"?
E[Y|X=x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X) — linear in x. With ρ < 1, an x that is k standard deviations extreme predicts y at only ρ·k SDs extreme: tall parents → tall-but-less-extreme children. Regression to the mean is not a force; it's the geometry of imperfect correlation.
Q5. Var(X+Y) for correlated X, Y — formula and a portfolio one-liner.
Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y). Diversification works because assets with low or negative covariance make the cross term shrink or negative — total risk less than the sum. With ρ = 1 nothing is gained; with ρ = −1 risk can cancel entirely.
Q6. From a 2×2 joint table with P(A=1,B=1)=0.2, P(A=1)=0.4, P(B=1)=0.5 — independent? Compute the check.
Independence requires P(A=1,B=1) = P(A=1)P(B=1) = 0.4 × 0.5 = 0.2 — matches. But for full independence every cell must factorize; with binary variables, one cell matching forces the rest (complements), so yes, independent. With more than two levels, you must check all cells — partial matches prove nothing.
Q7. Why does the law of total variance guarantee Var(Y) ≥ Var(E[Y|X]), and what does the gap mean in an ML eval context?
Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) and the first term is non-negative. Var(E[Y|X]) is the variance your features can explain; E[Var(Y|X)] is irreducible noise given X. The ratio Var(E[Y|X])/Var(Y) is exactly R² 's population analogue — an upper bound on how well ANY model using X can do, which is why error floors exist no matter the architecture.
Q8. Covariance is bilinear: expand Cov(2X + 3, Y − 1) and state the two rules used.
Constants shift means but not co-variation: Cov(2X+3, Y−1) = 2Cov(X,Y). Rules: additive constants vanish (covariance is about deviations), scalar multipliers factor out (bilinearity). This is the algebra that makes variance-of-sums and CUPED-style adjustments mechanical.
Q9. You observe corr(ice cream sales, drownings) = 0.7. Walk the three causal readings and the variable that resolves it.
X→Y, Y→X, or confounder Z→both. Here Z = summer/temperature drives both; conditioning on season (stratify or regress out) collapses the correlation toward zero. The probe is testing that you say "correlation constrains, causation requires intervention or assumptions" with a concrete deconfounding move, not just the slogan.
Q10. In a recommender, user-level CTRs are estimated per segment. Why does the law of total variance argue for hierarchical/shrunk estimates?
Observed segment variance = within-segment sampling noise + true between-segment variance. Small segments are dominated by the first term — their raw CTRs scatter far more than the truth does. Shrinking each segment toward the global mean (weight ∝ segment sample size) subtracts the noise component, which is exactly empirical-Bayes logic — chapter 14's shrinkage, motivated by chapter 8's identity.
09
PART II · RANDOM VARIABLES

Inequalities, LLN, and the CLT

🎯Markov gives you a sledgehammer, Chebyshev a chisel, Hoeffding a laser — and the CLT explains why virtually everything in ML averages to a Gaussian.

This chapter is the bridge between a single random variable and the behaviour of many. We start with three concentration inequalities — Markov, Chebyshev, and Hoeffding — each bounding the probability of a large deviation, each sharper than the last. Then we meet the Law of Large Numbers (convergence of averages) and the Central Limit Theorem (shape of that convergence). Every confidence interval, every batch gradient, every error bar in an ML paper lives here.

The concentration toolkit

A concentration inequality answers: "How likely is a random variable to be far from its mean?" The three canonical tools trade assumption strength for tightness. We will carry one running example all the way through so the tightening is concrete.

Running example — a fair six-sided die

Roll a single fair die once. Let X be the outcome. Mean: $\mu = E[X] = 3.5$. Variance: $\sigma^2 = \text{Var}(X) = \tfrac{35}{12} \approx 2.917$. Standard deviation: $\sigma \approx 1.708$.

Question we will bound three ways: $P(X \ge 6)$. Exact answer (for calibration): $\tfrac{1}{6} \approx 0.167$.

1. Markov's inequality — the sledgehammer

Markov needs only one thing: X is non-negative. Nothing about variance or distribution shape.

$$P(X \ge a) \;\le\; \frac{E[X]}{a} \qquad (a > 0,\; X \ge 0)$$
The probability X reaches level a is at most the mean divided by a. Proof in two lines: $E[X] \ge a \cdot P(X \ge a)$ by splitting the expectation integral.

Running example. $a = 6$, $E[X] = 3.5$:

$$P(X \ge 6) \;\le\; \frac{3.5}{6} \approx 0.583$$
Markov bound: 58.3% — exact answer is 16.7%. The bound is not wrong; it is just very loose.

Why so loose? Markov knows nothing about spread. A distribution could put all its mass at $a - \varepsilon$ and still satisfy $E[X] = 3.5$. Knowing the variance fixes that.

2. Chebyshev's inequality — adding the variance

Apply Markov to the non-negative variable $(X - \mu)^2$. The result bounds the probability of deviating from the mean by more than $k$ standard deviations.

$$P\!\left(|X - \mu| \ge k\sigma\right) \;\le\; \frac{1}{k^2}$$
Within $k$ standard deviations of the mean with probability at least $1 - 1/k^2$. Requires only finite variance; distribution shape irrelevant.

Running example. $X \ge 6$ means $X - 3.5 \ge 2.5$, i.e., deviation $\ge 2.5$. With $\sigma \approx 1.708$: $k = 2.5 / 1.708 \approx 1.464$.

$$P(X \ge 6) \;\le\; P(|X - 3.5| \ge 2.5) \;\le\; \frac{1}{1.464^2} \approx 0.467$$
Chebyshev bound: 46.7% — still loose but tighter than Markov's 58.3%.

The $1/k^2$ decay is slow. Chebyshev is tight only for pathological distributions (two-point). For bounded random variables we can do exponentially better.

3. Hoeffding's inequality — exponential tails for bounded variables

When $X$ is bounded in $[a, b]$ (or more generally when we have a sum of independent bounded terms), Hoeffding gives an exponentially decaying bound — the real workhorse in PAC learning, bandit theory, and generalisation bounds.

$$P\!\left(\bar{X}_n - \mu \;\ge\; t\right) \;\le\; \exp\!\left(\!-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$
$\bar{X}_n$: sample mean of $n$ independent variables; each $X_i \in [a_i, b_i]$; $t$: the deviation threshold; bound decays exponentially in $t^2$ and $n$.

Running example, single die ($n=1$). $X \in [1, 6]$, range $= 5$, $t = 6 - 3.5 = 2.5$.

$$P(X \ge 6) = P(X - 3.5 \ge 2.5) \;\le\; \exp\!\left(\!-\frac{2 \cdot 1^2 \cdot 2.5^2}{5^2}\right) = e^{-0.5} \approx 0.607$$
For n=1 and a right-tail bound, Hoeffding is actually looser than Chebyshev here — the exponential advantage only kicks in clearly with sums of many variables.
Where Hoeffding wins: 100 dice, P(mean ≥ 4)

With $n=100$ dice, $t = 4 - 3.5 = 0.5$, range $= 5$:

Chebyshev: $P(\bar{X} - 3.5 \ge 0.5) \le P(|\bar{X}-3.5| \ge 0.5)$. $\text{Var}(\bar{X}) = \sigma^2/n \approx 2.917/100 = 0.02917$, $\sigma_{\bar X} \approx 0.1708$. $k = 0.5/0.1708 \approx 2.93$, bound $\le 1/k^2 \approx 0.117$.

Hoeffding: $\exp(-2 \cdot 100^2 \cdot 0.25 / (100 \cdot 25)) = \exp(-2) \approx 0.135$.

Here Chebyshev actually wins numerically, but Hoeffding's bound on the exact tail $P(\bar X - \mu \ge t)$ (one-sided) is $e^{-2} \approx 0.135$. At $t = 1$ (mean ≥ 4.5), Hoeffding gives $e^{-8} \approx 0.00034$ vs Chebyshev's $0.029$ — the exponential win is clear for larger deviations.

Markov
Needs: $X \ge 0$, E[X]. Bound: $\mu/a$. Use when that's all you have.
Chebyshev
Needs: finite variance. Bound: $1/k^2$. Universal; useful for small n.
Hoeffding
Needs: bounded support + independence. Bound: exponential in $t^2 n$. The PAC/bandit workhorse.
The Law of Large Numbers

Informally: "the sample mean converges to the true mean as $n \to \infty$." Formally, two flavours.

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \;\xrightarrow{\;p\;}\; \mu \qquad \text{(Weak LLN)}$$
For every $\varepsilon > 0$, $P(|\bar X_n - \mu| > \varepsilon) \to 0$ as $n \to \infty$. The sample mean gets close to $\mu$ with high probability. Proof: Chebyshev with $\text{Var}(\bar X_n) = \sigma^2/n \to 0$.
$$\bar{X}_n \;\xrightarrow{\;\text{a.s.}\;}\; \mu \qquad \text{(Strong LLN)}$$
Almost surely: the path $\bar X_1, \bar X_2, \ldots$ converges to $\mu$ with probability 1 — a stronger sample-path statement.
The Central Limit Theorem — why everything becomes Gaussian

Sum many independent, small-influence random effects and the sum's distribution approaches a Gaussian — regardless of the ingredients' shape. Formally: for i.i.d. X₁…Xₙ with mean μ and variance σ²,

$$\frac{\bar X_n - \mu}{\sigma/\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0,1)$$
The standardized sample mean converges in distribution to a standard normal: x̄ behaves like N(μ, σ²/n) for large n. The √n is the rate — error bars shrink as 1/√n.

Worked example: roll a fair die 100 times; P(sum > 380)? Per roll: μ = 3.5, σ² = 35/12 ≈ 2.917. Sum: mean 350, variance 291.7, SD ≈ 17.08. z = (380.5 − 350)/17.08 ≈ 1.79 (with continuity correction) → P ≈ 1 − Φ(1.79) ≈ 3.7%. A die is flat as can be — yet 100 of them produce a clean bell.

Sample-mean histograms of a skewed distribution at n = 1, 5, 30 — the bell emerges by n ≈ 30.
⚠ Clears up

The CLT is about the sample mean's distribution, not the data's. Your data stays skewed forever; it's the AVERAGE of n draws whose distribution normalizes. Also, "n ≥ 30" is folklore, not law: heavy skew needs more, near-symmetric needs fewer, and infinite-variance distributions (Cauchy) never comply.

Where this bites in ML
  • Mini-batch gradients are sample means — approximately Gaussian noise around the true gradient, with variance ∝ 1/batch size. Batch size is a noise dial; LR scaling rules are downstream of this fact.
  • Eval error bars: accuracy on n examples carries SE = √(p(1−p)/n) ≈ 0.5/√n at worst — on 1,000 examples that's ±1.5%, so a 0.8% "improvement" is noise. Always report n with the metric.
  • A/B tests stand entirely on CLT-normality of metric means — chapters 12–13 cash this in.
📐 If asked "estimate X from n samples, how confident are you" — the rule
  1. Estimate = sample mean x̄. Uncertainty: SE = s/√n.
  2. Quote x̄ ± 1.96·SE for 95% — licensed by the CLT once n is moderately large.
  3. Sanity-check the regime: heavy tails or tiny n → say "CLT is shaky here, I'd bootstrap or use t."
  4. If asked "how many samples for ±ε?": n = (1.96·σ/ε)² — precision costs quadratically.
TL;DR

The inequality ladder — Markov (mean only, crude) → Chebyshev (adds variance) → Chernoff/Hoeffding (adds independence, exponential) — buys guarantees with assumptions. LLN says sample means converge to the truth; CLT says HOW they fluctuate on the way: Gaussianly, with SD σ/√n. That √n is the master constant of applied statistics — it prices every error bar, every A/B test, and every "how much data do we need."

Tricky interview questions — chapter 09
Q1. A model's accuracy is "73.4% on the test set." What single follow-up exposes whether that number means anything?
"On how many examples?" SE ≈ √(0.734×0.266/n): at n=500 it's ±3.9% (95% CI) — a 2-point gain over a baseline is invisible; at n=50,000 it's ±0.39%. The reflex of attaching √(p(1−p)/n) to every reported metric is the cheapest credibility signal in ML interviews.
Q2. Use Markov, then Chebyshev, on: mean latency 100ms, SD 20ms — bound P(latency > 200ms).
Markov (mean only): P ≤ 100/200 = 50%. Chebyshev (k = 5 SDs): P ≤ 1/25 = 4%. The same question with more information yields a 12× tighter bound; with independence assumptions (Chernoff-style) it would be exponentially tighter still. The ladder is "pay assumptions, receive tightness."
Q3. Why does the gambler's fallacy NOT follow from the law of large numbers?
LLN says the PROPORTION converges: after 10 heads, the next million flips average ~50%, diluting the early excess — it never says tails become more likely to "compensate." The absolute excess of heads can even grow (like √n) while the proportion converges. Coins have no memory; LLN is about ratios, not corrections.
Q4. Hoeffding gives P(|x̄−μ| > ε) ≤ 2exp(−2nε²) for bounded variables. Invert it into a sample-size statement.
Set the bound to δ: n ≥ ln(2/δ)/(2ε²). E.g., ε = 0.01, δ = 0.05 → n ≥ ln(40)/0.0002 ≈ 18,400. This is the math behind "how many labeled examples to certify accuracy within 1%" — and the template for generalization bounds in learning theory (chapter cross-ref: ML theory's PAC section).
Q5. Batch size 256 → 4096. What happens to gradient noise, and why doesn't training just get 16× better?
Gradient SE shrinks √16 = 4× (variance ∝ 1/B). But optimization benefits saturate at the "critical batch size" where curvature/LR limits, not noise, bind — beyond it you spend 16× compute per step for marginal noise reduction. The CLT explains the dial; the diminishing returns explain why infinite batch isn't optimal.
Q6. P(sum of 100 die rolls > 380) — show the full pipeline.
μ=3.5, σ²=35/12 per roll → sum: mean 350, var 291.7, SD 17.08. z=(380.5−350)/17.08≈1.79 → P≈3.7%. Pipeline: per-unit moments → scale by n → standardize → normal table. Every "P(total exceeds…)" interview question is this pipeline wearing different clothes.
Q7. Your latency data is heavily right-skewed. Is the CI on the MEAN from ±1.96·s/√n valid? And is the mean even the right target?
With n in the thousands, CLT makes the mean's CI fine despite skew (slower convergence — check with bootstrap if n is hundreds). But for latency the mean is usually the WRONG target: tails drive user pain, so report p95/p99 — whose CIs need order-statistic or bootstrap methods, not the z-formula. Distinguishing "valid" from "relevant" is the senior move.
Q8. Why do Cauchy-distributed errors break both the LLN-in-practice and the CLT?
Cauchy has no finite mean or variance: sample means of Cauchy draws are themselves Cauchy — averaging a million observations leaves you exactly as uncertain as one. The theorems' moment conditions aren't lawyer fine print; they're load-bearing. Practical echo: ratio metrics with near-zero denominators behave Cauchy-ish, which is why delta-method/clipping discipline exists.
Q9. Two estimators are unbiased; one has variance 4/n, the other 1/n. Quantify "how much more data" the worse one needs.
Matching variances: 4/n₁ = 1/n₂ → n₁ = 4n₂. The worse estimator needs 4× the data forever — relative efficiency is a data-cost multiplier, which is why estimator choice (chapter 10's efficiency/CRLB) is a budget decision, not an aesthetic one.
Q10. Interviewer: "Give me one sentence on why the Gaussian is everywhere."
Because most observed quantities are sums of many small independent contributions, and the CLT says such sums are Gaussian no matter the pieces — plus it's the maximum-entropy distribution for a fixed mean and variance, so it's also the honest default when you know only those two numbers.
10
PART III · STATISTICS

Estimators: MLE, MAP, and how to judge them

🎯Every number you compute from data is an estimator — the question is whether it is a good one, and MLE gives you the gold standard baseline.

Statistics flips the probability arrow: given data you observed, what do you infer about the parameters that generated it? This chapter builds the vocabulary for judging any estimator (bias, variance, MSE, consistency), derives MLE from first principles for two workhorses — Bernoulli and Gaussian — and then shows how adding a prior converts MLE into MAP, which turns out to be exactly regularized regression in disguise.

What is an estimator?

An estimator θ̂ is any function of the data X₁,…,Xₙ. The key word is function — before data arrives, θ̂ is itself a random variable (different samples → different estimates). That randomness is what bias and variance measure.

Example: you flip a coin 10 times and see 7 heads. One natural estimator of the bias p is θ̂ = 7/10 = 0.7. Another is θ̂ = (7+1)/(10+2) ≈ 0.667 (Laplace smoothing). Which is better? That depends on the loss function and the true p — which is what bias/variance/MSE formalize.

The four properties you judge an estimator by
Bias
$\text{Bias}(\hat\theta) = E[\hat\theta] - \theta$. Zero is unbiased; sign tells you which direction the estimator systematically drifts.
Variance
$\text{Var}(\hat\theta)$. How much do estimates scatter across different samples of the same size?
MSE
$\text{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2]$. The total error. Decomposes cleanly.
Consistency
$\hat\theta \xrightarrow{p} \theta$ as $n\to\infty$. More data → converges to truth. Nearly every reasonable estimator is consistent.
$$\text{MSE}(\hat\theta) = \text{Bias}(\hat\theta)^2 + \text{Var}(\hat\theta)$$
Total squared error = (systematic drift)² + (random scatter). Derive: expand $E[(\hat\theta-\theta)^2]$, add and subtract $E[\hat\theta]$, and collect terms.

Derivation in 3 lines: Let $\mu = E[\hat\theta]$. Then $(\hat\theta - \theta) = (\hat\theta - \mu) + (\mu - \theta)$. Square and take expectation: the cross term vanishes (why? $E[\hat\theta - \mu]=0$), leaving $E[(\hat\theta-\mu)^2] + (\mu-\theta)^2 = \text{Var}(\hat\theta) + \text{Bias}^2$.

⚠ Clears up: the n vs n−1 variance estimator

The sample variance $S^2 = \frac{1}{n-1}\sum(X_i - \bar X)^2$ uses $n-1$ (Bessel's correction). Why not $n$?

Divide by $n$: $\hat\sigma^2_n = \frac{1}{n}\sum(X_i - \bar X)^2$. This estimator is biased: $E[\hat\sigma^2_n] = \frac{n-1}{n}\sigma^2 < \sigma^2$.

Intuition: you compute deviations from $\bar X$, the sample mean — not from the true $\mu$. The sample mean is "too close" to its own data because it is computed from those same points. This absorbs some of the variance, making the estimate systematically small. Dividing by $n-1$ corrects exactly for this one degree of freedom consumed by estimating the mean.

When it matters: for large $n$ the factor $(n-1)/n \approx 1$ and it barely matters. For $n=5$ it shrinks your estimate by 20% — that is a real bias.

MLE: the recipe

Maximum Likelihood Estimation answers: which parameter value makes the data I actually observed most probable?

📐 MLE recipe — memorize this 4-step loop
  1. Write the likelihood $L(\theta) = \prod_{i=1}^n p(x_i;\theta)$ (assuming i.i.d.).
  2. Take the log: $\ell(\theta) = \sum_{i=1}^n \log p(x_i;\theta)$. Products become sums; log is monotone so the argmax is the same.
  3. Differentiate $\frac{d\ell}{d\theta} = 0$ (or gradient = 0 for vector $\theta$).
  4. Solve for $\theta$. Check the second derivative is negative (concavity = maximum, not minimum).

Never: skip the log step — the product form is nearly always intractable to differentiate directly.

Worked: MLE for Bernoulli

Data: $n$ independent coin flips; $k$ heads. Unknown: bias $p \in [0,1]$.

Step 1 — Likelihood:

$$L(p) = p^k (1-p)^{n-k}$$
Each head contributes factor $p$; each tail contributes $(1-p)$; i.i.d. so multiply.

Step 2 — Log-likelihood:

$$\ell(p) = k\log p + (n-k)\log(1-p)$$
Log turns the product into a sum. Now it is just two terms.

Step 3 — Differentiate:

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$
Set derivative to zero; solve for the stationary point.

Step 4 — Solve: $k(1-p) = (n-k)p \Rightarrow k = np \Rightarrow \hat p_{\text{MLE}} = k/n$.

Sanity check: 7 heads in 10 flips → $\hat p = 0.7$. This is just the empirical frequency. MLE formalizes the obvious choice, which is reassuring.

Worked: MLE for Gaussian mean (known variance)

Data: $x_1,\dots,x_n \sim \mathcal{N}(\mu, \sigma^2)$, $\sigma^2$ known. Estimate $\mu$.

Step 1 — Likelihood:

$$L(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
Product of $n$ Gaussian PDFs, each centered at $\mu$.

Step 2 — Log-likelihood:

$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$
The first term is a constant w.r.t. $\mu$. Maximizing $\ell$ over $\mu$ is equivalent to minimizing $\sum(x_i-\mu)^2$ — i.e., least squares.

Steps 3–4 — Differentiate and solve:

$$\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \;\Rightarrow\; \hat\mu_{\text{MLE}} = \bar x = \frac{1}{n}\sum x_i$$
The MLE is the sample mean. This is why least-squares regression is the MLE under Gaussian noise.

Key insight: "maximize Gaussian likelihood" and "minimize sum of squared residuals" are the same problem. This is how Gaussian noise assumptions sneak into every regression.

MLE properties (the three free gifts)
Consistency
$\hat\theta_{\text{MLE}} \xrightarrow{p} \theta^*$ as $n\to\infty$ under mild regularity. More data always helps.
Asymptotic normality
$\sqrt{n}(\hat\theta_{\text{MLE}} - \theta^*) \xrightarrow{d} \mathcal{N}(0, I(\theta^*)^{-1})$ — the CLT for estimators. $I(\theta^*)$ is the Fisher information.
Invariance
If $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$ for any function $g$. No chain-rule gymnastics required.
Fisher information and the Cramér-Rao Lower Bound

The Fisher information measures how sharply the likelihood peaks around the true $\theta$ — how much information a single observation carries about $\theta$.

$$I(\theta) = -E\!\left[\frac{d^2}{d\theta^2}\log p(X;\theta)\right] = E\!\left[\left(\frac{d}{d\theta}\log p(X;\theta)\right)^2\right]$$
Fisher information = expected curvature (negative second derivative) of the log-likelihood. A sharp peak (high curvature) means more information.

Plain words: if observing $X$ barely changes how you think about $\theta$ (flat likelihood), information is low. If the likelihood is sharply peaked, every observation is highly informative.

$$\text{Var}(\hat\theta) \geq \frac{1}{n\,I(\theta)}$$
Cramér-Rao Lower Bound: no unbiased estimator can have variance below $1/(nI(\theta))$. The MLE achieves this bound asymptotically — it is efficient.
Sharp vs flat log-likelihood curves: high Fisher information (narrow peak, precise estimate) vs low Fisher information (wide plateau, uncertain estimate).
MAP estimation: MLE + a prior

MLE ignores everything you knew before seeing the data. MAP (Maximum A Posteriori) incorporates a prior belief $p(\theta)$ and maximizes the posterior:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \underbrace{p(\theta\mid x_1,\dots,x_n)}_{\text{posterior}} = \arg\max_\theta \left[\underbrace{\ell(\theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}}\right]$$
Maximize log-posterior = log-likelihood + log-prior. MLE is the special case where $\log p(\theta) = \text{const}$ (uniform prior).
Worked: Beta-Bernoulli MAP with pseudo-counts

Suppose you believe the coin is roughly fair before seeing any data. Encode this as a Beta(α, β) prior on $p$.

Prior: $p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. With $\alpha=\beta=2$: mild preference for fairness.

Data: $k$ heads in $n$ flips. Likelihood: $\theta^k(1-\theta)^{n-k}$.

Posterior (by Bayes):

$$p(\theta\mid\text{data}) \propto \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1} = \text{Beta}(k+\alpha,\; n-k+\beta)$$
Beta prior is conjugate to Bernoulli likelihood — the posterior is also Beta, just with updated parameters. No integral required.

MAP estimate: the mode of Beta$(k+\alpha, n-k+\beta)$ is:

$$\hat p_{\text{MAP}} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}$$
With $\alpha=\beta=2$: $\hat p_{\text{MAP}} = (k+1)/(n+2)$. Compare to MLE: $k/n$. The prior adds pseudo-counts $\alpha-1$ heads and $\beta-1$ tails to your data.

Numeric example: $n=10$, $k=1$, prior $\alpha=\beta=2$. MLE: $1/10 = 0.10$. MAP: $2/12 \approx 0.167$. The prior pulls toward 0.5 — sensible when data is scarce.

As data grows: with $n=1000$, $k=100$: MAP $= 101/1002 \approx 0.1008$, MLE $= 0.100$. The two pseudo-counts become negligible; MAP → MLE.

Pseudo-count interpretation: $\alpha-1$ is your "imaginary" prior heads, $\beta-1$ your prior tails. Beta(2,2) is like having seen one head and one tail before any real data.

⚠ Clears up: regularization IS a prior — the L2/L1 bridge

In linear regression, we often minimize a regularized loss. Where does the regularization term come from? It is exactly the MAP prior:

  • L2 regularization (Ridge): $\min_w \|Xw - y\|^2 + \lambda\|w\|^2$. This is MAP with a Gaussian prior $w_j \sim \mathcal{N}(0, 1/\lambda)$. Taking $-\log$ of the prior gives $\frac{\lambda}{2}\|w\|^2$ up to a constant.
  • L1 regularization (Lasso): $\min_w \|Xw - y\|^2 + \lambda\|w\|_1$. This is MAP with a Laplace prior $w_j \sim \text{Laplace}(0, 1/\lambda)$. The Laplace prior has heavier tails near zero — it encourages sparsity because the $|w|$ penalty makes exactly-zero solutions optimal at kinks.

Why does this matter? When an interviewer asks "why does L1 encourage sparsity and L2 doesn't?", the Bayesian answer is: the Laplace prior has a sharp peak at zero, so the mode of the posterior tends to have exact zeros. The Gaussian prior is smooth at zero, so it shrinks but does not zero out. This is a principled answer, not a geometric coincidence.

◆ Interview probe

"What happens to the MAP estimate as the sample size grows?" — Answer: the log-prior term becomes negligible relative to the growing log-likelihood; MAP converges to MLE. The prior only matters when $n$ is small relative to the strength of the prior.

"You trained Ridge regression. What prior does that imply?" — Gaussian prior on weights, centered at zero, with variance $1/\lambda$. The regularization strength $\lambda$ encodes how tightly you believe weights should stay near zero.

"Why is MLE biased for the variance estimator?" — The MLE of $\sigma^2$ for a Gaussian is $\frac{1}{n}\sum(x_i-\bar x)^2$, which has expectation $\frac{n-1}{n}\sigma^2$. The bias arises because we estimate $\mu$ from the same data, consuming one degree of freedom.

✓ Remember
  • MSE = Bias² + Variance. Unbiased ≠ best; sometimes a small bias buys large variance reduction.
  • MLE: write likelihood, log it, differentiate, solve. For Bernoulli: $k/n$. For Gaussian mean: $\bar x$ (same as least squares).
  • MAP = MLE + log-prior. With Beta prior on Bernoulli: pseudo-counts shift the MLE. Prior fades as $n\to\infty$.
  • L2 regularization ↔ Gaussian prior. L1 ↔ Laplace prior. Sparsity from L1 = sharp mode of Laplace at zero.
TL;DR

An estimator is any function of data; judge it by bias, variance, MSE, and consistency. MLE picks the parameter that maximizes the probability of your data — for Bernoulli it gives the sample fraction, for Gaussian it gives the sample mean, both provably efficient. MAP adds a prior and becomes MLE in disguise for large $n$; the Beta-Bernoulli case reveals the pseudo-count interpretation. Regularization is MAP: L2 = Gaussian prior, L1 = Laplace prior — the L1 sparsity story is just the Laplace mode being exactly zero.

Tricky interview questions — chapter 10
Q1. What is the difference between bias and variance in an estimator, and why does the trade-off matter?
Bias is the systematic error: $E[\hat\theta] - \theta$, the direction and magnitude of average drift. Variance is the random scatter: how much $\hat\theta$ changes across different datasets. MSE = Bias² + Variance, so you can reduce MSE by accepting some bias if it sharply cuts variance (e.g., Ridge regression). The trade-off matters because minimum-variance unbiased estimators (MVUE) sometimes have higher MSE than biased estimators — the Stein shrinkage result is the canonical example.
Q2. Derive the MLE for the parameter p of a Bernoulli distribution.
Observe $k$ successes in $n$ i.i.d. Bernoulli($p$) trials. Likelihood: $L(p)=p^k(1-p)^{n-k}$. Log-likelihood: $\ell(p)=k\log p + (n-k)\log(1-p)$. Derivative: $k/p - (n-k)/(1-p) = 0$. Solving: $k(1-p)=(n-k)p$, so $\hat p = k/n$. The MLE is the sample frequency — the obvious estimate, now rigorously justified as the maximum of the likelihood surface.
Q3. Why does minimizing sum-of-squared errors in regression correspond to MLE under Gaussian noise?
Assume $y_i = x_i^T w + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$. The log-likelihood is $\ell(w) = -\frac{1}{2\sigma^2}\sum(y_i - x_i^T w)^2 + \text{const}$. Maximizing over $w$ is equivalent to minimizing $\sum(y_i - x_i^T w)^2$. So OLS = MLE under Gaussian noise. If you assume Laplace noise instead, you get MAE/L1 regression as the MLE.
Q4. What is the Beta-Bernoulli conjugate pair and why is conjugacy useful?
A prior is conjugate to a likelihood if the posterior is in the same family as the prior. Beta is conjugate to Bernoulli: Beta$(\alpha,\beta)$ prior + $k$ heads in $n$ flips = Beta$(k+\alpha, n-k+\beta)$ posterior. Conjugacy is useful because (1) the posterior has a closed form — no integrals or MCMC needed, and (2) the MAP/mean/median have analytic expressions. In production, conjugate models update in O(1): just increment the sufficient statistics.
Q5. You train Ridge regression with penalty λ. What Bayesian prior does this correspond to, and what happens as λ→0 and λ→∞?
Ridge = MAP with Gaussian prior $w_j \sim \mathcal{N}(0, 1/\lambda)$. As $\lambda \to 0$: prior becomes flat (infinite variance), MAP→MLE, no regularization, risk of overfitting. As $\lambda \to \infty$: prior variance→0, prior says weights must be very close to zero, $\hat w \to 0$ (the null model). Optimal $\lambda$ balances prior belief against data evidence — select by cross-validation, which implicitly maximizes marginal likelihood over held-out data.
Q6. Why does L1 regularization produce sparse weights while L2 does not?
L1 corresponds to a Laplace prior with a sharp, non-differentiable peak at zero. The posterior mode can lie exactly at zero because the subgradient condition allows it. Geometrically, the L1 constraint set is a diamond with corners on the axes — the loss surface typically first touches a corner, giving an exactly-zero coordinate. L2 uses a smooth circular constraint; the loss surface touches it tangentially, never exactly at an axis. Bayesian reading: Laplace has pointmass-like density at zero; Gaussian's density at zero is positive but gentle, so the mode is merely near zero, not at it.
Q7. What does Fisher information measure, and how does it relate to the variance of any estimator?
Fisher information $I(\theta) = -E[d^2\ell/d\theta^2]$ measures the expected curvature of the log-likelihood — how sharply it peaks, i.e., how much information one observation carries about $\theta$. The Cramér-Rao Lower Bound says $\text{Var}(\hat\theta) \geq 1/(nI(\theta))$ for any unbiased estimator. The MLE achieves this bound asymptotically, making it efficient. Practically: $1/\sqrt{nI(\theta)}$ is the asymptotic standard error of the MLE, which is what software packages report.
Q8. How does the MAP estimate behave as the sample size n → ∞?
MAP maximizes $\ell(\theta) + \log p(\theta)$. As $n\to\infty$, $\ell(\theta)$ grows as $O(n)$ while $\log p(\theta)$ is fixed. The prior term becomes negligible relative to the likelihood — MAP converges to MLE. Intuitively, a finite amount of prior belief is washed out by infinite data. The prior only matters when $n$ is small or the prior is extremely strong (very tight). This is why adding a small regularization term is harmless for large datasets.
Q9. What is the MLE of the variance for a Gaussian, and is it biased? By how much?
$\hat\sigma^2_{\text{MLE}} = \frac{1}{n}\sum(x_i-\bar x)^2$. It is biased: $E[\hat\sigma^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$, so the bias is $-\sigma^2/n$. The unbiased estimator divides by $n-1$ (Bessel's correction). For $n=10$ the MLE underestimates by 10%. The intuition: deviations are measured from $\bar x$, not the true $\mu$, and $\bar x$ is always at least as close to the data as $\mu$, systematically deflating residuals.
Q10. A new estimator has lower MSE than the MLE but is biased. Would you use it?
Possibly yes. Lower MSE means lower total squared error — if the bias is small and the variance reduction is large, the biased estimator makes more accurate predictions on average. Stein's shrinkage estimator famously dominates the sample mean in dimensions ≥ 3 by exactly this logic. The answer depends on your loss function: if your downstream system penalizes squared error, MSE is the right criterion; if bias causes systematic downstream harm (e.g., unfairness), you might prefer the unbiased estimator despite higher MSE.
11
PART III · STATISTICS

Confidence intervals & the bootstrap

🎯A 95% CI does not mean there is a 95% chance the true value is inside — it means the procedure that built it covers the truth 95% of the time across repeated experiments.

Confidence intervals are the most widely misunderstood output of classical statistics. This chapter builds the z-interval from the CLT step by step, explains the frequentist meaning precisely, contrasts it with what a Bayesian credible interval actually means, covers the t-interval for small samples and the Wilson interval for proportions, and ends with the bootstrap — the universal tool for when the math gets too hard.

What a confidence interval actually means

Suppose you want to estimate a population mean $\mu$. You collect a sample, run a formula, and output an interval $[L, U]$. You call it a "95% confidence interval." What does that guarantee?

The correct statement: The procedure that generated $[L, U]$ is designed so that, if you repeated the entire experiment many times — drawing new samples, computing new intervals each time — 95% of those intervals would contain the true $\mu$.

The interval in your hands is fixed. Either $\mu$ is inside it (probability 1) or it is not (probability 0). You do not know which. The 95% is a property of the procedure, not of this particular interval.

⚠ Clears up: the five most common CI misreadings
  1. Wrong: "There is a 95% chance that $\mu$ lies in $[2.1, 3.4]$." — $\mu$ is fixed and unknown; it is not random. The interval is the random thing.
  2. Wrong: "95% of data points fall in the CI." — The CI is about the mean, not the distribution of individual observations (that would be a prediction interval, which is wider).
  3. Wrong: "If we repeat the study and get a different CI, the first one was wrong." — Both could be valid 95% CI procedures. Across many repetitions, 5% of valid CIs will not contain $\mu$.
  4. Wrong: "A narrower CI is always better." — A CI can be made arbitrarily narrow by using a lower confidence level, or by assuming a distribution that does not hold. Width only makes sense relative to the confidence level and model validity.
  5. Wrong: "p-value ≤ 0.05 ↔ 95% CI excludes zero." — This equivalence holds for two-sided tests of $H_0: \mu=0$ but not in general for other nulls or one-sided tests.
Building the z-interval from the CLT

Start from the CLT: for i.i.d. data with mean $\mu$ and variance $\sigma^2$, the standardized sample mean converges in distribution to standard normal:

$$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1)$$
$\bar X$: sample mean. $\sigma$: population std dev. $n$: sample size. $Z$: the standardized gap between estimate and truth.

Step 1: For large $n$, $P(-1.96 \leq Z \leq 1.96) \approx 0.95$ (standard normal table: 1.96 cuts off 2.5% each tail).

Step 2: Substitute $Z$ and rearrange:

$$P\!\left(-1.96 \leq \frac{\bar X - \mu}{\sigma/\sqrt{n}} \leq 1.96\right) = 0.95$$ $$\Updownarrow$$ $$P\!\left(\bar X - 1.96\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar X + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
Isolate $\mu$ by multiplying through by $\sigma/\sqrt{n}$ and subtracting $\bar X$. The resulting interval is centered on $\bar X$ and has half-width $1.96\,\sigma/\sqrt{n}$.

Result: the 95% z-interval is $\bar x \pm 1.96\,\hat\sigma/\sqrt{n}$, where $\hat\sigma$ estimates $\sigma$ from the data.

Numeric example: 100 users, sample mean session time = 4.2 min, sample std dev = 2.0 min. 95% CI: $4.2 \pm 1.96 \cdot (2.0/\sqrt{100}) = 4.2 \pm 0.39 = [3.81, 4.59]$ minutes.

What moves the width: half-width $= z^* \cdot \sigma/\sqrt{n}$. To halve the CI width, you need $4\times$ the sample size. To target a specific width $w$: $n = (2z^*\sigma/w)^2$.

t vs z: when σ is unknown and n is small

In practice, $\sigma$ is unknown and is replaced by the sample std dev $s$. When $n$ is large, $s \approx \sigma$ and the z-interval works fine. When $n$ is small, the estimation error in $s$ adds extra uncertainty — the t-distribution captures this.

$$T = \frac{\bar X - \mu}{s/\sqrt{n}} \sim t_{n-1}$$
Under normality of $X_i$, this pivotal quantity follows a t-distribution with $n-1$ degrees of freedom. The t-distribution has heavier tails than normal — wider CIs for the same confidence level.
n = 10
Use t-critical value $t_{9, 0.025} = 2.26$ instead of 1.96. CI is 15% wider.
n = 30
t-critical value ≈ 2.04, close to 1.96. Difference is minor.
n ≥ 100
t and z critical values are nearly identical. Use either.

Rule of thumb: use the t-distribution whenever $n < 30$ and the data are reasonably close to normal. For skewed distributions, bootstrap (below) is safer than t for small $n$.

Proportion CIs: Wald vs Wilson

For a proportion $p$ (e.g., click-through rate), the natural CI is the Wald interval: $\hat p \pm z\sqrt{\hat p(1-\hat p)/n}$. This works fine for moderate $p$ and large $n$, but it has a known failure mode:

⚠ Clears up: why Wald fails near 0 and 1

If you observe 0 successes in 20 trials, $\hat p = 0$ and the Wald CI is $[0, 0]$ — a point estimate with zero width. But you have real uncertainty! The true $p$ could be 0.05 and you just got unlucky. The Wald interval is over-confident at the boundaries because it uses $\hat p$ to estimate the variance of $\hat p$, which collapses to zero when $\hat p = 0$.

Wilson interval (the fix): invert the score test rather than the Wald test. The formula:

$$\tilde p \pm \frac{z^2\sqrt{\hat p(1-\hat p)/n + z^2/(4n^2)}}{1 + z^2/n}$$
Where $\tilde p = (\hat p + z^2/(2n))/(1 + z^2/n)$ is the "pulled" center. For 95% CI, $z=1.96$.

In practice: for $\hat p \in [0.1, 0.9]$ and $n \geq 30$, Wald is fine. Use Wilson whenever $n\hat p < 5$ or $n(1-\hat p) < 5$ — i.e., you have few successes or few failures. In A/B testing on low-volume pages (e.g., 2% CTR, $n=100$), always use Wilson.

The bootstrap: resampling as a general CI machine

The CLT-based z-interval requires knowing (or approximating) the distribution of $\hat\theta$. For the mean, the CLT does this. For a median, a ratio, or an AUC, the sampling distribution is complicated. The bootstrap sidesteps this by treating your sample as a proxy for the population and resampling from it.

Core idea: you have one sample of size $n$. You cannot afford another experiment. But you can simulate the experiment by drawing $n$ observations with replacement from your existing sample. Each bootstrap resample is a plausible dataset you might have gotten; computing $\hat\theta$ on each gives you a simulated sampling distribution of $\hat\theta$.

📐 Bootstrap recipe — 4 steps
  1. Compute $\hat\theta$ on original data of size $n$.
  2. For $b = 1, \dots, B$ (typically $B=1000$–$10000$):
    • Draw $n$ samples with replacement from the original data → bootstrap sample $X^{(b)}$.
    • Compute $\hat\theta^{(b)}$ on $X^{(b)}$.
  3. The bootstrap distribution of $\hat\theta^{(b)}$ approximates the sampling distribution of $\hat\theta$.
  4. 95% CI: take the 2.5th and 97.5th percentiles of $\{\hat\theta^{(b)}\}$ (percentile bootstrap), or use $\hat\theta \pm 1.96 \cdot \text{std}(\{\hat\theta^{(b)}\})$ (bootstrap standard error).

Never: resample without replacement — that just gives you the original sample permuted, not a new plausible dataset.

import numpy as np

def bootstrap_ci(data, stat_fn, B=5000, alpha=0.05):
    n = len(data)
    boot_stats = []
    for _ in range(B):
        resample = np.random.choice(data, size=n, replace=True)
        boot_stats.append(stat_fn(resample))
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
    return lower, upper

# Example: CI for the median
data = np.array([2.3, 1.1, 5.7, 3.2, 4.8, 2.9, 6.1, 1.8])
lo, hi = bootstrap_ci(data, np.median)
print(f"95% bootstrap CI for median: ({lo:.2f}, {hi:.2f})")
When bootstrap shines and when it fails
SituationBootstrap works?Why
Median, quantile, AUCYesNo analytic sampling distribution; bootstrap approximates it well.
Ratio metrics (revenue/user)YesCLT for ratios requires delta method; bootstrap is easier and nearly as accurate.
Weird statistics (Gini, correlation)YesSame reason.
Maximum (or minimum)NoExtremes depend on the tail, which the empirical distribution misrepresents.
Very small n (< 20)CautionEmpirical distribution is a poor proxy for the population; parametric bootstrap (assume a model) is better.
Dependent data (time series)Block onlyi.i.d. resampling breaks dependence structure; use block bootstrap.
◆ Interview probe

"Explain what a 95% CI means to a non-statistician." — "If we ran this study 100 times with new samples each time, about 95 of those intervals would contain the true value. This specific interval either contains it or doesn't — we just don't know which."

"Why is the bootstrap useful?" — It approximates the sampling distribution of any statistic without needing a closed-form formula. Especially valuable for medians, AUC, ratios, or any custom metric where analytic variance formulas are unknown.

"When would you use a t-interval vs a z-interval?" — Use t when $n < 30$ and data are approximately normal (the t accounts for uncertainty in estimating $\sigma$). For large $n$, both give essentially the same answer. For non-normal small samples, bootstrap is safer than either.

✓ Remember
  • 95% CI = the procedure covers truth 95% of repetitions. The specific interval's $\mu$ is either in or out — no probability statement about it.
  • z-interval: $\bar x \pm 1.96\,\hat\sigma/\sqrt{n}$. To halve width, quadruple $n$.
  • Use t instead of z when $n < 30$ and data near-normal. Use Wilson instead of Wald for proportions near 0 or 1.
  • Bootstrap: resample with replacement $B$ times, compute statistic each time, take 2.5th/97.5th percentiles. Fails for extremes and tiny samples.
TL;DR

A 95% CI is a statement about the procedure, not the particular interval: 95% of such intervals cover the truth. Build z-intervals from the CLT ($\bar x \pm 1.96\,\hat\sigma/\sqrt{n}$); use t for small $n$; use Wilson for proportions near 0 or 1. When the statistic is complex (median, AUC, ratio), bootstrap — resample with replacement, compute the stat $B$ times, take percentiles — is the universal fallback, except for extremes and tiny samples.

Tricky interview questions — chapter 11
Q1. Your A/B dashboard shows "CTR = 5.2% ± 0.3% (95% CI)". A PM says "there's a 95% chance the true CTR is between 4.9% and 5.5%." Correct them precisely.
In frequentist terms the true CTR is a fixed number — it's either in the interval or not; no probability attaches to it. The 95% refers to the PROCEDURE: across many repetitions, intervals built this way trap the truth 95% of the time. The PM's sentence is a Bayesian credible-interval statement, which requires a prior. (Practical note: with flat priors and lots of data the two roughly coincide — which is why the misreading rarely hurts, but saying it correctly is the interview point.)
Q2. Build the 95% CI for a mean from n=100, x̄=42, s=10 — and justify z vs t.
SE = s/√n = 1. With n=100, t₀.₀₂₅,₉₉ ≈ 1.984 vs z = 1.96 — nearly identical, so 42 ± 1.96×1 ≈ [40.04, 43.96]. Use t when σ is estimated and n is small (the extra width pays for estimating s); by n ≈ 30+ the difference is cosmetic. Saying "t, but it converges to z here" shows you know both the rule and its materiality.
Q3. Why does the Wald interval for a proportion fail near p = 0 or 1, and what do you use instead?
Wald uses SE = √(p̂(1−p̂)/n), which is 0 when p̂ = 0 — claiming perfect certainty after 0/50 successes, and its coverage oscillates badly for moderate n. Wilson inverts the score test (adds pseudo-counts ~z²/2), keeping intervals inside [0,1] with honest coverage; rule of thumb: Wilson (or Jeffreys/Clopper-Pearson) whenever events are rare — exactly the CTR/fraud regime ML lives in.
Q4. Bootstrap the 95% CI for a MEDIAN in pseudocode, and say why the formula-based approach is hard.
for b in 1..2000: sample n rows WITH replacement; record median_b. CI = [2.5th, 97.5th percentile of the medians]. The median's sampling distribution depends on the unknown density at the median (SE ≈ 1/(2 f(m) √n)) — estimating f is harder than the original problem, so resampling sidesteps it. This is the bootstrap's whole pitch: trade analytic derivation for compute.
Q5. When does the bootstrap FAIL, concretely?
(1) Extremes: max/min — the bootstrap max can never exceed the observed max, so the distribution is mis-shaped. (2) Tiny n: resampling 8 points explores ~few distinct datasets. (3) Heavy tails / infinite variance: resampled means don't stabilize. (4) Dependent data: i.i.d. resampling destroys time/cluster structure — need block or cluster bootstrap (resample whole users, not rows — the A/B-relevant case!).
Q6. Your metric is a ratio (revenue per click). Why is the naive per-row CI wrong, and what are the two fixes?
Both numerator and denominator are random and correlated within users; treating revenue/click rows as i.i.d. ignores that users contribute different numbers of clicks (within-user correlation). Fixes: delta method (variance formula for a ratio of means with their covariance) or cluster bootstrap at the user level. Same disease as randomization-unit mismatch — the unit of analysis must match the unit of independence.
Q7. n=400, CI width is 2 points; the PM wants width 0.5. How many samples, and what's the general law?
Width ∝ 1/√n. Shrinking width 4× needs 16× the samples: n = 6,400. The law: every halving of uncertainty costs quadrupled data — which is why the last decimal of precision is the expensive one, and why experiments are sized from the MDE backwards rather than run until "it looks tight."
Q8. Two 95% CIs overlap. The difference isn't significant, right?
Not necessarily — the overlap test is conservative. The SE of a difference is √(SE₁² + SE₂²) ≈ 1.41×SE (equal case), while requiring non-overlap demands the means be ~2×(1.96×SE) apart. Two CIs can overlap up to ~30% of their width while the difference is still significant at 5%. Always build the CI of the DIFFERENCE — also the cure for eyeballing error bars in papers.
Q9. The "learn more" trap: you compute a CI from the same data you used to select the metric that looked best. What's wrong?
Selection inflates: the winner among k noisy metrics is biased upward (winner's curse), and the CI computed as if pre-chosen doesn't account for the selection event — coverage drops below nominal. Cures: split data (select on one half, estimate on the other), pre-registration, or selective-inference corrections. Same mechanism as multiple testing, wearing an estimation costume.
Q10. Interviewer: "Why ±1.96 — where does it come from, in one breath?"
By the CLT the standardized sample mean is approximately N(0,1); the central 95% of a standard normal lies within ±1.96 (Φ(1.96) ≈ 0.975). Change the confidence level and the quantile moves — 90% → 1.645, 99% → 2.576. The number is just a normal quantile; the CLT is what licenses using it.
12
PART III · STATISTICS

Hypothesis testing without the cargo cult

🎯A hypothesis test is a trial: the null is innocent until the data is too surprising — and the p-value measures the surprise, not the guilt.

Chapters 10–11 built estimators and intervals; this chapter answers the binary question "is this effect real or noise?" You'll get the actual logic of testing (not the ritual), a precise p-value definition with the five classic misreadings, the Type I/II error trade and what moves power, the four workhorse tests each with a worked number, and the multiple-testing trap that quietly invalidates most dashboard "wins." Chapter 13 then puts all of it into production as A/B testing.

The logic: proof by statistical contradiction

Every hypothesis test runs the same four-step program. (1) Assume a boring world — the null hypothesis $H_0$ ("the coin is fair", "the new model is no better"). (2) Pick a test statistic that measures discrepancy from that boring world. (3) Compute how surprising your observed statistic would be if the boring world were true — that probability of surprise is the p-value. (4) If the data is too surprising (p below a pre-chosen threshold $\alpha$), reject $H_0$ in favor of the alternative $H_1$. It's proof by contradiction with probability instead of certainty: we never prove $H_1$; we show $H_0$ strains to explain the data.

The court-trial analogy maps every piece:

Defendant presumed innocent
$H_0$ assumed true until the evidence is overwhelming
Evidence presented
The data, summarized by the test statistic
"Beyond reasonable doubt"
The significance level $\alpha$ — how much doubt we tolerate, fixed before the trial
Guilty verdict
Reject $H_0$ — strong evidence against innocence
Not-guilty verdict
Fail to reject $H_0$ — which is not a proof of innocence, just insufficient evidence
Convicting the innocent
Type I error (false positive), rate controlled at $\alpha$
Acquitting the guilty
Type II error (false negative), rate $\beta$

The asymmetry is deliberate: $H_0$ gets the benefit of the doubt because false discoveries are expensive. That's also why "fail to reject" is the correct verdict language — a not-guilty verdict doesn't mean the defendant was proven innocent.

The p-value, precisely
$$p = P\big(\,T \ge t_{\text{obs}} \;\big|\; H_0 \text{ true}\,\big)$$
p is the probability, computed assuming the null hypothesis is true, that the test statistic T would come out at least as extreme as the value t-obs you actually observed (with "extreme" defined by the alternative: one tail, the other tail, or both).

Tiny example: you suspect a coin favors heads, flip it 100 times, and see 60 heads. Under $H_0$ (fair coin), heads count is roughly $\mathcal{N}(50, 25)$, so $z = (60-50)/5 = 2.0$. One-sided p-value: $P(Z \ge 2.0) \approx 0.023$. Reading: if the coin were fair, only about 2.3% of experiments would show 60 or more heads. At $\alpha = 0.05$ we reject fairness. Note what we did not compute: the probability that the coin is fair.

⚠ Clears up — the top 5 p-value misreadings
  1. "p is the probability $H_0$ is true." No. p is $P(\text{data this extreme} \mid H_0)$, not $P(H_0 \mid \text{data})$ — flipping that conditional needs a prior and Bayes' rule (ch14). A p of 0.04 can coexist with $H_0$ being probably true if true effects are rare.
  2. "p = 0.04 means a 4% chance this finding is a fluke." No. The false-discovery rate among your rejections depends on power and on how often nulls are true — it can easily exceed 30% even when every test uses $\alpha = 0.05$.
  3. "p > 0.05 means no effect / $H_0$ is true." No. Absence of evidence is not evidence of absence — an underpowered test misses real effects routinely. p = 0.4 means "this data can't distinguish the hypotheses," nothing more.
  4. "Smaller p means bigger effect." No. p conflates effect size with sample size: with n in the millions, a 0.01% lift yields p $< 10^{-6}$. Always report the effect size and CI alongside p.
  5. "p is the probability the result replicates." No. Replication probability depends on the true effect and the replication's power; p = 0.05 findings replicate far less than 95% of the time.
Type I, Type II, and the power knobs

Two ways to be wrong: convict the innocent (Type I, probability α — reject a true H0) and acquit the guilty (Type II, probability β — miss a real effect). Power = 1 − β is the probability you detect an effect that's really there.

KnobTurn it…PowerThe price
Sample size nup↑ (SE shrinks as 1/√n)Time and traffic
Effect size δbigger real effectsNot yours to choose — but you CAN target bolder changes
αup (e.g., .05→.10)More false positives
Outcome variance σ²down (CUPED, better metric)Engineering effort

Underpowered studies are double poison: they miss real effects, AND the effects they do "find" are exaggerated (only lucky overestimates cross the bar — the winner's curse).

The working tests, each with one number
One-sample z/t
Is the mean μ₀? t = (x̄ − μ₀)/(s/√n). Example: claim "mean latency 100ms"; n=49, x̄=106, s=14 → t = 6/2 = 3.0, df=48, p ≈ 0.004 → reject. (z when σ known or n large.)
Two-sample t
Do groups A and B differ? t = (x̄_A − x̄_B)/√(s²_A/n_A + s²_B/n_B) (Welch — unequal variances by default). Example: 5.0% vs 5.3% CTR, SE_diff = 0.1pp → t = 3.0 → significant; report the CI of the difference: 0.3pp ± 0.2pp.
Paired t
Same units measured twice (before/after, model A vs B on the SAME queries): test the per-unit differences. Pairing removes between-unit variance — the same trick as interleaving, and why offline model comparisons should always be paired by example.
Chi-square
Counts in categories vs expectation: χ² = Σ(observed − expected)²/expected. Example: 4,880 vs 5,120 users in two arms; expected 5,000 each → χ² = 120²/5000 × 2 = 5.76, df=1, p ≈ 0.016 — that's an SRM alarm, not a coin-flip fluke.
Multiple testing — the silent α factory

Run 20 independent tests at α = .05 on true nulls: P(at least one false positive) = 1 − 0.95²⁰ ≈ 64%, and you EXPECT one "discovery." Segments × metrics × weekly looks multiply into dozens of tests without anyone deciding to test anything.

Bonferroni
Test at α/m each. Controls "any false positive" (FWER). Brutally conservative at large m — power evaporates. Use for a handful of co-primary metrics.
Benjamini-Hochberg
Controls the false discovery RATE: of the things you call significant, ≤5% are false. Sort p-values, find the largest k with p₍ₖ₎ ≤ (k/m)α, call those significant. The right tool for exploratory segment scans.

Rule: FWER (Bonferroni) when a single false claim is expensive; FDR (BH) when you're screening many leads for follow-up.

📐 If asked "is this difference real?" — the rule
  1. Name H0 and the test that matches the data type (means → t; counts → chi-square; paired → paired t).
  2. Check the test's load-bearing assumptions out loud (independence! — clustered data needs clustered methods).
  3. Report three numbers, not one: effect size, its 95% CI, and p. Then judge PRACTICAL significance: a p=0.001 win of +0.01% CTR may be worthless.
  4. Disclose the testing context: how many metrics/segments/looks — and correct accordingly.

Never: say just "p < 0.05 so it's real." That phrase, alone, fails the round.

TL;DR

Hypothesis testing is a controlled-error decision procedure: assume H0, measure how surprising the data is (p), and reject only if it's surprising enough — accepting an α rate of false alarms and a β rate of misses you chose via the power knobs. The p-value is NOT the probability H0 is true, significance is NOT importance, and 20 uncorrected tests manufacture a discovery from nothing. Effect size + CI + p, assumptions checked, corrections disclosed — that trio is the literate answer.

Tricky interview questions — chapter 12
Q1. Define the p-value exactly, then give the two most damaging misreadings.
p = P(data at least this extreme | H0 true) — a statement about data under an assumption. Misreadings: "p is P(H0 true)" (that's a posterior; needs a prior — Bayes' territory) and "1−p is P(the effect replicates)" (replication depends on true effect and power; p=0.05 results replicate far less than 95% of the time). Getting the conditional's direction right is the entire definition.
Q2. Why does an underpowered study EXAGGERATE the effects it finds?
With power 30%, only samples whose noise happened to inflate the effect cross the significance bar — conditioning on significance selects the overestimates (winner's curse / Type M error). So small studies that "found" something report effects that won't reproduce at full size. This is also why launch decisions made on barely-significant small tests systematically disappoint at 100%.
Q3. n=49, x̄=106ms, s=14ms against the claim μ=100ms. Run the test and state the conclusion in plain words.
SE = 14/7 = 2; t = (106−100)/2 = 3.0 at df=48 → p ≈ 0.004 two-sided; 95% CI = 106 ± 4.0 = [102, 110]. Plain words: if the true mean were 100, data like ours would occur ~0.4% of the time — the claim is inconsistent with observation, and the latency exceedance is 2-10ms, which is the number the SLO conversation actually needs.
Q4. When is a paired design dramatically better than two-sample, and what's the ML instance?
When between-unit variance dwarfs the effect: pairing differences it away. ML instance: comparing models A and B, evaluate BOTH on the same test examples and test the per-example score differences (paired t / Wilcoxon) — example difficulty (the dominant variance) cancels. Two-sample comparison of A on one random half vs B on the other throws that cancellation away and can need 10-100× the data.
Q5. Your arm counts are 50.6%/49.4% on 100k users. Coincidence?
Chi-square: expected 50k each, χ² = 600²/50000 × 2 = 14.4, p ≈ 0.00015 — not coincidence; that's a sample-ratio mismatch. Stop analyzing outcomes and find the assignment/logging bug; non-random exclusion confounds every metric. (This question doubles as a check that you can actually run a chi-square in your head: (O−E)²/E summed.)
Q6. One-sided or two-sided test for an A/B — and the abuse to watch for?
Two-sided by default: regressions matter as much as wins. One-sided is legitimate only when the OTHER direction is decision-irrelevant AND it was declared before data — flipping to one-sided after seeing the direction is α-doubling dressed as methodology. In practice platforms run two-sided and the interview answer is "two-sided, pre-declared, here's the one legitimate exception."
Q7. Why does independence failure (clustering) wreck a t-test, and what's the magnitude?
n correlated observations carry fewer than n units of information; the naive SE divides by √n anyway, understating uncertainty by the design-effect factor √(1 + (m̄−1)ρ) (m̄ = cluster size, ρ = intra-cluster correlation). With 100-event users at ρ=0.1, SEs are ~√10.9 ≈ 3.3× too small — p-values catastrophically optimistic. Cure: analyze at the cluster (user) level or use cluster-robust errors. This is the same bug as request-level randomization, found at analysis time.
Q8. BH vs Bonferroni on 100 segment tests with 5 true effects — what happens under each?
Bonferroni tests at .0005 each: false positives ~0, but power collapses — you likely find 1-2 of the 5. BH adapts: with several small p-values present, its threshold relaxes, typically recovering most true effects while keeping ≤5% of DISCOVERIES false. Screening many leads → BH; certifying a single claim → Bonferroni. Saying which error currency each controls (FWER vs FDR) is the point.
Q9. "Statistically significant but practically meaningless" — construct the example and the decision rule.
n = 50M/arm: a +0.005% CTR (relative 0.1%) yields p < 0.001 — real beyond doubt, worth nothing against the feature's complexity cost. Decision rule: pre-set a minimum practical effect (the MDE should BE that number); ship logic compares the CI against the practical threshold, not against zero. Big data makes p→0 for everything; effect sizes are the adult currency.
Q10. Interviewer hands you p = 0.04 and asks "so there's a 96% chance the feature works?" Give the 30-second correction with a Bayesian bridge.
No — p conditions on H0, not on the data. To get P(effect | data) you need a prior: if only ~10% of features tried truly work, then by Bayes even a p=0.04 result might be a false alarm with probability ~30-40% (depending on power) — which matches the lived experience that "significant" launches often fizzle. If the stakeholder wants P(works), run the Bayesian readout (chapter 14) with an honest prior; if they want error control over many launches, p-values with discipline do that job.
13
PART III · STATISTICS

A/B testing: statistics meets production

🎯An A/B test is a hypothesis test wearing a hard hat: the math is Chapter 12, but the failures are all in the plumbing — what you randomize, when you look, and which segment table you believe.

This chapter walks one experiment end to end — metric choice, randomization unit, a real sample-size calculation, the readout, the decision — then catalogs the deadly sins that invalidate tests in practice (peeking, metric fishing, Simpson's paradox, novelty effects) and the production-grade upgrades: CUPED variance reduction, the delta method for ratio metrics, and cluster randomization for interference. This is the most interview-loaded chapter in Part III: every product data-science and ML loop asks some version of "design an A/B test for X."

Step 1 — Pick the OEC and the guardrails before you touch any data

The OEC (Overall Evaluation Criterion) is the single metric the test is allowed to move and the one your ship/no-ship decision keys on. Choosing it after seeing results is the original sin of experimentation — with 20 metrics on a dashboard, one will be "significant" at α=0.05 by luck alone (Chapter 12's multiple-testing arithmetic, now with a shipping decision attached).

A good OEC is (a) sensitive enough to move in a 2-week window, (b) causally close to the change, and (c) aligned with long-term value. Revenue is aligned but noisy and slow; click-through rate is sensitive but gameable (clickbait raises CTR while destroying trust). Running example for this chapter: a new ranking model for a feed, OEC = session click-through rate proxy: fraction of users who click at least one recommended item, baseline 5.0%.

Guardrails are metrics that must NOT regress: latency (p95), crash rate, unsubscribe rate, ad revenue. They get tested too — usually as non-inferiority checks ("rule out a drop bigger than 1%") rather than "detect any improvement." A test that wins the OEC but trips a guardrail does not ship.

Step 2 — Choose the randomization unit (this decides what your test means)

The randomization unit is the thing you flip the coin for. Two common choices:

User-level
Each user is hashed into control or treatment once and stays there. The unit of analysis matches the unit of randomization, so standard variance formulas are valid. Default choice.
Session/request-level
Each session (or query) is independently assigned. More statistical power (more units), but one user sees both experiences — bad for UI changes — and sessions from the same user are correlated, so if you analyze sessions as if independent, your standard errors are too small and false positives spike.
⚠ Clears up

"Randomization unit must be ≥ analysis unit." If you randomize by user but report a per-session metric, sessions within a user are correlated and the naive SE is wrong (too optimistic). Fix with the delta method (below) or bootstrap over users. Interviewers love this trap: "we randomized users but measured per-impression CTR — what's wrong with the t-test?"

The deeper assumption behind any unit choice is SUTVA — no interference: my treatment assignment doesn't change your outcome. Feeds with social sharing, marketplaces, and ride-sharing all violate it (treatment users buying up inventory makes control look worse, inflating the measured effect). We return to this at the end with cluster randomization.

Step 3 — Sample size, worked with real numbers

Plain words first: the smaller the effect you want to detect, the more users you need, because you must shrink the noise (standard error ∝ $1/\sqrt{n}$) until the effect stands out from it. The MDE (minimum detectable effect) is the smallest lift you'd care about shipping — a product decision, not a statistics one. Ours: baseline conversion $p_1 = 5.0\%$, MDE = 0.2 percentage points (so $p_2 = 5.2\%$, a 4% relative lift), two-sided $\alpha = 0.05$, power $1-\beta = 80\%$.

$$n \;=\; \frac{(z_{\alpha/2} + z_{\beta})^2\,\big[p_1(1-p_1) + p_2(1-p_2)\big]}{(p_2 - p_1)^2}$$
Users needed per arm. $z_{\alpha/2}=1.96$ is the significance bar (how surprising the data must be), $z_{\beta}=0.84$ is the power requirement (how reliably we want to clear that bar when the effect is real), the bracket is the combined Bernoulli variance of the two arms, and the denominator is the squared effect we're hunting. Halving the MDE quadruples $n$.

Plug in, step by step:

  1. $(z_{\alpha/2}+z_{\beta})^2 = (1.96+0.84)^2 = 2.8^2 = 7.84$
  2. $p_1(1-p_1) = 0.05 \times 0.95 = 0.0475$;   $p_2(1-p_2) = 0.052\times 0.948 = 0.0493$; sum $= 0.0968$
  3. $(p_2-p_1)^2 = (0.002)^2 = 4\times 10^{-6}$
  4. $n = \dfrac{7.84 \times 0.0968}{4\times 10^{-6}} \approx \dfrac{0.7589}{0.000004} \approx 189{,}700$ per arm → round up: ≈190,000 per arm, ≈380,000 users total.

Sanity-check the size: detecting a 0.2pp move on a 5% base rate is hunting a 4% relative change in a coin that comes up heads 1 time in 20 — of course it takes ~400k flips. If the site randomizes ~40,000 eligible users/day at full allocation, that's ~10 days of traffic; run 14 days anyway to cover two full weekly cycles (weekend users behave differently than weekday users).

◆ Interview probe

"What happens to $n$ if the MDE were 0.1pp instead?" Effect halves → denominator shrinks 4× → $n \approx 760{,}000$ per arm. Interviewers check you know $n \propto 1/\text{MDE}^2$, and that you treat MDE as the smallest effect worth shipping, not a guess at the true effect.

Steps 4–6 — Run, analyze, decide

Run: hash user IDs (with an experiment-specific salt — reusing one hash across experiments correlates assignments) into 50/50 arms; before unblinding the OEC, run an SRM check (sample-ratio mismatch): a chi-square test that the realized split is actually 50/50. Getting 189,000 vs 191,000 when you expected equal arms is a red flag for broken assignment or differential logging loss — the most common silent killer of real experiments. SRM fails → throw the test away, no exceptions.

Analyze: after 14 days, control: 9,500 conversions / 190,000 ($\hat p_C = 5.00\%$); treatment: 9,956 / 190,000 ($\hat p_T = 5.24\%$). Two-proportion z-test (Chapter 12):

$$z = \frac{\hat p_T - \hat p_C}{\sqrt{\bar p(1-\bar p)\left(\tfrac{1}{n_T}+\tfrac{1}{n_C}\right)}} = \frac{0.0024}{\sqrt{0.0512\cdot 0.9488 \cdot \tfrac{2}{190{,}000}}} = \frac{0.0024}{0.000715} \approx 3.36$$
Observed lift (0.24pp) divided by the standard error of the difference under the pooled rate $\bar p = (9{,}500+9{,}956)/380{,}000 = 0.0512$. A z of 3.36 means the lift is 3.36 standard errors from zero; two-sided p ≈ 0.0008.

Decide: report effect size with uncertainty, never p alone: lift = +0.24pp, 95% CI [+0.10pp, +0.38pp] ($0.0024 \pm 1.96\times 0.000715$), p ≈ 0.0008. The CI sits entirely above zero and its lower end (+0.10pp) — check it against the MDE bar and launch cost. Guardrails clean, SRM clean → ship. Note the honest caveat: the CI is wide relative to the point estimate; the true lift could plausibly be half or 1.5× what we measured.

Deadly sin #1 — Peeking: why looking early inflates false positives

The α=0.05 guarantee is for one look at the data. Peek daily and decide "stop, it's significant!" whenever $|z|>1.96$, and you are running many correlated tests, taking the maximum excursion of the z-statistic over time. Under $H_0$ the cumulative z wanders like a random walk around 0 — give it 10 chances to touch ±1.96 and it will, far more than 5% of the time. Simulation: checking 10 times during a null experiment and stopping at the first "significant" result yields a false-positive rate of ≈19%, not 5%. Peek continuously forever and the random walk crosses any fixed bar with probability → 1: every flat null "wins" eventually.

Cumulative z-statistic of a null A/B test over 14 days, wandering as a random walk and briefly crossing the ±1.96 band around day 5 — a stop-at-first-significance rule would have declared a false win.

The asymmetry that makes peeking poisonous: a true effect crossed-then-uncrossed still ends significant at day 14, but a null that grazes the line gets frozen as a win the moment you stop. Stopping rules harvest noise. Fixes, in order of practicality:

  • Don't peek — fix the horizon at design time, look once. Simple, valid, what most teams should do.
  • Alpha-spending / group-sequential tests (O'Brien–Fleming, Pocock): pre-plan K looks and spend the 5% budget across them with wider early bars (e.g., first look needs |z| > 4, final look ≈ 2.0). Early stopping for huge wins/harms stays legal.
  • Always-valid inference (mSPRT / confidence sequences, the "anytime-valid" machinery behind Optimizely-style dashboards): intervals that hold simultaneously over all times, so you may stop whenever you like — at the price of wider intervals.
Deadly sins #2–4 — Metric fishing, Simpson's paradox, novelty

HARKing on 20 metrics. The flat-OEC test where "engagement among Android users aged 25–34 is up, p=0.03!" With 20 metrics × a handful of segments, dozens of tests run implicitly; at α=0.05 you expect spurious wins (Chapter 12). Rule: one pre-registered OEC decides; secondary metrics and segments generate hypotheses for the next test, or must survive an FDR correction.

Simpson's paradox. Segment readouts can tell the exact opposite story from the aggregate when arm composition differs across segments. Concrete table — treatment loses in every segment yet wins overall:

SegmentControlCTRTreatmentCTR
Mobile90 / 1,0009.0%765 / 9,0008.5%
Desktop360 / 9,0004.0%38 / 1,0003.8%
Overall450 / 10,0004.5%803 / 10,0008.03%

Treatment is worse on mobile (8.5% < 9.0%) and worse on desktop (3.8% < 4.0%), yet "wins" overall (8.03% > 4.5%) — purely because the treatment arm is 90% mobile (the high-CTR segment) while control is 90% desktop. Nothing here is a math error; the aggregate is confounded by segment mix. In a properly randomized test both arms get the same mix, so this table is itself diagnostic: composition this lopsided means broken randomization (and the SRM check per segment would have caught it — e.g., a staged rollout that ramped treatment on mobile first).

Novelty and primacy effects. A shiny new button gets clicked because it's new (novelty: effect decays); a redesigned flow looks bad because users must relearn it (primacy: effect grows). Both make week-1 results lie about steady state. Diagnose by plotting the treatment effect by day or by user-exposure cohort: a real effect is flat, novelty slopes down, primacy slopes up. Mitigations: run ≥2 weeks, weight later days, or analyze only users past their first few exposures.

CUPED, ratio metrics, and interference — the three "senior" topics
CUPED
Users differ hugely in baseline behavior, and that variance drowns small effects. CUPED subtracts the predictable part: ŷ = y − θ(x_pre − x̄_pre), where x_pre is the SAME metric measured pre-experiment and θ = Cov(y, x_pre)/Var(x_pre). Variance drops by the squared correlation — pre-period CTR with ρ ≈ 0.7 cuts variance ~50%, i.e., half the sample size for the same power, for free.
Ratio metrics
CTR = Σclicks/Σviews is a ratio of two random, correlated, user-clustered quantities. The naive per-row variance is wrong; the delta method gives Var(A/B) ≈ (μ_A/μ_B)²[Var(A)/μ_A² + Var(B)/μ_B² − 2Cov/(μ_Aμ_B)], computed with user-level aggregates. Or cluster-bootstrap users. Wrong unit ⇒ overconfident CIs — the most common real-world A/B bug.
Interference
SUTVA assumes my treatment doesn't affect your outcome — false in marketplaces (treated buyers deplete shared inventory), social products (treated users message control users), and ads (shared budgets). Cure: randomize CLUSTERS that contain the spillovers — geos, social communities, time slices (switchback) — trading power for validity.
📐 "Design an A/B test for feature X" — the 8-step recital
  1. Hypothesis & OEC: one primary metric, declared before launch.
  2. Guardrails: latency, integrity, revenue — must-not-regress bounds.
  3. Randomization unit: user (default); cluster if interference; consistent hashing for stable assignment.
  4. Power & duration: n from baseline rate, MDE, α=.05, power=.8 — and run ≥2 full weeks regardless, for weekly cycles and novelty decay.
  5. Variance reduction: CUPED with pre-period data; trigger-based analysis if only some traffic is affected.
  6. Run discipline: no peeking (or pre-chosen sequential boundaries); SRM check (sample-ratio mismatch = broken assignment, invalidates everything).
  7. Analysis: effect + CI (not just p), guardrail readout, key segments (with multiple-testing care), days-curve for novelty.
  8. Decision: pre-agreed ship/iterate/abandon criteria; long-term holdback if the change is big.

Never: answer with only "split traffic 50/50 and t-test" — steps 2, 3, 6 are where real experiments die.

TL;DR

A/B testing is the CLT industrialized — but the statistics are the easy half. The hard half is discipline: pick the OEC and guardrails first, randomize at the unit where interference lives, size from the MDE (5%→5.2% CTR needs ~100k/arm, not 1k), never peek without sequential corrections, check SRM before believing anything, and read segments as hypotheses, not conclusions (Simpson's paradox is real and routine). CUPED is the free lunch; interference is the silent killer; the 8-step recital covers both.

Tricky interview questions — chapter 13
Q1. Size this test: baseline CTR 5%, MDE 0.2pp (5.0→5.2%), α=0.05, power 80%. Ballpark n per arm.
n ≈ 2(z₀.₉₇₅+z₀.₈)²·p̄(1−p̄)/δ² = 2(1.96+0.84)²(0.051)(0.949)/0.002² ≈ 2×7.84×0.0484/0.000004 ≈ 190k per arm. The shape to remember: n scales with p(1−p) and inversely with MDE² — halving the detectable effect quadruples the traffic bill. Quoting the formula's structure matters more than the exact constant.
Q2. Why does peeking inflate false positives — give the intuition and the legal alternatives.
Each look is another chance for noise to cross the threshold; "stop when significant" turns a 5% test into 25-40% false positives over many looks (the p-value path is a random walk that WILL touch 0.05 sometime under H0). Legal: fixed-horizon analysis, group-sequential boundaries (O'Brien-Fleming/alpha-spending), or always-valid inference (mSPRT) built into modern platforms. "We peeked but only stopped once" is not on the list.
Q3. Construct a Simpson's-paradox readout: treatment wins overall but loses every segment. Numbers, please.
New users: control 100/1000 = 10%, treatment 180/2000 = 9% (loses). Power users: control 600/2000 = 30%, treatment 290/1000 = 29% (loses). Overall: control 700/3000 = 23.3%, treatment 470/3000 = 15.7%… flip the mix: treatment skews to power users → overall treatment (180+290)/3000 vs control… the honest construction: treatment loses within each segment but ships MORE traffic into the high-rate segment, so its blended average wins. The lesson: blended metrics confound effect with mix — and randomization exists precisely to equalize mix; if segments shifted, check SRM/triggering before believing anything.
Q4. What is SRM and why does it invalidate the whole experiment rather than just adding noise?
Sample-ratio mismatch: you configured 50/50 but observe 50.6/49.4 with p < 0.001 under binomial — assignment or logging is broken (bots filtered asymmetrically, redirects dropping one arm, caching). It's not noise; it's evidence units were EXCLUDED non-randomly, so the arms differ in composition, not just treatment — every downstream comparison is confounded. SRM check is the first gate of any readout; failing it means investigate, never analyze.
Q5. CUPED with ρ = 0.6 between pre-period and experiment metric: what's the variance reduction and the sample-size implication?
Variance shrinks by ρ² = 36% → same power with 64% of the sample, or ~25% tighter CIs at fixed n. Requirements: pre-period data for the same units (new users have none — CUPED covers only returning users; analyze strata separately) and θ estimated from pooled data to stay unbiased. It's the highest-ROI line of code in experimentation platforms.
Q6. Marketplace test: treated buyers get better search. Why is user-randomization wrong, and what's the design?
Treated buyers purchase inventory control buyers would have bought — control outcomes are CONTAMINATED (treatment "wins" partly by cannibalizing control), violating SUTVA and typically overstating the effect. Designs: cluster by market/geo (whole city treated), switchback (whole marketplace alternates by time block, analyze with time-cluster errors), or buyer-AND-seller-side designs for two-sided effects. Naming cannibalization without prompting is the staff signal.
Q7. The treatment effect is +4% in week 1, +1.5% in week 2, +0.3% in week 3. Read it.
Classic novelty decay — users explore the new thing, then habituate. Extrapolating week 1 ships a mirage. Read the asymptote (or a cohort-by-exposure-age curve) as the durable effect; if it trends to ~0, the feature changed BEHAVIOR TIMING, not value. The inverse pattern (primacy — starts negative, improves) means learning costs; both argue for ≥2-week runs and against early stopping on excitement.
Q8. You ran 20 metrics; "time in app for Android tablet users" hit p=0.03. Ship the story?
Expected false positives at α=.05 across 20 tests ≈ 1 — this is likely it. Segments×metrics multiply silently into dozens of tests; BH/FDR control or treating segment hits as HYPOTHESES for a confirmatory follow-up is the discipline. The pre-declared primary metric exists precisely so this conversation is short.
Q9. Your metric is revenue-per-user, which is 95% zeros and wildly skewed. Does the t-test even work, and what's better?
With n in the tens of thousands per arm, CLT rescues the t-test on means even for zero-inflated skew (verify with bootstrap on smaller n). Better practice: winsorize/cap extreme spenders (pre-declared cap) to tame variance, analyze P(purchase) and revenue-given-purchase separately (decomposes the effect), and consider quantile treatment effects if the business cares about whales vs typical users differently.
Q10. PM: "The test is at p=0.06 after the planned run. Can we extend it a week to get it over the line?"
Extending BECAUSE it's near-significant is optional stopping in reverse — it inflates α just like peeking (you only extend the promising ones). Correct options: pre-registered group-sequential design next time; treat this run as evidence (effect + CI) and make the ship decision on effect size and cost-benefit, not the threshold; or rerun as an independent confirmation. Also say the quiet part: p=0.06 vs 0.05 is not a scientific boundary — the CI tells the PM what range of effects is consistent with the data.
14
PART III · STATISTICS

Bayesian statistics in practice

🎯Bayesian inference doesn't make probability "subjective" — it makes uncertainty a first-class citizen you can update the moment new data arrives.

This chapter closes the statistics part by grounding Bayesian reasoning in the methods that actually run in production: Beta-Bernoulli conjugacy, credible intervals, Bayesian A/B testing, Thompson sampling, and empirical-Bayes shrinkage. Every technique here solves a concrete problem that frequentist tools handle awkwardly or not at all — small-sample multi-arm decisions, continuous peeking, and pooling estimates across segments.

Frequentist vs Bayesian — one honest table

The debate often generates more heat than light. Here is what actually differs:

QuestionFrequentistBayesian
What is "probability"?Long-run frequency of repeatable eventsDegree of belief, updated by evidence
What is random?The data (θ is fixed, unknown)Both data and θ (θ has a distribution)
What does a 95% interval mean?The procedure covers the true θ in 95% of repeated experimentsThere is 95% posterior probability that θ lies in this interval
How is prior knowledge used?Not formally — only via study designExplicitly as a prior distribution
Result formatPoint estimate + CI + p-valueFull posterior distribution
When it shinesLarge-n, well-designed experiments, regulatory settingsSmall n, sequential decisions, many parameters, need for uncertainty quantification
When it strugglesSequential peeking; small-n many-group problemsPrior choice matters; computation can be expensive
⚠ Clears up

"Bayesian is subjective, frequentist is objective" — Both require choices. Frequentist analysis requires choosing the test statistic, stopping rule, and significance threshold. Bayesian analysis requires choosing the prior. Neither is assumption-free. The real question is: which assumptions are easiest to make explicit and defend for this problem?

Conjugacy: Beta-Bernoulli fully worked

A conjugate prior is a prior whose functional form is preserved under the likelihood — the posterior belongs to the same family. This turns inference into simple parameter arithmetic.

Setup. We are estimating the bias θ of a coin. We choose a Beta prior because it lives on [0,1] and has a natural pseudo-count interpretation:

$$\theta \sim \mathrm{Beta}(\alpha, \beta)$$
θ is our unknown probability; α and β are the prior parameters — think of them as "imaginary heads" and "imaginary tails" observed before the experiment begins.
$$p(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}, \quad \theta \in [0,1]$$
B(α,β) is the Beta function (a normalizing constant). The mean of Beta(α,β) is α/(α+β).

Choosing the prior. We start with Beta(2,2) — weakly centered on 0.5, expressing mild belief the coin is fair. The mean is 2/(2+2) = 0.5. This is equivalent to having seen 1 imaginary head and 1 imaginary tail (the shape α−1 = 1, β−1 = 1).

Observing data. We flip the coin 10 times and see 7 heads, 3 tails. The likelihood of this data given θ is Binomial:

$$p(\text{data} \mid \theta) \propto \theta^{7}(1-\theta)^{3}$$
We ignore the binomial coefficient because it doesn't depend on θ.

Updating (Bayes' rule). Multiply prior by likelihood:

$$p(\theta \mid \text{data}) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1} \cdot \theta^{7}(1-\theta)^{3} = \theta^{(\alpha+7)-1}(1-\theta)^{(\beta+3)-1}$$
Exponents add. This is exactly the kernel of a Beta distribution with updated parameters.
$$\theta \mid \text{data} \sim \mathrm{Beta}(\alpha + 7,\; \beta + 3) = \mathrm{Beta}(9, 5)$$
Prior parameters α=2, β=2 updated by heads=7, tails=3. The posterior is Beta(9,5).

What changed? The posterior mean is 9/(9+5) = 0.643. The prior mean was 0.5; the MLE would be 7/10 = 0.7. The posterior sits between them, pulled toward the MLE by data but tempered by the prior. With only 10 flips, the prior still matters.

Prior Beta(2,2) → likelihood peak at 0.7 → posterior Beta(9,5): the distribution narrows and shifts rightward as data accumulates.
Pseudo-count intuition

The conjugate update rule says: posterior parameters = prior parameters + data counts. This gives a clean mental model called pseudo-counts:

Beta(α, β) prior
Equivalent to having already seen α−1 imaginary heads and β−1 imaginary tails before data collection starts.
Beta(1,1) — uniform
Zero pseudo-counts: total ignorance, every θ equally plausible.
Beta(2,2)
1 pseudo-head + 1 pseudo-tail: mild centering on 0.5, effective sample size 2.
Beta(10,10)
Strong prior: 9 pseudo-heads + 9 pseudo-tails; data must overcome 18 virtual observations.
As n → ∞
The pseudo-counts become negligible, and the posterior mean converges to the MLE: α/(α+β) → (α + heads)/(α + β + n) → heads/n.

This is why regularization and Bayesian priors are the same thing in different clothes: L2 regularization on logistic regression is exactly MAP inference under a Gaussian prior; a Beta(α,β) prior on a Bernoulli rate is exactly adding α+β−2 pseudo-observations.

Credible intervals vs confidence intervals
⚠ Clears up

These sound identical but mean entirely different things.

  • Frequentist 95% confidence interval: A procedure that, if repeated on many datasets, would contain the true θ in 95% of cases. For this specific interval, θ is either in it or not — no probability statement about this interval is valid. The randomness is in the procedure, not in θ.
  • Bayesian 95% credible interval (highest-density interval, HDI): Given this data and this prior, there is a 95% posterior probability that θ lies in this interval. The randomness is in θ (it has a posterior distribution). This is the statement people intuitively want to make.

Numeric example. Beta(9,5) posterior for our coin: the 95% equal-tailed credible interval is approximately [0.38, 0.86]. Correct interpretation: "Given our prior and 7/10 heads, we believe θ ∈ [0.38, 0.86] with 95% probability." Wrong interpretation of a frequentist CI: "There's a 95% chance the true θ is in [0.42, 0.98]."

In practice: with flat (non-informative) priors and large n, credible and confidence intervals are numerically close. The difference matters most in small-n settings and when the prior is informative.

Bayesian A/B testing: P(B > A) directly

Classical A/B testing answers: "Is the difference statistically significant?" Bayesian A/B testing answers: "What is the probability that B is better than A, and by how much?"

Setup. Variant A has prior Beta(αA, βA) and observes nA conversions from NA trials. Variant B similarly. After observing data:

$$\theta_A \mid \text{data} \sim \mathrm{Beta}(\alpha_A + c_A,\; \beta_A + f_A)$$
c_A = conversions for A, f_A = failures for A. Same update as the coin example.
$$P(\theta_B > \theta_A) = \int_0^1 p(\theta_B) \int_0^{\theta_B} p(\theta_A)\, d\theta_A\, d\theta_B$$
The probability that a draw from B's posterior exceeds a draw from A's posterior. This integral has a closed form for Beta distributions; otherwise estimate by Monte Carlo in <10 lines.

Concrete example. Start both arms at Beta(1,1). After the experiment, A has 40 conversions / 200 trials (posterior Beta(41,161)), B has 52 / 200 (posterior Beta(53,149)). Sample 100k draws from each posterior; compute the fraction of times θB > θA. If that fraction is 0.97, you say: "There is a 97% posterior probability that B's true rate exceeds A's."

Pros of Bayesian A/B
Direct probability statement about which is better; no peeking problem (the posterior is always valid); can incorporate prior knowledge about baseline rates; gives the full distribution of the lift, not just "significant or not".
Cons of Bayesian A/B
Prior choice matters, especially at small n; "97% probability B is better" is not the same as controlling false-positive rate at 3% (depends on prior); harder to audit for regulatory/legal settings that require classical p-values; teams less familiar with Bayesian reasoning may misinterpret results.
The "no peeking" advantage
Classical tests have inflated type-I error if you peek and stop early. Bayesian posteriors are coherent at every sample size — P(B > A) at 50 observations is a valid probability statement, not a multiple-testing violation. You can stop when the posterior satisfies your decision criterion (say, >95% probability of winning, or expected loss < 0.1%).
Thompson sampling — the posterior as a decision engine

Choosing between arms (ads, headlines, ranking tweaks): sample each arm's conversion rate from its posterior, play the arm whose SAMPLE wins, update. Uncertain arms get explored exactly in proportion to their probability of being best — exploration falls out of Bayes for free.

« 8 lines, the whole algorithm »
alpha = [1, 1]; beta = [1, 1]          # Beta(1,1) priors per arm
for t in range(100000):
    theta = [random.betavariate(alpha[k], beta[k]) for k in (0, 1)]
    k = 0 if theta[0] > theta[1] else 1   # play the sampled best
    reward = pull(k)                       # 0 or 1
    alpha[k] += reward
    beta[k]  += 1 - reward

Production cameos: ad/creative selection, artwork choice, explore slots in recommenders. Strengths: no peeking problem, automatically tapers exploration. Caveats: delayed/batched rewards need care, and nonstationary rates need discounting or sliding windows.

Empirical Bayes shrinkage — the batting-average lesson

Early-season: one player bats 9/20 (.450), another 60/200 (.300). Best guess at true skill? NOT the raw averages — small samples scatter wildly around true skill (chapter 8's law of total variance). Fit a prior from ALL players (league skill ≈ Beta centered ~.27), then each player's posterior mean = weighted blend of their data and the league prior, weights ∝ sample size. The 9/20 hitter shrinks hard toward .27; the 60/200 hitter barely moves. Stein's lesson: shrinking EVERYONE toward the center beats trusting each raw average — uniformly.

Recsys cash-out: item CTR = (clicks + α)/(impressions + α + β) with α, β fit from the catalog — exactly this machinery; "smoothed CTR" is empirical Bayes wearing a production name.

📐 If asked "Bayesian or frequentist for this problem" — the rule
  1. Lots of data, repeated decisions, error-rate guarantees wanted → frequentist machinery is simpler and audited (A/B platforms).
  2. Small data, strong prior knowledge, or sequential decisions → Bayesian (cold-start priors, bandits, hierarchical estimates).
  3. Stakeholder asks "probability this is better?" → that's a posterior; answer it with Bayes or translate the question.
  4. Always state the prior and show sensitivity — a Bayesian answer whose prior is hidden is just opinion with integrals.
TL;DR

Bayes turns priors plus data into posteriors; conjugacy (Beta-Binomial) makes the update arithmetic — pseudo-counts in, posterior out, MAP→MLE as data swamps the prior. Credible intervals say what everyone wishes confidence intervals said. Thompson sampling turns posteriors into an explore/exploit policy with no peeking problem, and empirical Bayes shrinkage is the production workhorse: estimate the prior from the population, shrink every small-sample estimate toward it, win uniformly.

Tricky interview questions — chapter 14
Q1. Prior Beta(2,2), observe 7 heads in 10 flips. Posterior, posterior mean, and the pseudo-count story?
Posterior = Beta(2+7, 2+3) = Beta(9,5); mean 9/14 ≈ 0.643 — between the prior mean (0.5) and MLE (0.7). The prior acts as 4 phantom flips (2 heads, 2 tails); with 1000 real flips those 4 phantoms are noise — MAP→MLE as data grows.
Q2. Credible vs confidence interval — one sentence each, and when do they numerically agree?
Credible: "given the data and prior, θ lies here with 95% probability" (probability on θ). Confidence: "this procedure traps θ 95% of the time" (probability on the procedure). With flat priors and large n they coincide numerically — which is why the misreading of CIs usually goes unpunished, and why interviewers probe whether you know the difference anyway.
Q3. Why does Bayesian A/B testing have "no peeking problem," and what's the honest caveat?
The posterior is a valid summary of evidence at ANY moment — P(B>A | data) doesn't depend on your stopping intentions, so monitoring continuously is legitimate. Caveat: if you stop on a decision rule, the frequentist operating characteristics (how often you ship nulls) still depend on that rule and the prior's honesty — Bayes doesn't make bad priors or biased data safe, it just changes which guarantees you carry.
Q4. L2 regularization = which prior? L1? Derive the L2 case in two lines.
MAP with Gaussian prior: argmax log p(D|w) + log p(w), and log N(0, τ²) = −‖w‖²/(2τ²) + const → minimize loss + λ‖w‖², λ = σ²/τ². L1 = Laplace prior (log-density ∝ −|w|). Regularization IS a prior; the bridge question appears in some form in most ML loops.
Q5. Implement the smoothed-CTR formula and pick α, β for a catalog whose global CTR ≈ 2% with effective strength 100 impressions.
smoothed = (clicks + α)/(impressions + α + β) with α = 2, β = 98 (prior mean α/(α+β) = 2%, prior weight α+β = 100 impressions). An item at 3/50 reads (3+2)/(50+100) = 3.3% rather than 6% raw; one at 300/10,000 barely moves. Strength = how many real impressions it takes to overrule the prior.
Q6. Thompson sampling vs ε-greedy vs UCB in one breath each — and which ships most in industry?
ε-greedy: explore a flat ε of the time — simple, wastes exploration on known-bad arms. UCB: play the arm with the highest optimism bound — deterministic, strong theory, awkward with delayed/batched feedback. Thompson: sample posteriors, play the sampled best — exploration auto-proportioned to uncertainty, batch-friendly, and the one most production systems actually run.
Q7. Your Bayesian readout says P(B>A) = 96%. The frequentist test says p = 0.11. How can both be true, and what do you tell the PM?
Different questions: the posterior integrates over the effect's direction (with the prior doing work); the p-value asks how surprising the data is under exactly-zero effect. Near-threshold evidence + an informative prior can easily produce this split. Tell the PM: "evidence leans B — about 24-to-1 on direction — but the magnitude CI includes ~zero; ship if the cost of being wrong is low, extend if it's high." Decision framing beats threshold worship in both paradigms.
Q8. When does empirical-Bayes shrinkage actively mislead?
When the population isn't exchangeable with the unit: a genuinely novel item (different category, viral exposure) shrunk toward the catalog mean gets suppressed exactly when its difference matters; adversarial settings (fraudsters look like outliers BECAUSE they are); and when the prior is fit on a biased population (only surviving items). Fixes: shrink within hierarchies (category-level priors), and let strength decay as units accumulate their own data.
15
PART IV · INTERVIEW CRAFT

The puzzle playbook (the classics, with the patterns)

🎯Every probability puzzle is secretly one of seven patterns — name the pattern first, and the answer follows.

Interviewers recycle the same dozen puzzles dressed in different costumes. This chapter strips the costumes off: you will learn the seven patterns that generate almost every classic, together with the rule for recognizing each pattern in the first ten seconds and the 30-second answer you deliver before diving deeper. Two or three fully worked classics per pattern anchor the rules in memory.

Why patterns beat puzzles

Memorising 30 individual puzzles is fragile — a slight rewordering breaks recall. Internalising 7 patterns is robust: a new costume on a familiar skeleton is immediately recognisable. The workflow is always the same:

  1. Classify — which pattern does this smell like?
  2. State the 30-second answer — headline number or formula, out loud, immediately.
  3. Show the reasoning — derive it cleanly; the interviewer is watching your process.
  4. Sanity-check — plug in extreme values; does the answer make sense at the limits?
Pattern 1 — Symmetry: "all positions are equally likely"

When the problem has a uniform exchangeability structure — every seat, card, or person is interchangeable — you can skip counting and appeal directly to symmetry. The probability that some specific element lands in a particular slot equals 1/(number of slots).

📐 Pattern 1 — the rule

Trigger: "N people/cards/items are arranged randomly — what's the probability that [one specific item] ends up in [a specific position or relationship]?"

  1. Argue symmetry: all arrangements are equally likely, so any particular item occupies any particular slot with probability 1/N.
  2. If the event is a relationship (e.g., "A is before B"), reduce: exactly half of all arrangements have A before B, so P = 1/2.
  3. State the answer immediately; then justify with a two-line counting argument if pushed.

Never: enumerate arrangements for large N. Symmetry is exact and instantaneous.

Classic 1A — The airplane-seat puzzle (100th passenger)

30-second answer: The last passenger sits in their own seat with probability 1/2.

Setup. 100 passengers, 100 seats. Passenger 1 picks a random seat (theirs or someone else's). Each subsequent passenger sits in their assigned seat if it is free, otherwise picks randomly among the remaining seats. What is the probability passenger 100 ends up in seat 100?

Deeper reasoning. The key insight: the process ends the first time either seat 1 or seat 100 is chosen at random. By symmetry, these two outcomes are equally likely whenever a random choice occurs, regardless of how many other passengers have sat down. Therefore P(passenger 100 gets seat 100) = 1/2.

Formal proof by induction (give if pushed). Let $p_n$ = probability the last passenger gets their seat with $n$ passengers. For $n=2$: passenger 1 picks randomly, P = 1/2. For $n>2$: passenger 1 picks seat 1 (prob 1/n, last gets their seat), seat n (prob 1/n, last does NOT), or seat $k$ for $2 \le k \le n-1$ (prob $(n-2)/n$, reduces to subproblem of size $n-k+1$, which by induction also gives 1/2). So $p_n = 1/n + 0 + \tfrac{n-2}{n}\cdot\tfrac{1}{2} = \tfrac{1}{2}$ for all $n \ge 2$.

$$p_n = \frac{1}{2} \quad \text{for all } n \ge 2$$
$p_n$ = probability the $n$-th passenger sits in seat $n$; result is always one-half, independent of $n$.

Sanity check. $n=2$: passenger 1 picks randomly, so P = 1/2. ✓

Classic 1B — Card above the first ace

30-second answer: The expected number of cards you flip before seeing the first ace is 48/5 = 9.6.

Setup. A standard 52-card deck is shuffled uniformly at random. You flip cards one by one. How many non-ace cards appear before the first ace?

Symmetry argument. Think of the 52 cards as 4 aces and 48 non-aces. Consider the 5 "gaps" formed by the 4 aces in a random arrangement: the gap before the first ace, the gaps between consecutive aces, and the gap after the last ace. By symmetry, each of the 48 non-aces is equally likely to fall in any of the 5 gaps. So the expected number before the first ace is 48/5.

Alternative — indicator method. For each non-ace card $i$, let $I_i = 1$ if card $i$ appears before all 4 aces. Among card $i$ and the 4 aces (5 cards total), by symmetry $P(I_i=1)=1/5$. By linearity:

$$E[\text{non-aces before first ace}] = \sum_{i=1}^{48} \frac{1}{5} = \frac{48}{5} = 9.6$$
Each of 48 non-ace cards has probability 1/5 of being the "earliest" among itself and the 4 aces; linearity sums these indicators.
Pattern 2 — Condition on the first step

When the process is sequential and random, condition on what happens at time 1 and write a recursion. This turns a hard global question into a tractable local one. Most "how many flips/steps until…" questions live here.

📐 Pattern 2 — the rule

Trigger: "Expected number of flips/rolls/steps until [pattern/event]" — especially when the target is a sequence like HH or HTH.

  1. Define states that capture all the memory you need (which prefix of the target pattern you have seen so far).
  2. Write E[steps from state s] = 1 + (weighted average of next states).
  3. Solve the system of equations — usually 2–4 equations.
  4. State the final number and explain the key asymmetry (overlap vs no overlap).

Never: try to enumerate all possible flip sequences. The recursion is the way.

Classic 2A — Expected flips to see HH vs HT

30-second answer: HH takes 6 flips; HT takes 4 flips. They differ because HH has an "overlap" that resets progress.

Why they differ — the overlap insight. After seeing H, a T in the HT case completes the target. A T in the HH case doesn't waste the H — wait, it does: you need HH so a T resets to the start. But a second H completes. For HH: after seeing H and then H, you're done; but after seeing H and then T, you're fully back to start (no overlap). For HT: after seeing H and then H, you're still in state "last flip was H" — the new H is potentially the start of HT. So HT has an overlap that HH lacks, making HT easier to achieve.

HT — 3-state solution. States: $S_0$ = start, $S_H$ = last flip was H, $S_{HT}$ = done.

  • From $S_0$: flip H (prob 1/2) → $S_H$; flip T (prob 1/2) → $S_0$. So $E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0$.
  • From $S_H$: flip T (prob 1/2) → done; flip H (prob 1/2) → $S_H$. So $E_H = 1 + \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}E_H$.

Solving: $E_H = 2$; $E_0 = 1 + \tfrac{1}{2}\cdot 2 + \tfrac{1}{2}E_0 \Rightarrow E_0 = 4$.

HH — 3-state solution. States: $S_0$ = start, $S_H$ = last flip was H, $S_{HH}$ = done.

  • From $S_0$: flip H → $S_H$; flip T → $S_0$. So $E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0$.
  • From $S_H$: flip H (prob 1/2) → done; flip T (prob 1/2) → back to $S_0$ (no overlap!). So $E_H = 1 + \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}E_0$.

Solving: $E_0 = 1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0$; $E_H = 1 + \tfrac{1}{2}E_0$. Substituting: $E_H = 1 + \tfrac{1}{2}(1 + \tfrac{1}{2}E_H + \tfrac{1}{2}E_0) = \ldots$ → $E_0 = 6$.

$$E[\text{flips to HT}] = 4, \quad E[\text{flips to HH}] = 6$$
HT: expected 4 flips because a T resets only to "last was H" not to start; HH: expected 6 flips because T after H resets to start with no overlap.
Classic 2B — Gambler's ruin (lite)

30-second answer: Starting with $k$ dollars, playing against a casino with $N$ dollars total (fair coin), probability of ruin is $(N-k)/N$ and expected duration is $k(N-k)$.

Setup. A gambler starts with $\$k$. At each step: win $\$1$ (prob $p$) or lose $\$1$ (prob $q=1-p$). Game ends when fortune hits 0 (ruin) or $N$ (win). What is the probability of ruin?

Condition on first step. Let $r_k = P(\text{ruin} | \text{start at } k)$. Then $r_k = p \cdot r_{k+1} + q \cdot r_{k-1}$ with $r_0=1$, $r_N=0$. This is a second-order linear recurrence.

Fair game ($p = q = 1/2$). The recurrence $r_k = \tfrac{1}{2}r_{k+1} + \tfrac{1}{2}r_{k-1}$ has solutions $r_k = A + Bk$. Boundary conditions: $r_0=1 \Rightarrow A=1$; $r_N=0 \Rightarrow B = -1/N$. So:

$$r_k = 1 - \frac{k}{N} = \frac{N-k}{N}$$
$r_k$ = probability of ruin starting at $k$; $N$ = total chips in game; fair game, ruin probability is linear in starting wealth.

Intuition. In a fair game, the gambler's fortune is a martingale — its expected value never changes. So $E[\text{final fortune}] = k$. With probability $r_k$ the final fortune is 0 and with probability $1-r_k$ it is $N$: so $k = 0 \cdot r_k + N(1-r_k) \Rightarrow r_k = (N-k)/N$. The martingale optional stopping theorem makes this rigorous.

Pattern 3 — Indicators + linearity of expectation

Linearity of expectation holds even under dependence. Any "expected count" question can be written as a sum of indicator random variables, each with a simple probability. The indicators may be wildly dependent — it does not matter.

📐 Pattern 3 — the rule

Trigger: "Expected number of [events/matches/collisions/runs]" where counting directly seems hard.

  1. Define $I_j = \mathbf{1}[\text{event } j \text{ occurs}]$ for each possible event.
  2. Compute $P(I_j = 1)$ — usually a simple symmetry or independence argument.
  3. Apply $E[\sum_j I_j] = \sum_j P(I_j=1)$ — no dependence check needed.
  4. Give the final sum; note that you leveraged linearity, which works even under dependence.

Never: try to compute the full joint distribution of all indicators. You don't need it.

Classic 3A — Hat-check problem (expected fixed points)

30-second answer: The expected number of people who get their own hat back is 1, for any $n \ge 1$.

Setup. $n$ people check their hats; a careless attendant returns them in a uniformly random order. What is the expected number of people who receive their own hat?

Indicator argument. Let $I_i = 1$ if person $i$ gets their own hat. Then $P(I_i = 1) = 1/n$ by symmetry (person $i$'s hat is equally likely to be in any of the $n$ positions). By linearity:

$$E\!\left[\sum_{i=1}^n I_i\right] = \sum_{i=1}^n \frac{1}{n} = 1$$
Sum of $n$ indicators each with probability $1/n$; the $n$'s cancel, leaving exactly 1 regardless of $n$.

Bonus — the derangement connection. The probability that nobody gets their hat ($P(\text{derangement})$) converges to $1/e \approx 0.368$ as $n \to \infty$. The expected count is 1 even though most of the time the actual count is 0 or 1 — a striking illustration of how expectation and typical value can diverge.

Classic 3B — Expected number of runs

30-second answer: In a sequence of $n$ fair coin flips, the expected number of "runs" (maximal blocks of identical values) is $\frac{n+1}{2}$.

Setup. A run is a maximal consecutive block of the same symbol. HHTTHH has 3 runs. What is the expected number of runs in $n$ i.i.d. fair flips?

Indicator argument. The first flip always starts a run. For each subsequent flip $i = 2, \ldots, n$: $I_i = 1$ if flip $i$ differs from flip $i-1$, i.e., it starts a new run. $P(I_i=1) = 1/2$. So:

$$E[\text{runs}] = 1 + \sum_{i=2}^{n} \frac{1}{2} = 1 + \frac{n-1}{2} = \frac{n+1}{2}$$
First flip always starts 1 run; each subsequent flip independently starts a new run with probability 1/2.

Sanity check. $n=1$: 1 run. $(1+1)/2 = 1$. ✓ $n=2$: either HH, TT (1 run, prob 1/2) or HT, TH (2 runs, prob 1/2), so E = 3/2 = (2+1)/2. ✓

Classic 3C — Coupon collector

30-second answer: To collect all $n$ distinct coupons (one per box, uniform), expected boxes needed is $n H_n = n(1 + 1/2 + 1/3 + \cdots + 1/n) \approx n \ln n$.

Indicator argument (waiting times). Let $T_k$ = additional boxes needed to go from $k-1$ distinct coupons to $k$ distinct. When you have $k-1$ coupons, probability the next box is new is $(n-k+1)/n$. So $T_k \sim \text{Geometric}((n-k+1)/n)$ and $E[T_k] = n/(n-k+1)$.

$$E[T] = \sum_{k=1}^{n} \frac{n}{n-k+1} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n$$
$H_n = 1 + 1/2 + \cdots + 1/n$ is the $n$-th harmonic number; for large $n$, $H_n \approx \ln n + 0.577$.

ML cameo. Coupon collector gives the expected number of samples before you see every class at least once — relevant for class-balanced sampling and estimating coverage in large categorical spaces.

Pattern 4 — Complement and "at-least-one"

$P(\text{at least one}) = 1 - P(\text{none})$. The complement is almost always easier to compute when events are independent (none = product of individual probabilities). The birthday problem is the canonical example.

📐 Pattern 4 — the rule

Trigger: "What's the probability that at least one [collision/match/event] occurs?" — especially with independent trials.

  1. Compute P(no event at all) — usually a clean product over independent trials.
  2. Subtract from 1.
  3. If events are not independent, fall back to inclusion-exclusion (but complement of intersection works for many cases).

Never: enumerate all the ways at least one event can happen. The complement is the direct route.

Classic 4A — Birthday problem

30-second answer: With 23 people, P(shared birthday) exceeds 50%. With 57 people it exceeds 99%.

Why the number surprises. It's not about finding someone who shares YOUR birthday (that needs ~183 people). It's about any pair, and pairs grow as $\binom{n}{2}$.

Exact calculation. Assume 365 equally likely birthdays. P(no shared birthday among $n$ people):

$$P(\text{all distinct}) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{365-n+1}{365} = \prod_{k=0}^{n-1}\frac{365-k}{365}$$
Each person's birthday is chosen not to collide with any previous person's; probability decreases multiplicatively.

At $n=23$: this product $\approx 0.493$, so P(collision) $\approx 0.507 > 50\%$.

Quick approximation. Using $\ln(1-x) \approx -x$ for small $x$:

$$\ln P(\text{all distinct}) \approx -\frac{1}{365}\sum_{k=1}^{n-1}k = -\frac{n(n-1)}{2 \cdot 365}$$
Set this equal to $\ln(1/2) \approx -0.693$ to find $n \approx \sqrt{2 \cdot 365 \cdot 0.693} \approx 23$.

Interview follow-up. "How many people until P > 99%?" Same formula: $n \approx \sqrt{2 \cdot 365 \cdot \ln 100} \approx 57$.

Pattern 5 — Conditioning paradoxes

Some puzzles are designed to trick your intuition by hiding information. The cure is mechanical: write out the sample space explicitly, apply Bayes' rule, and be precise about exactly what information was revealed and how it was revealed. Intuition is optional; the calculation is not.

📐 Pattern 5 — the rule

Trigger: "Given that [surprising information], what's the probability of [outcome]?" — Monty Hall, boy-or-girl, envelope variants.

  1. Write the full prior sample space (list all scenarios).
  2. Apply the observation as a filter: cross out scenarios inconsistent with what was revealed.
  3. Re-normalise over the remaining scenarios.
  4. Ask: was the information selection-biased? Monty always opens a goat door — that selection changes the posterior. A random stranger who happens to have a son — different from "I have at least one son."

Never: reason from pure intuition on paradox problems. Always enumerate and apply Bayes.

Classic 5A — Monty Hall

30-second answer: Always switch. Switching wins with probability 2/3; staying wins with probability 1/3.

Setup. Three doors: one car, two goats. You pick door 1. Monty (who knows) opens a goat door (say door 3). Should you switch to door 2?

Posterior calculation.

Car behind door 1 (prob 1/3)
Monty opens door 2 or 3 at random. You stay → win. You switch → lose.
Car behind door 2 (prob 1/3)
Monty must open door 3. You stay → lose. You switch → win.
Car behind door 3 (prob 1/3)
Monty must open door 2. You stay → lose. You switch → win.

Monty opened door 3 in two of the three scenarios (with equal probability conditional on each). Given Monty opened door 3: P(car at door 2 | door 3 opened) = (1/3)/(1/2) = 2/3.

The 1000-door amplifier. Still unconvinced? Imagine 1000 doors: you pick door 1 (1/1000 chance of winning). Monty opens 998 goat doors, leaving door 473. Now switch? Of course — door 473 has prob 999/1000. Monty's knowledge concentrates all the "non-1" probability into the one remaining door.

Why intuition fails. The key is that Monty's choice is not random — he has information and acts on it. This selection-bias is exactly what Bayes' rule captures.

Classic 5B — Boy-or-girl (and why wording matters)

30-second answer: The answer depends critically on how you learned the information. Two phrasings, two answers.

Version 1: "I have two children. One is a boy. What's P(both boys)?"
Sample space of two children (equally likely): BB, BG, GB, GG. Ruling out GG: {BB, BG, GB}. P(BB | at least one boy) = 1/3.

Version 2: "I have two children. The older is a boy. What's P(both boys)?"
Sample space conditioning on the older child: (older=B, younger=B) or (older=B, younger=G). P(BB | older is boy) = 1/2.

The lesson. "At least one boy" is weaker information than "the older is a boy." The first reveals a property of the pair; the second reveals a property of a specific child. Specify exactly what is revealed and how — then calculate mechanically.

⚠ Clears up

Common error: treating "I have at least one boy" and "I randomly selected one child and they're a boy" as the same. They're not. If you randomly sample one child and reveal they are a boy, the posterior is 1/2 (same as Version 2). The selection mechanism changes the posterior.

Classic 5C — Two envelopes (one-liner)

30-second answer: There is no switching strategy that improves your expected payoff — the apparent "gain by switching" argument contains a hidden error.

The puzzle. Two envelopes: one has $\$X$, the other $\$2X$. You open one and see amount $m$. The other has either $2m$ or $m/2$. Seemingly: E[switch] = $\tfrac{1}{2}(2m) + \tfrac{1}{2}(m/2) = 1.25m > m$. Switch!

The error. The two cases cannot both be equally likely for every value of $m$. If $X$ has a prior distribution, P(other = 2m | saw m) ≠ 1/2 in general — it depends on the prior over $X$. For a proper (integrable) prior on $X$, the gain from switching integrates to zero. The paradox exists because you're applying an improper uniform prior over all positive reals, which isn't a valid probability distribution.

Pattern 6 — Geometric and continuous probability

When the sample space is a geometric region (line segment, square, circle), probability = (favorable area)/(total area). Draw the region, shade the favorable set, compute the ratio.

📐 Pattern 6 — the rule

Trigger: "Two points/times chosen uniformly on [0,1] (or [0,L])…" — meeting problems, stick-breaking, triangle inequalities.

  1. Draw the unit square (or appropriate region) representing all $(x, y)$ pairs.
  2. Shade the favorable region: express the constraint as an inequality in $x$ and $y$.
  3. Compute shaded area / total area.

Never: try to compute this with discrete approximations or integrals if geometry gives the answer directly.

Classic 6A — Stick-breaking triangle

30-second answer: Break a stick at two random points; probability the three pieces form a triangle is 1/4.

Setup. Choose two break points $X, Y$ uniformly on $[0,1]$. WLOG assume $X < Y$; the three pieces have lengths $X$, $Y-X$, $1-Y$.

Triangle inequality conditions. Three lengths form a triangle iff each is less than 1/2 (since the longest side must be less than the sum of the other two, which equals 1 minus the longest). So all three conditions reduce to: $X < 1/2$, $Y - X < 1/2$, $1 - Y < 1/2$.

Unit square with $(X, Y)$ pairs: the triangle-forming region is a small triangle of area 1/4 in the center.

Area calculation. In the full unit square (not just $X < Y$), by symmetry: P(triangle) = 1/4. The favorable region is a triangle with vertices at $(0, 1/2)$, $(1/2, 1/2)$, $(1/2, 1)$, which has area $= \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 1/8$ of the upper triangle (where $Y > X$), doubling for the full square gives 1/4.

Classic 6B — The meeting problem

30-second answer: Two people each arrive uniformly in $[0, 60]$ minutes and wait 15 minutes. P(they meet) = 7/16.

Setup. Let $X$ and $Y$ be arrival times, uniform on $[0, 60]$. They meet if $|X - Y| \le 15$.

Geometric solution. Total area = $60^2 = 3600$. The non-meeting region consists of two triangles where $X - Y > 15$ or $Y - X > 15$, each with legs $60 - 15 = 45$. Non-meeting area = $2 \cdot \tfrac{1}{2} \cdot 45^2 = 45^2 = 2025$. P(meet) = $(3600 - 2025)/3600 = 1575/3600 = 7/16$.

$$P(|X-Y| \le t) = 1 - \left(\frac{T-t}{T}\right)^2 \quad \text{for uniform}[0,T]$$
$T$ = total window; $t$ = wait time; the formula gives the probability of meeting; the non-meeting region is two right triangles.
Pattern 7 — Random walks and martingale flavor

A martingale is a process whose expected future value, given the present, equals the present value. This is an enormously powerful tool: if you can identify a martingale, optional stopping gives you the expected stopping time or probability for free — without solving a recursion explicitly.

The drunkard's walk. A drunkard starts at position $k$ on a number line, steps right (+1) or left (−1) with equal probability. Absorbing barriers at 0 and $N$. The position itself is a martingale.

📐 Pattern 7 — the rule

Trigger: "Expected steps/time until absorption" or "probability of reaching A before B" in a symmetric random walk.

  1. Identify the martingale: for a fair walk, $X_t$ (position) is a martingale. For "expected steps," $X_t^2 - t$ is also a martingale.
  2. Apply optional stopping: $E[X_\tau] = E[X_0]$ where $\tau$ is the stopping time.
  3. Express $E[X_\tau]$ in terms of boundary probabilities (P(hit $N$) vs P(hit $0$)) and solve.
  4. For expected steps, use the second martingale $X_t^2 - t$.

Never: try to compute the distribution of $\tau$ directly for the expected value — the martingale trick bypasses it entirely.

Classic 7A — 1D drunkard: probability and expected steps

30-second answer: Starting at $k$, P(reach $N$ before 0) = $k/N$. Expected steps to absorption = $k(N-k)$.

Probability via martingale. $X_t$ (position) is a martingale (fair walk). By optional stopping: $E[X_\tau] = X_0 = k$. At $\tau$, $X_\tau = N$ (prob $p$) or $X_\tau = 0$ (prob $1-p$). So $Np + 0 = k \Rightarrow p = k/N$.

Expected steps via the second martingale. $M_t = X_t^2 - t$ is a martingale (each step: $E[X_{t+1}^2] = \tfrac{1}{2}(X_t+1)^2 + \tfrac{1}{2}(X_t-1)^2 = X_t^2 + 1$, so $E[M_{t+1}] = M_t$). At stopping time $\tau$: $E[X_\tau^2 - \tau] = X_0^2 - 0 = k^2$. But $E[X_\tau^2] = N^2 \cdot (k/N) + 0 \cdot (1-k/N) = kN$. Therefore $E[\tau] = kN - k^2 = k(N-k)$.

$$E[\tau] = k(N-k), \quad P(\text{reach } N) = \frac{k}{N}$$
Starting at $k$ with absorbers at $0$ and $N$ in a fair walk; $\tau$ = absorption time; both results follow from two martingales.

Sanity check. $k = N/2$ (starting in the middle): P(win) = 1/2 and E[$\tau$] = $(N/2)^2 = N^2/4$ — maximum expected duration at the symmetric starting point. ✓

Classic 7B — Asymmetric walk and exponential martingale

30-second answer: Biased walk with $p \ne 1/2$; P(ruin from $k$) = $[(q/p)^k - (q/p)^N]/[1 - (q/p)^N]$ where $q = 1-p$.

Martingale for biased walk. For $p \ne 1/2$, position $X_t$ is no longer a martingale (it has drift). Instead, $M_t = (q/p)^{X_t}$ is a martingale (verify: $E[M_{t+1}|X_t] = p(q/p)^{X_t+1} + q(q/p)^{X_t-1} = (q/p)^{X_t}[p(q/p) + q(p/q)] = (q/p)^{X_t}$). By optional stopping:

$$(q/p)^k = P(\text{reach }N) \cdot (q/p)^N + P(\text{ruin}) \cdot (q/p)^0$$
$p$ = probability of stepping right; $q = 1-p$; $(q/p)^{X_t}$ is the exponential martingale for biased walks.

Solving with $P(\text{reach }N) + P(\text{ruin}) = 1$ gives P(ruin from $k$) = $\frac{(q/p)^k - (q/p)^N}{1 - (q/p)^N}$. When $p < 1/2$ and $N \to \infty$: P(ruin) → 1 (house always wins).

The 7-pattern cheat sheet
PatternSignal phraseToolKey insight
1 · Symmetry"randomly arranged," "uniform"P = 1/N or 1/2All positions interchangeable
2 · First step"expected flips until," "first time"State machine + recurrenceOverlap = longer wait
3 · Indicators"expected number of"$\sum P(I_j=1)$Works under dependence
4 · Complement"at least one," "any pair"1 – P(none)P(none) = clean product
5 · Conditioning paradox"given that," "Monty," "boy-girl"Enumerate + BayesAsk: how was info selected?
6 · Geometric"uniform on [0,1]," "arrive randomly"Favorable area / total areaDraw the square, shade, compute
7 · Martingale"random walk," "absorption," "ruin"Optional stopping theoremFind the right martingale
✓ Remember
  • Name the pattern in the first 10 seconds; the answer structure follows from the pattern.
  • Linearity of expectation holds even under dependence — the indicator method always works for "expected count" questions.
  • HH takes 6 flips; HT takes 4 flips — the overlap asymmetry is the favourite follow-up question.
  • Monty Hall: the host's knowledge and non-random selection is the entire mechanism — model it with Bayes.
  • Geometric probability: draw the square, shade the favorable region, divide areas.
TL;DR

Seven patterns cover essentially all classic probability puzzles. Symmetry and indicators are the fastest weapons; first-step recursion handles sequential processes; complement sidesteps "at-least-one" counting; Bayes is the antidote to conditioning paradoxes; area ratios solve continuous uniform problems; martingales give absorption probabilities and expected durations without heavy computation. Classify, state the 30-second answer, then derive cleanly.

Tricky interview questions — chapter 15
Q1. A fair coin is flipped until HH appears. A second coin is flipped until HT appears. Which finishes first on average, and by how much?
HT finishes first: E[HT] = 4 flips vs E[HH] = 6 flips — HT is faster by 2 flips on average. The reason is the overlap structure: after seeing H then T, the process is done (no overlap). After seeing H then T in the HH game, you're fully back to start. In the HT game, after seeing H then H, you're still in state "last was H" — that H counts toward the next attempt. The overlap in HH resets further; the overlap in HT helps.
Q2. You pick one of three doors. Monty opens a door you didn't pick that has a goat. What if Monty doesn't know where the car is and opens a random door — should you still switch?
If Monty opens randomly and happens to reveal a goat, the posterior probability is 1/2 for each remaining door — switching doesn't help. The key difference: if Monty knows, his action is selection-biased and concentrates probability on the other door (2/3). If Monty picks randomly, his revealing a goat is purely informational and symmetric — it updates both remaining doors equally. Always ask: was the information generated by a process that depends on the truth?
Q3. 100 people in a circle; each person simultaneously picks a random neighbor. Expected number of pairs that pick each other?
Use indicators. For each adjacent pair $(i, j)$: let $I_{ij} = 1$ if they pick each other. P($i$ picks $j$ AND $j$ picks $i$) = $(1/2)(1/2) = 1/4$ (each independently picks from 2 neighbors). There are 100 adjacent pairs. E[mutual picks] = $100 \times 1/4 = 25$. Linearity of expectation under dependence — adjacent pairs are not independent, but we don't need them to be.
Q4. A deck of 52 cards is shuffled. What is the expected number of cards you flip before seeing the first spade?
By symmetry, consider the 13 spades and 39 non-spades. There are 14 "gaps" formed by 13 spades (before first, between each pair, after last). By symmetry each non-spade is equally likely to fall in any of the 14 gaps, so expected non-spades before the first spade = 39/14 = 2.786. Alternatively: among any non-spade and the 13 spades (14 cards total), P(non-spade comes first among them all) = 1/14, so sum over 39 non-spades = 39/14.
Q5. Two points are chosen uniformly on a circle of circumference 1. What is the expected arc length of the shorter arc?
Let the arc length between the two points (going one way) be $U$, uniform on $[0,1]$. The shorter arc has length $\min(U, 1-U)$. By symmetry, E[min(U, 1-U)] = E[U | U < 1/2] × P(U < 1/2) + E[1-U | U > 1/2] × P(U > 1/2). Both terms equal $\int_0^{1/2} u \, du = 1/8$, so E[shorter arc] = 1/4. The expected shorter arc is always 1/4 of the total circumference.
Q6. You roll a fair die repeatedly. What is the expected number of rolls to see all 6 faces?
Coupon collector with n=6. E[T] = 6(1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6) = 6 × 49/20 = 6 × 2.45 = 14.7 rolls. Each phase: you have k distinct faces; expected rolls to get a new one = 6/(6-k). Sum from k=0 to k=5: 6/6 + 6/5 + 6/4 + 6/3 + 6/2 + 6/1 = 1 + 1.2 + 1.5 + 2 + 3 + 6 = 14.7.
Q7. A drunkard starts at position 3 on a line with absorbers at 0 and 10. What is the expected number of steps until absorption?
For fair random walk with absorbers at 0 and N, starting at k: E[steps] = k(N-k). Here k=3, N=10: E[steps] = 3 × 7 = 21. This follows from the martingale argument: $X_t^2 - t$ is a martingale, and optional stopping gives E[T] = E[X_T²] - X_0² = kN - k² = k(N-k).
Q8. From a shuffled deck, you draw cards one at a time. What is the probability that the 4th card drawn is the first ace?
P(first ace is at position 4) = P(first 3 are non-aces) × P(4th is ace given first 3 were non-aces) = (48/52)(47/51)(46/50)(4/49). Compute: 48×47×46×4 / (52×51×50×49) = (48×47×46×4)/(52×51×50×49). Numerically: 48×47 = 2256, ×46 = 103776, ×4 = 415104; denominator: 52×51=2652, ×50=132600, ×49=6497400. Result ≈ 0.0639 or about 6.4%.
Q9. n people sit in a row at random. Two specific people, Alice and Bob, are in the group. What is the probability that they sit adjacent?
Think of Alice and Bob as a unit: there are (n-1) ways to place the "AB block" and 2 ways to arrange them within the block. Total arrangements = n! . Favorable = 2(n-1)!. Probability = 2(n-1)!/n! = 2/n. Equivalently: from Alice's perspective, Bob is equally likely to be in any of the n-1 other seats; exactly 2 of them (immediately left or right) are adjacent. P = 2/(n-1). Wait — careful: 2/(n-1) since there are n-1 other people. For n=2: P=1, which checks out. The answer is 2/(n-1).
Q10. You repeatedly flip a fair coin. What is the probability you see HHH before you see THH?
This is a pattern-vs-pattern race. Use the Conway leading number trick or a Markov chain. The states are prefixes of both targets. By careful analysis (or lookup): P(HHH before THH) = 1/8. Key insight: THH is "easier" to reach from many states. In particular, once you see T, HHH becomes impossible until the next fresh start, but THH becomes reachable in 2 steps. The martingale/Markov analysis gives P(HHH first) = 1/8 (i.e., HHH loses 7:1 against THH).
Q11. The two-child problem: I have two children and at least one is a boy born on Tuesday. What is P(both boys)?
This is a famous extension. Enumerate (child 1 sex, child 1 day-of-week, child 2 sex, child 2 day-of-week). Equally likely outcomes: 4 × 7 × 4 × 7 = 196 excluding the ordering. P(at least one boy-Tuesday) sample space: 13 + 13 - 1 = 27 outcomes (boy-Tuesday paired with any (sex, day), plus any (sex,day) paired with boy-Tuesday, minus double-counted boy-Tuesday/boy-Tuesday). Of those 27, boy-boy outcomes: 13 pairs where at least one is boy-Tuesday; non-boy-boy outcomes: 14. So P(both boys | at least one boy-Tuesday) = 13/27 ≈ 0.481, which is between 1/3 (at-least-one-boy) and 1/2 (specific child is boy). The day-of-week information shifts the answer!
Q12. You and a friend each choose a random integer from 1 to N. What is the expected number of rounds until you choose the same number?
Each round, P(match) = 1/N (you need your friend's number, one of N choices, to equal yours). Rounds until match ~ Geometric(1/N). Expected rounds = N. For N=6 (both rolling a die): expected 6 rounds to match. This is exactly the coupon-collector phase of length 1: each round is an independent trial with success probability 1/N.
16
PART IV · INTERVIEW CRAFT

Statistics for ML + Rapid Fire

🎯Every loss function is a disguised likelihood; every regularizer is a disguised prior — once you see the map, ML and statistics are one subject.

This capstone chapter connects every statistical concept from the page to real ML practice, then stress-tests your recall with 25 rapid-fire Q→A pairs. We trace the loss ↔ likelihood correspondence, explain regularization as Bayesian priors, tie bias-variance to estimator theory, and close with the sampling pathologies that silently corrupt models. Finish here and you can speak fluent statistics in any ML design interview.

§1 · The loss ↔ likelihood map

Every supervised loss function is the negative log-likelihood of some noise model. Minimising loss = maximising likelihood. The table below is the Rosetta Stone; memorise it.

$$\mathcal{L}(\theta) = -\log p(\mathbf{y}\mid\mathbf{x},\theta)$$
Minimising loss $\mathcal{L}$ over parameters $\theta$ is equivalent to MLE under the noise distribution $p$.
Loss functionNoise / likelihood modelCanonical ML taskKey property
Squared error $\tfrac{1}{2}(y-\hat y)^2$ Gaussian $y\mid x \sim \mathcal{N}(\hat y,\sigma^2)$ Linear/neural regression Penalises large errors quadratically; non-robust to outliers
Binary cross-entropy $-[y\log\hat p+(1-y)\log(1-\hat p)]$ Bernoulli $y\mid x \sim \text{Bern}(\hat p)$ Logistic regression, binary classif. Log-scale gradient; never saturates on wrong side with sigmoid
Categorical cross-entropy $-\sum_k y_k \log\hat p_k$ Categorical $\mathbf{y}\mid x \sim \text{Cat}(\hat{\mathbf{p}})$ Softmax classifiers, language models Equivalent to KL divergence from one-hot to predicted distribution
MAE / L1 loss $|y-\hat y|$ Laplace $y\mid x \sim \text{Laplace}(\hat y,b)$ Robust regression Heavier tails → robust to outliers; non-smooth at 0 (subgradients needed)
Huber loss (quadratic near 0, linear far) Huber / pseudo-Huber distribution Object detection, robust regression Best of both: smooth + outlier-resistant; threshold $\delta$ is a hyperparameter
KL divergence $D_{\text{KL}}(p\|q)$ Exponential family likelihood VAE ELBO, knowledge distillation Asymmetric; $D_{\text{KL}}(p\|q)\neq D_{\text{KL}}(q\|p)$; zero-forcing vs mass-covering
⚠ Clears up — "Why not always use MAE if it's robust?"

MAE is not differentiable at zero, so gradient-based optimisers need subgradients and converge more slowly. More importantly, the Laplace assumption has heavier tails than Gaussian, so outlier gradients are clipped linearly, which is good for robustness but bad for precision on well-behaved data. Choose based on the noise model, not just robustness desire.

§2 · Regularization ↔ priors

Adding a regulariser to the loss is identical to adding a prior and doing MAP estimation instead of MLE. The prior encodes your belief about parameter scale before seeing data.

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \underbrace{\log p(\mathbf{y}\mid\theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}}$$
MAP adds a log-prior penalty to the log-likelihood. Maximising MAP = minimising (NLL − log prior) = minimising (loss + regulariser).
RegularizerEquivalent prior on $\theta$Effect on weightsSparsity?
L2 / Ridge $\lambda\|\theta\|_2^2$ Gaussian $\theta_j \sim \mathcal{N}(0,\,1/2\lambda)$ Shrinks weights toward 0; all stay non-zero No
L1 / Lasso $\lambda\|\theta\|_1$ Laplace $\theta_j \sim \text{Laplace}(0,\,1/\lambda)$ Shrinks and zeroes out weights Yes — exact zeros
ElasticNet $\lambda_1\|\theta\|_1+\lambda_2\|\theta\|_2^2$ Mixture of Laplace + Gaussian Sparse but handles correlated features better than pure L1 Yes — partial
Dropout (rate $p$) Approximate Bayesian posterior (Gal & Ghahramani 2016) Ensemble-like; acts as adaptive L2 No

Worked intuition. Ridge: the Gaussian prior says "weights near zero are exponentially more likely than large weights." So maximising log-prior = $-\lambda\|\theta\|^2$, which is the same as subtracting L2 penalty from the log-likelihood. As $n\to\infty$, the likelihood dominates and MAP → MLE; as $n\to 0$, MAP collapses to the prior mean (zero).

◆ Interview probe

"You switch from L2 to L1 regularisation and training loss barely changes but the number of non-zero features drops by 80%. What happened and is that good?" — Answer: L1 MAP under a Laplace prior has a non-smooth penalty that corners at zero, creating sparse exact-zero solutions. Whether it's good depends on whether interpretability/inference cost matters; L1 can be unstable with correlated features.

§3 · Bias-variance ↔ estimator theory

The bias-variance tradeoff in ML is exactly the bias-variance decomposition of a statistical estimator. The MSE of any estimator $\hat\theta$ decomposes as:

$$\text{MSE}(\hat\theta) = \underbrace{\text{Bias}(\hat\theta)^2}_{\text{systematic error}} + \underbrace{\text{Var}(\hat\theta)}_{\text{random error}}$$
$\text{Bias}=\mathbb{E}[\hat\theta]-\theta^*$; $\text{Var}=\mathbb{E}[(\hat\theta-\mathbb{E}[\hat\theta])^2]$. You can't reduce both to zero simultaneously by changing model complexity alone.
High bias (underfitting)
Model is too simple to capture true pattern; $\mathbb{E}[\hat\theta]\neq\theta^*$. Example: linear model on quadratic data.
High variance (overfitting)
Model memorises training noise; predictions vary wildly with new data. Example: depth-50 decision tree on 100 samples.
Regularisation effect
Adds bias deliberately to shrink variance; MAP estimators are biased but lower-MSE than MLE at small $n$.
Ensemble effect
Averaging $B$ independent models: bias unchanged, variance divided by $B$. Bagging exploits this.
Boosting effect
Sequentially reduces bias; can increase variance (mitigated by shrinkage/learning rate).
⚠ Clears up — "Does more data always help?"

More data reduces variance (estimators concentrate), but does not reduce bias. If your model is misspecified (wrong functional form), no amount of data fixes systematic error. "More data cures variance; better model cures bias."