{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mini-project: calibration lab + A/B testing simulator\n",
    "**Goal:** Part A \u2014 build reliability diagrams and ECE from scratch, then FIX a miscalibrated\n",
    "classifier with Platt scaling. Part B \u2014 simulate A/B tests and SEE peeking inflate false\n",
    "positives, then watch CUPED shrink confidence intervals.\n",
    "**Concepts:** calibration, ECE, Platt, sequential-testing sins, variance reduction. **Time:** ~3h."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(7)\n",
    "np.seterr(all=\"ignore\")\n",
    "\n",
    "# Miscalibrated-by-construction classifier: true P(y=1|x) = sigmoid(z),\n",
    "# but the model reports sigmoid(2.2 * z) \u2014 systematically overconfident.\n",
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-z))\n",
    "\n",
    "N = 20000\n",
    "Z_TRUE = rng.normal(0, 1.2, N)\n",
    "Y = (rng.random(N) < sigmoid(Z_TRUE)).astype(int)\n",
    "SCORES = sigmoid(2.2 * Z_TRUE)            # what the \"model\" outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage A1 \u2014 reliability diagram + ECE\n",
    "Bin predictions into 10 equal-width bins; per bin compute mean predicted probability and\n",
    "observed positive rate. Perfect calibration: the two match. ECE = sum over bins of\n",
    "(bin_count/N) * |mean_pred - observed_rate|."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def reliability_table(scores, y, num_bins=10):\n",
    "    \"\"\"Return list of (bin_count, mean_pred, observed_rate) per non-empty bin.\"\"\"\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def ece(scores, y, num_bins=10):\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def check_stageA1():\n",
    "    value = ece(SCORES, Y)\n",
    "    print(f\"ECE of the overconfident model: {value:.3f}\")\n",
    "    assert value > 0.05, \"this model is broken by construction; ECE should show it\"\n",
    "# check_stageA1()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage A2 \u2014 Platt scaling by gradient descent\n",
    "Fit p_cal = sigmoid(a * logit(score) + b) on a held-out half by minimizing log loss with\n",
    "plain gradient descent (200 steps, lr 0.1 is plenty). Evaluate ECE on the other half:\n",
    "it should collapse. (Isotonic regression is the non-parametric alternative \u2014 stretch goal.)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def platt_fit(scores, y, steps=200, lr=0.1):\n",
    "    \"\"\"Return (a, b).\"\"\"\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def check_stageA2():\n",
    "    half = N // 2\n",
    "    a, b = platt_fit(SCORES[:half], Y[:half])\n",
    "    logit = np.log(SCORES[half:] / (1 - SCORES[half:]))\n",
    "    calibrated = sigmoid(a * logit + b)\n",
    "    before, after = ece(SCORES[half:], Y[half:]), ece(calibrated, Y[half:])\n",
    "    print(f\"ECE before {before:.3f} -> after Platt {after:.3f}\")\n",
    "    assert after < before / 2\n",
    "# check_stageA2()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage B1 \u2014 the peeking simulator\n",
    "Two arms with IDENTICAL conversion rate 5%. Simulate 1,000 experiments of 20,000 users/arm.\n",
    "Policy A (honest): one z-test at the end, alpha=.05. Policy B (peeker): test after every\n",
    "1,000 users/arm and STOP at the first p<.05. Measure each policy's false-positive rate.\n",
    "Expected: honest ~5%, peeker ~25-40%. Seeing this once is worth ten lectures."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def z_test_two_proportions(conv_a, n_a, conv_b, n_b):\n",
    "    \"\"\"Return two-sided p-value.\"\"\"\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def false_positive_rate(peek_every=None, num_sims=1000, n_per_arm=20000, rate=0.05):\n",
    "    \"\"\"Fraction of null experiments declared 'significant'.\"\"\"\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def check_stageB1():\n",
    "    honest = false_positive_rate(peek_every=None, num_sims=400)\n",
    "    peeker = false_positive_rate(peek_every=1000, num_sims=400)\n",
    "    print(f\"false-positive rate \u2014 honest: {honest:.3f}, peeker: {peeker:.3f}\")\n",
    "    assert peeker > honest * 2\n",
    "# check_stageB1()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage B2 \u2014 CUPED\n",
    "Users have a pre-experiment covariate correlated with their experiment metric (heavy users\n",
    "stay heavy). Simulate y = 0.6 * x_pre + noise (+ tiny true effect in treatment).\n",
    "CUPED: y_adj = y - theta * (x_pre - mean(x_pre)), theta = cov(y, x_pre) / var(x_pre).\n",
    "Compare the CI width of the treatment-effect estimate before/after: variance should drop\n",
    "by roughly corr^2 (~36% here, so CIs ~20% tighter, i.e. ~36% less sample needed)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def cuped_adjust(y, x_pre):\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def check_stageB2():\n",
    "    n = 20000\n",
    "    x_pre = rng.normal(10, 3, 2 * n)\n",
    "    effect = np.concatenate([np.zeros(n), np.full(n, 0.15)])\n",
    "    y = 0.6 * x_pre + effect + rng.normal(0, 2.4, 2 * n)\n",
    "    raw_se = np.sqrt(y[:n].var()/n + y[n:].var()/n)\n",
    "    y_adj = cuped_adjust(y, x_pre)\n",
    "    adj_se = np.sqrt(y_adj[:n].var()/n + y_adj[n:].var()/n)\n",
    "    print(f\"SE raw {raw_se:.4f} -> CUPED {adj_se:.4f} ({100*(1-adj_se/raw_se):.0f}% tighter)\")\n",
    "    assert adj_se < raw_se * 0.92\n",
    "# check_stageB2()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Stretch:** isotonic-lite calibration (pool adjacent violators); a group-sequential\n",
    "boundary (alpha-spending) that makes peeking legal; simulate Simpson's paradox with a\n",
    "mix-shifted segment table."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# SOLUTIONS \u2014 no peeking (ironic, for this notebook) until your attempt"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def solution_reliability_table(scores, y, num_bins=10):\n",
    "    rows = []\n",
    "    edges = np.linspace(0, 1, num_bins + 1)\n",
    "    for lo, hi in zip(edges[:-1], edges[1:]):\n",
    "        in_bin = (scores >= lo) & (scores < hi) if hi < 1 else (scores >= lo) & (scores <= hi)\n",
    "        count = int(in_bin.sum())\n",
    "        if count:\n",
    "            rows.append((count, float(scores[in_bin].mean()), float(y[in_bin].mean())))\n",
    "    return rows\n",
    "\n",
    "def solution_ece(scores, y, num_bins=10):\n",
    "    total = len(scores)\n",
    "    return sum(c / total * abs(p - o) for c, p, o in solution_reliability_table(scores, y, num_bins))\n",
    "\n",
    "def solution_platt_fit(scores, y, steps=200, lr=0.1):\n",
    "    logit = np.log(scores / (1 - scores))\n",
    "    a, b = 1.0, 0.0\n",
    "    for _ in range(steps):\n",
    "        p = sigmoid(a * logit + b)\n",
    "        grad_a = ((p - y) * logit).mean()\n",
    "        grad_b = (p - y).mean()\n",
    "        a -= lr * grad_a\n",
    "        b -= lr * grad_b\n",
    "    return a, b\n",
    "\n",
    "def solution_z_test_two_proportions(conv_a, n_a, conv_b, n_b):\n",
    "    from math import erf, sqrt\n",
    "    p_a, p_b = conv_a / n_a, conv_b / n_b\n",
    "    pooled = (conv_a + conv_b) / (n_a + n_b)\n",
    "    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))\n",
    "    if se == 0:\n",
    "        return 1.0\n",
    "    z = (p_a - p_b) / se\n",
    "    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))\n",
    "\n",
    "def solution_false_positive_rate(peek_every=None, num_sims=1000, n_per_arm=20000, rate=0.05):\n",
    "    hits = 0\n",
    "    for _ in range(num_sims):\n",
    "        a = rng.random(n_per_arm) < rate\n",
    "        b = rng.random(n_per_arm) < rate\n",
    "        if peek_every is None:\n",
    "            p = solution_z_test_two_proportions(a.sum(), n_per_arm, b.sum(), n_per_arm)\n",
    "            hits += p < 0.05\n",
    "        else:\n",
    "            for n in range(peek_every, n_per_arm + 1, peek_every):\n",
    "                p = solution_z_test_two_proportions(a[:n].sum(), n, b[:n].sum(), n)\n",
    "                if p < 0.05:\n",
    "                    hits += 1\n",
    "                    break\n",
    "    return hits / num_sims\n",
    "\n",
    "def solution_cuped_adjust(y, x_pre):\n",
    "    theta = np.cov(y, x_pre)[0, 1] / x_pre.var()\n",
    "    return y - theta * (x_pre - x_pre.mean())"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "if __name__ == \"__main__\":\n",
    "    print(f\"ECE before: {solution_ece(SCORES, Y):.3f}\")\n",
    "    a, b = solution_platt_fit(SCORES[:N//2], Y[:N//2])\n",
    "    logit = np.log(SCORES[N//2:] / (1 - SCORES[N//2:]))\n",
    "    print(f\"ECE after Platt: {solution_ece(sigmoid(a*logit+b), Y[N//2:]):.3f}\")\n",
    "    honest = solution_false_positive_rate(None, num_sims=300)\n",
    "    peeker = solution_false_positive_rate(1000, num_sims=300)\n",
    "    print(f\"false positives \u2014 honest {honest:.3f} vs peeker {peeker:.3f}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}