{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mini-project: production debugging \u2014 you're on call\n",
    "**Goal:** 14 days of service metrics hide TWO planted incidents. Plot, build detectors,\n",
    "localize each incident to its causal metric, write the postmortem. The asserts check\n",
    "whether you found the right day and the right root-cause metric.\n",
    "**Concepts:** monitoring layers, anomaly detection, causal localization. **Time:** ~2h.\n",
    "\n",
    "**Rules:** do NOT read the setup cell's internals until you're done (it contains the answers).\n",
    "Work from the data like you would from dashboards."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(99)\n",
    "np.seterr(all=\"ignore\")\n",
    "\n",
    "HOURS = 14 * 24\n",
    "t = np.arange(HOURS)\n",
    "\n",
    "def _diurnal(base, amplitude):\n",
    "    return base + amplitude * np.sin(2 * np.pi * (t % 24) / 24 - 1.2)\n",
    "\n",
    "def _make_metrics():\n",
    "    \"\"\"SPOILER CELL \u2014 generates the incidents. Skim the OUTPUT, not this code.\"\"\"\n",
    "    qps = _diurnal(1000, 300) + rng.normal(0, 25, HOURS)\n",
    "    cache_hit = np.clip(0.92 + rng.normal(0, 0.004, HOURS), 0, 1)\n",
    "    p50 = _diurnal(42, 4) + rng.normal(0, 1.5, HOURS)\n",
    "    p99 = _diurnal(95, 12) + rng.normal(0, 6, HOURS)\n",
    "    feature_nulls = np.clip(0.002 + rng.normal(0, 0.0005, HOURS), 0, 1)\n",
    "    score_mean = 0.31 + rng.normal(0, 0.004, HOURS)\n",
    "    ctr = 0.051 + rng.normal(0, 0.0012, HOURS)\n",
    "\n",
    "    # Incident 1: deploy at day 4, 14:00 -> cache hit collapses, p99 spikes (p50 mild).\n",
    "    DEPLOY = 4 * 24 + 14\n",
    "    cache_hit[DEPLOY:DEPLOY + 30] -= 0.25 * np.exp(-np.arange(30) / 12)\n",
    "    p99[DEPLOY:DEPLOY + 30] += 140 * np.exp(-np.arange(30) / 12)\n",
    "    p50[DEPLOY:DEPLOY + 30] += 6 * np.exp(-np.arange(30) / 12)\n",
    "\n",
    "    # Incident 2: upstream pipeline breaks day 9, 03:00 -> nulls spike, score drifts, CTR sags (lagged).\n",
    "    PIPE = 9 * 24 + 3\n",
    "    feature_nulls[PIPE:] += 0.04\n",
    "    score_mean[PIPE:] += 0.05\n",
    "    ctr[PIPE + 6:] -= 0.004          # users feel it ~6h later\n",
    "\n",
    "    return {\n",
    "        \"qps\": qps, \"cache_hit_rate\": cache_hit, \"p50_ms\": p50, \"p99_ms\": p99,\n",
    "        \"feature_null_rate\": feature_nulls, \"model_score_mean\": score_mean, \"ctr\": ctr,\n",
    "    }, DEPLOY, PIPE\n",
    "\n",
    "METRICS, _INC1_HOUR, _INC2_HOUR = _make_metrics()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 1 \u2014 look before you compute\n",
    "Print per-day means for each metric (14 rows). Which days look off, and in which metrics?\n",
    "Write your two hypotheses as comments BEFORE moving on \u2014 that's the discipline being trained."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def daily_means(metrics):\n",
    "    \"\"\"Return dict metric_name -> array of 14 daily means.\"\"\"\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def check_stage1():\n",
    "    d = daily_means(METRICS)\n",
    "    assert all(len(v) == 14 for v in d.values())\n",
    "    for name, vals in d.items():\n",
    "        print(f\"{name:>18}: \" + \" \".join(f\"{v:7.3f}\" for v in vals))\n",
    "# check_stage1()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 2 \u2014 a rolling z-score detector\n",
    "For each metric, compute the rolling mean and std over the previous `window` hours (no lookahead!)\n",
    "and flag hours where `|value - rolling_mean| > k * rolling_std`. Start with `window=48`, `k=5`, then tune.\n",
    "Return a dict mapping each metric name to its sorted list of flagged hours. Beware the diurnal cycle:\n",
    "why does a 48h window survive it while a 6h window drowns in false alarms? (Answer it for yourself.)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def detect_anomalies(metrics, window=48, k=5.0):\n",
    "    \"\"\"Return dict metric_name -> sorted list of flagged hour indices.\"\"\"\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def check_stage2():\n",
    "    flags = detect_anomalies(METRICS)\n",
    "    flagged_metrics = {m for m, hours in flags.items() if hours}\n",
    "    print({m: hours[:5] for m, hours in flags.items() if hours})\n",
    "    assert \"cache_hit_rate\" in flagged_metrics and \"feature_null_rate\" in flagged_metrics\n",
    "# check_stage2()"
   ]
  },
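  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 3 \u2014 localize the incidents\n",
    "Cluster the flags into incidents (a flag within 24h of the previous flag joins the same cluster),\n",
    "then for each incident pick the ROOT-CAUSE metric: the one whose anomaly starts EARLIEST in the\n",
    "cluster (causes precede symptoms). At hourly resolution a cause and its symptom can be flagged in\n",
    "the same hour, so make your tie-break deterministic; the reference solution breaks ties\n",
    "alphabetically by metric name. Produce `(start_hour, root_metric)` per incident, sorted by `start_hour`."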
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def localize_incidents(flags):\n",
    "    \"\"\"Return list of (start_hour, root_cause_metric), sorted by start_hour.\"\"\"\n",
    "    raise NotImplementedError  # TODO(you)\n",
    "\n",
    "def check_stage3():\n",
    "    incidents = localize_incidents(detect_anomalies(METRICS))\n",
    "    assert len(incidents) >= 2, \"there are two incidents\"\n",
    "    (h1, m1), (h2, m2) = incidents[0], incidents[-1]\n",
    "    assert abs(h1 - _INC1_HOUR) <= 3 and m1 == \"cache_hit_rate\", (h1, m1)\n",
    "    assert abs(h2 - _INC2_HOUR) <= 3 and m2 == \"feature_null_rate\", (h2, m2)\n",
    "    print(f\"incident 1 @h{h1} root={m1} | incident 2 @h{h2} root={m2}  \u2014 both correct\")\n",
    "# check_stage3()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 4 \u2014 write the postmortems (then compare)\n",
    "For each incident write 5 lines: impact, detection, root cause, fix, prevention.\n",
    "Model answers are below. Notice incident 2's shape: a SILENT failure. No errors anywhere, and\n",
    "CTR sagged hours after the cause. Which alert would have caught it in minutes?\n",
    "\n",
    "**Stretch:** add an SRM-style (sample ratio mismatch) check; build a causal graph\n",
    "(nulls -> score -> ctr) from lagged correlations; tune the alert budget (cost of a false\n",
    "page vs. a missed incident)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# SOLUTIONS \u2014 no peeking until your attempt"
   ]
  },
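  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def solution_daily_means(metrics):\n",
    "    # One row per day (14 days x 24 hourly samples); average across each row.\n",
    "    return {name: values.reshape(14, 24).mean(axis=1) for name, values in metrics.items()}\n",
    "\n",
    "def solution_detect_anomalies(metrics, window=48, k=5.0):\n",
    "    flags = {}\n",
    "    for name, values in metrics.items():\n",
    "        hits = []\n",
    "        # The first `window` hours have no full history, so they are never scored.\n",
    "        for hour in range(window, len(values)):\n",
    "            # Trailing window only: values[hour] itself is excluded, so no lookahead.\n",
    "            history = values[hour - window:hour]\n",
    "            mean, std = history.mean(), history.std()\n",
    "            if std > 0 and abs(values[hour] - mean) > k * std:\n",
    "                hits.append(hour)\n",
    "        flags[name] = hits\n",
    "    return flags\n",
    "\n",
    "def solution_localize_incidents(flags):\n",
    "    # Sorting (hour, metric) tuples orders by hour, then alphabetically by metric name,\n",
    "    # so metrics first flagged in the same hour are tie-broken alphabetically.\n",
    "    events = sorted((hour, metric) for metric, hours in flags.items() for hour in hours)\n",
    "    incidents, cluster = [], []\n",
    "    for hour, metric in events:\n",
    "        # A gap of more than 24h since the previous flag starts a new incident.\n",
    "        if cluster and hour - cluster[-1][0] > 24:\n",
    "            incidents.append(cluster)\n",
    "            cluster = []\n",
    "        cluster.append((hour, metric))\n",
    "    if cluster:\n",
    "        incidents.append(cluster)\n",
    "    # The first event of each cluster gives its start hour and presumed root cause.\n",
    "    return [(c[0][0], c[0][1]) for c in incidents]"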
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Model postmortem 1 (deploy/cache):** 14:00 deploy invalidated the feature cache;\n",
    "hit rate 92%\u219267%; misses fan out to the feature service; p99 more than doubled\n",
    "(+140 ms at peak) while p50 rose only ~6 ms (tail-shaped harm = dependency misses, not\n",
    "uniform slowdown). Fix: rollback + cache warmup in deploy pipeline. Prevention: p99 canary\n",
    "gate + warm-before-shift.\n",
    "\n",
    "**Model postmortem 2 (silent pipeline break):** upstream schema change at 03:00 made a\n",
    "parser default ~4% of a key feature to null (null rate 0.2%\u21924.2%); mean score shifted +0.05;\n",
    "CTR sagged ~6h later (5.1%\u21924.7%).\n",
    "No system errors anywhere: the data layer was the only place it was visible early.\n",
    "Fix: pin schema, backfill feature, retrain-window exclusion for the corrupted span.\n",
    "Prevention: null-rate alert per feature (would have paged at 03:10), schema contracts."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "if __name__ == \"__main__\":\n",
    "    flags = solution_detect_anomalies(METRICS)\n",
    "    incidents = solution_localize_incidents(flags)\n",
    "    for hour, metric in incidents:\n",
    "        print(f\"incident at day {hour//24} {hour%24:02d}:00 \u2014 first anomalous metric: {metric}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}