{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mini RAG Lab: Retrieval-Augmented Generation \u2014 Retrieval Only\n",
    "\n",
    "**Goal**: Build a retrieval pipeline from scratch using only NumPy.\n",
    "No LLM needed \u2014 the lab focuses entirely on the retrieval step that makes RAG work.\n",
    "\n",
    "**Concepts exercised**:\n",
    "- TF-IDF construction from scratch (term frequency, inverse document frequency)\n",
    "- Cosine similarity and top-k retrieval\n",
    "- Evaluation metrics: Recall@k and Mean Reciprocal Rank (MRR)\n",
    "- Latent Semantic Analysis via SVD of the TF-IDF matrix\n",
    "- Hybrid retrieval with Reciprocal Rank Fusion (RRF)\n",
    "- Chunking strategies: whole-doc vs sentence-level\n",
    "\n",
    "**Estimated time**: 2-3 hours | **Difficulty**: Intermediate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to work through this notebook\n",
    "\n",
    "1. Run the setup cell \u2014 it generates a synthetic corpus with known answer keys.\n",
    "2. Work through stages 1-6 in order. Each stage has a `# TODO(you):` skeleton.\n",
    "3. After implementing a stage, uncomment and run `check_stageN()` \u2014 all asserts must pass.\n",
    "4. Only peek at SOLUTIONS after a genuine attempt.\n",
    "5. The stretch goals at the bottom are optional but instructive."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import re\n",
    "from collections import Counter, defaultdict\n",
    "import matplotlib\n",
    "matplotlib.use('Agg')\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "\n",
    "# ---------------------------------------------------------------------------\n",
    "# Embedded mini-corpus (~40 short docs) with known Q->doc answer keys\n",
    "# ---------------------------------------------------------------------------\n",
    "RAW_DOCS = [\n",
    "    \"Machine learning models learn patterns from data without explicit programming.\",\n",
    "    \"Gradient descent optimizes model parameters by following the loss gradient downhill.\",\n",
    "    \"Neural networks stack layers of nonlinear transformations to learn complex features.\",\n",
    "    \"Backpropagation computes gradients through the network using the chain rule.\",\n",
    "    \"Overfitting occurs when a model memorizes training data and fails to generalize.\",\n",
    "    \"Regularization techniques like L1 and L2 penalize large weights to reduce overfitting.\",\n",
    "    \"Cross-validation splits data into folds to give unbiased model evaluation.\",\n",
    "    \"The bias-variance tradeoff describes the tension between underfitting and overfitting.\",\n",
    "    \"Decision trees split features recursively to partition data into pure leaf nodes.\",\n",
    "    \"Random forests aggregate many decision trees trained on bootstrap samples.\",\n",
    "    \"Boosting trains weak learners sequentially each correcting the previous errors.\",\n",
    "    \"Support vector machines find the maximum-margin hyperplane separating two classes.\",\n",
    "    \"The kernel trick maps inputs to high-dimensional space enabling nonlinear SVMs.\",\n",
    "    \"K-nearest neighbors classifies by majority vote among the k closest training points.\",\n",
    "    \"Naive Bayes applies Bayes theorem assuming feature independence given the class.\",\n",
    "    \"Logistic regression models the probability of a binary outcome using the sigmoid.\",\n",
    "    \"Softmax extends logistic regression to multi-class classification problems.\",\n",
    "    \"Attention mechanisms allow models to focus on relevant parts of the input sequence.\",\n",
    "    \"Transformers use self-attention across all token pairs replacing recurrence entirely.\",\n",
    "    \"BERT pretrains on masked language modeling then fine-tunes on downstream tasks.\",\n",
    "    \"GPT pretrains autoregressively predicting the next token in a text sequence.\",\n",
    "    \"Word embeddings map tokens to dense vectors capturing semantic relationships.\",\n",
    "    \"Cosine similarity measures the angle between two vectors ignoring their magnitude.\",\n",
    "    \"TF-IDF weights terms by frequency in a document and rarity across the corpus.\",\n",
    "    \"Retrieval-augmented generation fetches relevant docs before generating an answer.\",\n",
    "    \"Vector databases store embeddings and support fast approximate nearest neighbor search.\",\n",
    "    \"HNSW builds a hierarchical graph enabling logarithmic-time approximate search.\",\n",
    "    \"Quantization compresses model weights to lower precision reducing memory and latency.\",\n",
    "    \"Distillation trains a small student model to mimic a large teacher model.\",\n",
    "    \"Reinforcement learning from human feedback fine-tunes LLMs using preference data.\",\n",
    "    \"Reward models score outputs and provide signal for policy optimization in RLHF.\",\n",
    "    \"A/B testing compares two variants by randomly assigning users and measuring outcomes.\",\n",
    "    \"p-values measure how often a result as extreme as observed would appear by chance.\",\n",
    "    \"Confidence intervals give a plausible range for a parameter given the data.\",\n",
    "    \"The central limit theorem guarantees sample means are normally distributed at scale.\",\n",
    "    \"Causal inference distinguishes correlation from causation using do-calculus or RCTs.\",\n",
    "    \"Feature engineering transforms raw inputs into representations useful for models.\",\n",
    "    \"Data augmentation artificially expands training data by applying label-preserving transforms.\",\n",
    "    \"Batch normalization normalizes layer inputs per batch accelerating deep network training.\",\n",
    "    \"Dropout randomly zeroes activations during training acting as an ensemble method.\",\n",
    "]\n",
    "\n",
    "QUERIES = [\n",
    "    \"how do neural networks learn from data\",\n",
    "    \"what prevents models from overfitting\",\n",
    "    \"how does attention work in transformers\",\n",
    "    \"what is TF-IDF and cosine similarity\",\n",
    "    \"how does retrieval augmented generation work\",\n",
    "    \"explain gradient descent optimization\",\n",
    "    \"what are random forests and boosting\",\n",
    "    \"how does RLHF fine-tune language models\",\n",
    "]\n",
    "\n",
    "# Ground-truth: for each query, which doc indices are relevant (1 or 2 per query)\n",
    "GROUND_TRUTH = {\n",
    "    0: [2, 3],   # neural networks, backprop\n",
    "    1: [4, 5],   # overfitting, regularization\n",
    "    2: [17, 18], # attention, transformers\n",
    "    3: [22, 23], # cosine similarity, TF-IDF\n",
    "    4: [24, 25], # RAG, vector databases\n",
    "    5: [1],      # gradient descent\n",
    "    6: [9, 10],  # random forests, boosting\n",
    "    7: [29, 30], # RLHF, reward models\n",
    "}\n",
    "\n",
    "print(f\"Corpus: {len(RAW_DOCS)} docs, {len(QUERIES)} queries\")\n",
    "print(f\"Example doc: '{RAW_DOCS[0][:60]}...'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 1: Tokenize and Build Vocabulary\n",
    "\n",
    "Before we can compute TF-IDF we need a vocabulary \u2014 the set of all unique terms.\n",
    "Tokenization converts raw text into a list of normalized tokens. We lowercase and\n",
    "strip punctuation so \"Learning\" and \"learning\" become the same term. Common English\n",
    "stop words (the, a, of, ...) are removed because they appear in almost every document\n",
    "and carry little discriminative signal \u2014 including them would dilute TF-IDF weights.\n",
    "After tokenizing every document we collect the union of all tokens into a sorted\n",
    "vocabulary list and build a term->index mapping. The vocabulary size determines\n",
    "the width of our TF-IDF matrix. On this tiny corpus you should see roughly 150-200\n",
    "unique non-stop-word terms."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "STOP_WORDS = set(\"\"\"\n",
    "a an the and or but in on at to for of with by from is are was were be been\n",
    "being have has had do does did will would could should may might shall can\n",
    "this that these those it its they them their s not no as into more most\n",
    "each other also both all any some such than then when where how what who\n",
    "\"\"\".split())\n",
    "\n",
    "def tokenize(text):\n",
    "    # TODO(you): lowercase the text, split on non-alpha characters, filter tokens\n",
    "    # shorter than 2 chars and tokens in STOP_WORDS.\n",
    "    # Return a list of token strings.\n",
    "    # Hint: use re.split(r'[^a-z]+', text.lower())\n",
    "    raise NotImplementedError\n",
    "\n",
    "def build_vocab(docs):\n",
    "    # TODO(you): tokenize every doc in docs, collect all unique tokens,\n",
    "    # return (vocab_list sorted alphabetically, term2idx dict).\n",
    "    raise NotImplementedError\n",
    "\n",
    "def check_stage1():\n",
    "    tokens = tokenize(\"Machine learning models learn patterns!\")\n",
    "    assert isinstance(tokens, list), \"tokenize must return a list\"\n",
    "    assert all(t == t.lower() for t in tokens), \"tokens must be lowercase\"\n",
    "    assert \"the\" not in tokens and \"a\" not in tokens, \"stop words must be removed\"\n",
    "    assert \"machine\" in tokens and \"learning\" in tokens, \"content words must remain\"\n",
    "\n",
    "    vocab, t2i = build_vocab(RAW_DOCS)\n",
    "    assert len(vocab) > 100, f\"vocab too small: {len(vocab)}\"\n",
    "    assert vocab == sorted(vocab), \"vocab must be sorted\"\n",
    "    assert t2i[vocab[0]] == 0, \"t2i must be a 0-based index\"\n",
    "    assert all(t2i[w] == i for i, w in enumerate(vocab)), \"t2i inconsistent\"\n",
    "    print(f\"Stage 1 OK \u2014 vocab size: {len(vocab)}\")\n",
    "\n",
    "# check_stage1()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 2: Build TF-IDF Matrix\n",
    "\n",
    "TF-IDF assigns each (document, term) pair a weight that captures two ideas.\n",
    "Term Frequency (TF) measures how often a term appears in a document relative to\n",
    "its length \u2014 frequent terms in a doc are likely important to that doc.\n",
    "Inverse Document Frequency (IDF) measures how rare a term is across the corpus \u2014\n",
    "terms appearing in every document (like \"data\") carry little discriminative power.\n",
    "We use log-IDF: IDF(t) = log((N+1)/(df(t)+1)) + 1, which smooths and avoids\n",
    "division by zero. The TF-IDF matrix is (N_docs x V) \u2014 typically very sparse.\n",
    "Normalizing each row to unit L2 norm means retrieval reduces to a dot product."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def build_tfidf(docs, vocab, t2i):\n",
    "    # TODO(you): build an (N, V) float64 TF-IDF matrix where\n",
    "    #   TF[d,t]  = count(t in doc d) / len(tokens in doc d)\n",
    "    #   IDF[t]   = log((N+1) / (df[t]+1)) + 1   (N = num docs, df = doc frequency)\n",
    "    #   weight   = TF * IDF, then L2-normalize each row.\n",
    "    # Return the matrix (N x V).\n",
    "    raise NotImplementedError\n",
    "\n",
    "def check_stage2():\n",
    "    vocab, t2i = build_vocab(RAW_DOCS)\n",
    "    mat = build_tfidf(RAW_DOCS, vocab, t2i)\n",
    "    N, V = mat.shape\n",
    "    assert N == len(RAW_DOCS), \"wrong number of rows\"\n",
    "    assert V == len(vocab), \"wrong number of columns\"\n",
    "    norms = np.linalg.norm(mat, axis=1)\n",
    "    assert np.allclose(norms, 1.0, atol=1e-6), \"rows must be L2-normalized\"\n",
    "    assert mat.dtype == np.float64, \"use float64\"\n",
    "    print(f\"Stage 2 OK \u2014 TF-IDF matrix: {mat.shape}, non-zero: {(mat > 0).sum()}\")\n",
    "\n",
    "# check_stage2()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 3: Retrieval and Evaluation (Recall@k, MRR)\n",
    "\n",
    "Retrieval means: given a query, tokenize and TF-IDF-encode it identically to the\n",
    "documents, then rank documents by cosine similarity (which is just the dot product\n",
    "when both vectors are L2-normalized). The top-k documents are the retrieved set.\n",
    "Recall@k = fraction of relevant docs found in top-k. If there are 2 relevant docs\n",
    "and k=5, recall@5=1.0 only if both appear in the top 5.\n",
    "Mean Reciprocal Rank (MRR) = average of 1/rank_of_first_relevant_doc across queries.\n",
    "If the first relevant doc is ranked #2, MRR contribution is 0.5. MRR rewards putting\n",
    "at least one good result near the top. Both metrics are standard in IR evaluation."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def encode_query(query, vocab, t2i, idf):\n",
    "    # TODO(you): tokenize the query, build a TF vector (counts/len), multiply by idf,\n",
    "    # L2-normalize. Return a 1D array of length V.\n",
    "    # idf is a 1D array of length V (precomputed from the corpus).\n",
    "    raise NotImplementedError\n",
    "\n",
    "def retrieve_tfidf(query, tfidf_mat, vocab, t2i, idf, k=5):\n",
    "    # TODO(you): encode the query, compute dot products with all doc rows,\n",
    "    # return the indices of the top-k docs (descending similarity order).\n",
    "    raise NotImplementedError\n",
    "\n",
    "def recall_at_k(retrieved, relevant):\n",
    "    # TODO(you): fraction of relevant docs in retrieved (both are lists/arrays of indices).\n",
    "    raise NotImplementedError\n",
    "\n",
    "def mrr(retrieved, relevant):\n",
    "    # TODO(you): 1/rank of the first relevant doc found in retrieved, else 0.\n",
    "    # rank is 1-indexed.\n",
    "    raise NotImplementedError\n",
    "\n",
    "def eval_retriever(retrieve_fn, vocab, t2i, idf, k=5):\n",
    "    recalls, mrrs = [], []\n",
    "    for qi, query in enumerate(QUERIES):\n",
    "        relevant = GROUND_TRUTH[qi]\n",
    "        retrieved = retrieve_fn(query, k=k)\n",
    "        recalls.append(recall_at_k(retrieved, relevant))\n",
    "        mrrs.append(mrr(retrieved, relevant))\n",
    "    return np.mean(recalls), np.mean(mrrs)\n",
    "\n",
    "def check_stage3():\n",
    "    vocab, t2i = build_vocab(RAW_DOCS)\n",
    "    mat = build_tfidf(RAW_DOCS, vocab, t2i)\n",
    "    N, V = mat.shape\n",
    "    # recompute idf outside of build_tfidf for encode_query\n",
    "    token_lists = [tokenize(d) for d in RAW_DOCS]\n",
    "    df = np.zeros(V)\n",
    "    for toks in token_lists:\n",
    "        for t in set(toks):\n",
    "            if t in t2i:\n",
    "                df[t2i[t]] += 1\n",
    "    idf = np.log((N + 1) / (df + 1)) + 1\n",
    "\n",
    "    q_vec = encode_query(\"gradient descent optimization\", vocab, t2i, idf)\n",
    "    assert q_vec.shape == (V,), \"wrong query vector shape\"\n",
    "    assert abs(np.linalg.norm(q_vec) - 1.0) < 1e-6, \"query vector must be L2-normalized\"\n",
    "\n",
    "    retrieve_fn = lambda q, k: retrieve_tfidf(q, mat, vocab, t2i, idf, k)\n",
    "    r, m = eval_retriever(retrieve_fn, vocab, t2i, idf, k=5)\n",
    "    print(f\"Stage 3 OK \u2014 TF-IDF Recall@5: {r:.3f}, MRR: {m:.3f}\")\n",
    "    assert r > 0.4, f\"TF-IDF recall too low: {r:.3f} (check your implementation)\"\n",
    "\n",
    "# check_stage3()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 4: Latent Semantic Analysis via SVD\n",
    "\n",
    "TF-IDF is a bag-of-words model: \"optimization\" and \"optimizes\" are different terms\n",
    "even though they mean the same thing. Latent Semantic Analysis (LSA) addresses this\n",
    "by finding low-rank structure in the TF-IDF matrix using Singular Value Decomposition.\n",
    "M = U * S * Vt where U is (N x r), S is (r x r) diagonal, Vt is (r x V).\n",
    "Projecting documents into the top-r singular vectors gives dense \"semantic\" embeddings\n",
    "that capture co-occurrence patterns \u2014 documents about the same topic cluster together\n",
    "even if they use different words. We use np.linalg.svd with full_matrices=False and\n",
    "keep only the top r=20 components. Query embedding works the same way: encode as\n",
    "TF-IDF then project: q_emb = q_tfidf @ Vt.T (the first r rows of Vt)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def build_lsa(tfidf_mat, r=20):\n",
    "    # TODO(you): compute the compact SVD of tfidf_mat, keep top-r components.\n",
    "    # Return (doc_embs, Vt_r, s_r) where doc_embs = U[:,:r] * s_r (N x r matrix),\n",
    "    # Vt_r is the top-r rows of Vt (r x V), s_r is the top-r singular values.\n",
    "    raise NotImplementedError\n",
    "\n",
    "def retrieve_lsa(query, doc_embs, Vt_r, vocab, t2i, idf, k=5):\n",
    "    # TODO(you): encode query as TF-IDF vector (reuse encode_query),\n",
    "    # project into LSA space: q_lsa = q_vec @ Vt_r.T,\n",
    "    # L2-normalize both q_lsa and each row of doc_embs (or pre-normalize doc_embs),\n",
    "    # return top-k indices by cosine similarity.\n",
    "    raise NotImplementedError\n",
    "\n",
    "def check_stage4():\n",
    "    vocab, t2i = build_vocab(RAW_DOCS)\n",
    "    mat = build_tfidf(RAW_DOCS, vocab, t2i)\n",
    "    N, V = mat.shape\n",
    "    token_lists = [tokenize(d) for d in RAW_DOCS]\n",
    "    df = np.zeros(V)\n",
    "    for toks in token_lists:\n",
    "        for t in set(toks):\n",
    "            if t in t2i:\n",
    "                df[t2i[t]] += 1\n",
    "    idf = np.log((N + 1) / (df + 1)) + 1\n",
    "\n",
    "    doc_embs, Vt_r, s_r = build_lsa(mat, r=20)\n",
    "    assert doc_embs.shape == (N, 20), f\"wrong doc_embs shape: {doc_embs.shape}\"\n",
    "    assert Vt_r.shape == (20, V), f\"wrong Vt_r shape: {Vt_r.shape}\"\n",
    "\n",
    "    retrieve_fn = lambda q, k: retrieve_lsa(q, doc_embs, Vt_r, vocab, t2i, idf, k)\n",
    "    r, m = eval_retriever(retrieve_fn, vocab, t2i, idf, k=5)\n",
    "    print(f\"Stage 4 OK \u2014 LSA Recall@5: {r:.3f}, MRR: {m:.3f}\")\n",
    "\n",
    "# check_stage4()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 5: Hybrid Retrieval with Reciprocal Rank Fusion\n",
    "\n",
    "Neither TF-IDF nor LSA dominates on all queries. TF-IDF wins when the query uses\n",
    "exact words from the document; LSA wins when the query uses synonyms or related\n",
    "concepts. Reciprocal Rank Fusion (RRF) combines ranked lists from multiple retrievers\n",
    "without needing calibrated scores. For each document d and retriever i, RRF score:\n",
    "  score(d) = sum over i of 1/(k_rrf + rank_i(d))\n",
    "where k_rrf is a smoothing constant (typically 60) and rank is 1-indexed.\n",
    "Documents not retrieved by a system get rank = infinity, contributing 0.\n",
    "RRF is robust, parameter-light, and often beats learned fusion on small datasets.\n",
    "It is widely used in production hybrid search (e.g., Elasticsearch's RRF API)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def rrf_fuse(ranked_lists, k_rrf=60, n_docs=None):\n",
    "    # TODO(you): ranked_lists is a list of lists of doc indices (each already ranked).\n",
    "    # Compute RRF score for each doc across all lists.\n",
    "    # Return a ranked list of doc indices (highest RRF score first).\n",
    "    # n_docs is the total number of documents (use len(RAW_DOCS) if None).\n",
    "    raise NotImplementedError\n",
    "\n",
    "def retrieve_hybrid(query, tfidf_mat, doc_embs, Vt_r, vocab, t2i, idf, k=5, k_rrf=60):\n",
    "    # TODO(you): get top-2k results from both TF-IDF and LSA retrievers,\n",
    "    # fuse with RRF, return top-k.\n",
    "    raise NotImplementedError\n",
    "\n",
    "def check_stage5():\n",
    "    vocab, t2i = build_vocab(RAW_DOCS)\n",
    "    mat = build_tfidf(RAW_DOCS, vocab, t2i)\n",
    "    N, V = mat.shape\n",
    "    token_lists = [tokenize(d) for d in RAW_DOCS]\n",
    "    df = np.zeros(V)\n",
    "    for toks in token_lists:\n",
    "        for t in set(toks):\n",
    "            if t in t2i:\n",
    "                df[t2i[t]] += 1\n",
    "    idf = np.log((N + 1) / (df + 1)) + 1\n",
    "    doc_embs, Vt_r, s_r = build_lsa(mat, r=20)\n",
    "\n",
    "    fused = rrf_fuse([[0, 2, 5], [2, 0, 3]], k_rrf=60, n_docs=10)\n",
    "    assert fused[0] in [0, 2], \"top-1 should be 0 or 2 (tied leaders)\"\n",
    "\n",
    "    retrieve_fn = lambda q, k: retrieve_hybrid(q, mat, doc_embs, Vt_r, vocab, t2i, idf, k)\n",
    "    r, m = eval_retriever(retrieve_fn, vocab, t2i, idf, k=5)\n",
    "    print(f\"Stage 5 OK \u2014 Hybrid RRF Recall@5: {r:.3f}, MRR: {m:.3f}\")\n",
    "\n",
    "# check_stage5()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage 6: Chunking Experiment and Failure Analysis\n",
    "\n",
    "In real RAG systems documents are split into smaller chunks before indexing.\n",
    "Chunking at the sentence level often improves precision \u2014 a retrieved sentence is\n",
    "more focused than a whole paragraph. But it can hurt recall if the answer spans\n",
    "multiple sentences. We simulate this by treating each sentence of each doc as its\n",
    "own chunk. We then re-run TF-IDF retrieval over chunks and measure whether we\n",
    "recover the relevant docs (mapping chunk hits back to their parent doc).\n",
    "For the failure analysis we manually inspect two cases from the query set:\n",
    "one where TF-IDF beats LSA (exact keyword match) and one where LSA beats TF-IDF\n",
    "(semantic similarity without shared terms). Understanding failure modes is essential\n",
    "before deciding what retriever to deploy."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def chunk_corpus(docs, mode=\"sentence\"):\n",
    "    # TODO(you): if mode==\"sentence\", split each doc on '. ' boundaries and return\n",
    "    # (chunks list, chunk_to_doc list) where chunk_to_doc[i] is the doc index of chunk i.\n",
    "    # if mode==\"whole\", return (docs, list(range(len(docs)))).\n",
    "    raise NotImplementedError\n",
    "\n",
    "def recall_chunked(query, vocab, t2i, idf_c, chunk_mat, chunk_to_doc, relevant_docs, k=5):\n",
    "    # TODO(you): retrieve top-k chunks, map to parent docs (deduplicated),\n",
    "    # compute recall@k against relevant_docs.\n",
    "    raise NotImplementedError\n",
    "\n",
    "def check_stage6():\n",
    "    chunks, c2d = chunk_corpus(RAW_DOCS, mode=\"sentence\")\n",
    "    assert len(chunks) >= len(RAW_DOCS), \"sentence mode must produce at least as many chunks\"\n",
    "    assert len(chunks) == len(c2d), \"chunk_to_doc must be same length as chunks\"\n",
    "    assert c2d[0] == 0, \"first chunk must belong to doc 0\"\n",
    "\n",
    "    # Build TF-IDF over chunks\n",
    "    vocab_c, t2i_c = build_vocab(chunks)\n",
    "    mat_c = build_tfidf(chunks, vocab_c, t2i_c)\n",
    "    Nc, Vc = mat_c.shape\n",
    "    token_lists_c = [tokenize(ch) for ch in chunks]\n",
    "    df_c = np.zeros(Vc)\n",
    "    for toks in token_lists_c:\n",
    "        for t in set(toks):\n",
    "            if t in t2i_c:\n",
    "                df_c[t2i_c[t]] += 1\n",
    "    idf_c = np.log((Nc + 1) / (df_c + 1)) + 1\n",
    "\n",
    "    # Spot-check one query\n",
    "    q = QUERIES[5]  # \"explain gradient descent optimization\"\n",
    "    r = recall_chunked(q, vocab_c, t2i_c, idf_c, mat_c, c2d, GROUND_TRUTH[5], k=5)\n",
    "    print(f\"Stage 6 OK \u2014 chunked recall@5 for query 5: {r:.3f}\")\n",
    "\n",
    "    # Failure analysis printout\n",
    "    vocab, t2i = build_vocab(RAW_DOCS)\n",
    "    mat = build_tfidf(RAW_DOCS, vocab, t2i)\n",
    "    N, V = mat.shape\n",
    "    token_lists = [tokenize(d) for d in RAW_DOCS]\n",
    "    df = np.zeros(V)\n",
    "    for toks in token_lists:\n",
    "        for t in set(toks):\n",
    "            if t in t2i:\n",
    "                df[t2i[t]] += 1\n",
    "    idf = np.log((N + 1) / (df + 1)) + 1\n",
    "    doc_embs, Vt_r, s_r = build_lsa(mat, r=20)\n",
    "\n",
    "    print(\"\\n--- Failure analysis ---\")\n",
    "    for qi in range(len(QUERIES)):\n",
    "        tfidf_ret = retrieve_tfidf(QUERIES[qi], mat, vocab, t2i, idf, k=5).tolist()\n",
    "        lsa_ret = retrieve_lsa(QUERIES[qi], doc_embs, Vt_r, vocab, t2i, idf, k=5)\n",
    "        rel = GROUND_TRUTH[qi]\n",
    "        tr = recall_at_k(tfidf_ret, rel)\n",
    "        lr = recall_at_k(lsa_ret, rel)\n",
    "        if abs(tr - lr) > 0.3:\n",
    "            winner = \"TF-IDF\" if tr > lr else \"LSA\"\n",
    "            print(f\"  Q{qi}: '{QUERIES[qi][:40]}' -> {winner} wins ({tr:.1f} vs {lr:.1f})\")\n",
    "\n",
    "# check_stage6()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stretch Goals\n",
    "\n",
    "1. **BM25**: Replace TF-IDF with BM25 scoring (add k1 and b saturation parameters).\n",
    "   Compare Recall@5 and MRR against vanilla TF-IDF.\n",
    "2. **Query expansion**: For each query, add the top-3 TF-IDF-similar doc's most\n",
    "   distinctive terms (by IDF) and re-retrieve. Does it help?\n",
    "3. **Inverted index**: Build a sparse inverted index {term: [doc_ids]} and time\n",
    "   keyword retrieval vs the dense matrix multiply approach.\n",
    "4. **LSA dimension sweep**: Plot MRR vs r (number of LSA components) from 2 to 39.\n",
    "   Where does it peak, and why does it degrade at very high r?\n",
    "5. **GPT-2-style regex pre-tokenization**: Implement the pattern\n",
    "   `r\"'s|'t|'re|... | ?\\w+| ?[^\\s\\w]+\"` and show how it changes token boundaries\n",
    "   compared to your simple alpha-split tokenizer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# SOLUTIONS -- no peeking until your attempt\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# --- Stage 1 solution ---\n",
    "def solution_tokenize(text):\n",
    "    tokens = re.split(r'[^a-z]+', text.lower())\n",
    "    return [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]\n",
    "\n",
    "def solution_build_vocab(docs):\n",
    "    all_tokens = set()\n",
    "    for doc in docs:\n",
    "        all_tokens.update(solution_tokenize(doc))\n",
    "    vocab = sorted(all_tokens)\n",
    "    t2i = {w: i for i, w in enumerate(vocab)}\n",
    "    return vocab, t2i"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# --- Stage 2 solution ---\n",
    "def solution_build_tfidf(docs, vocab, t2i):\n",
    "    N = len(docs)\n",
    "    V = len(vocab)\n",
    "    mat = np.zeros((N, V), dtype=np.float64)\n",
    "    token_lists = [solution_tokenize(d) for d in docs]\n",
    "    for i, toks in enumerate(token_lists):\n",
    "        if not toks:\n",
    "            continue\n",
    "        counts = Counter(toks)\n",
    "        for term, cnt in counts.items():\n",
    "            if term in t2i:\n",
    "                mat[i, t2i[term]] = cnt / len(toks)\n",
    "    df = np.zeros(V)\n",
    "    for toks in token_lists:\n",
    "        for t in set(toks):\n",
    "            if t in t2i:\n",
    "                df[t2i[t]] += 1\n",
    "    idf = np.log((N + 1) / (df + 1)) + 1\n",
    "    mat = mat * idf\n",
    "    norms = np.linalg.norm(mat, axis=1, keepdims=True)\n",
    "    norms[norms == 0] = 1.0\n",
    "    return mat / norms"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# --- Stage 3 solution ---\n",
    "def _compute_idf(docs, vocab, t2i):\n",
    "    N = len(docs)\n",
    "    V = len(vocab)\n",
    "    df = np.zeros(V)\n",
    "    for doc in docs:\n",
    "        toks = solution_tokenize(doc)\n",
    "        for t in set(toks):\n",
    "            if t in t2i:\n",
    "                df[t2i[t]] += 1\n",
    "    return np.log((N + 1) / (df + 1)) + 1\n",
    "\n",
    "def solution_encode_query(query, vocab, t2i, idf):\n",
    "    toks = solution_tokenize(query)\n",
    "    V = len(vocab)\n",
    "    vec = np.zeros(V, dtype=np.float64)\n",
    "    if not toks:\n",
    "        return vec\n",
    "    counts = Counter(toks)\n",
    "    for term, cnt in counts.items():\n",
    "        if term in t2i:\n",
    "            vec[t2i[term]] = cnt / len(toks)\n",
    "    vec = vec * idf\n",
    "    norm = np.linalg.norm(vec)\n",
    "    return vec / norm if norm > 0 else vec\n",
    "\n",
    "def solution_retrieve_tfidf(query, tfidf_mat, vocab, t2i, idf, k=5):\n",
    "    q_vec = solution_encode_query(query, vocab, t2i, idf)\n",
    "    scores = tfidf_mat @ q_vec\n",
    "    return np.argsort(scores)[::-1][:k]\n",
    "\n",
    "def solution_recall_at_k(retrieved, relevant):\n",
    "    retrieved_set = set(retrieved)\n",
    "    hits = sum(1 for r in relevant if r in retrieved_set)\n",
    "    return hits / len(relevant) if relevant else 0.0\n",
    "\n",
    "def solution_mrr(retrieved, relevant):\n",
    "    relevant_set = set(relevant)\n",
    "    for rank, doc_id in enumerate(retrieved, start=1):\n",
    "        if doc_id in relevant_set:\n",
    "            return 1.0 / rank\n",
    "    return 0.0"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# --- Stage 4 solution ---\n",
    "def solution_build_lsa(tfidf_mat, r=20):\n",
    "    U, s, Vt = np.linalg.svd(tfidf_mat, full_matrices=False)\n",
    "    U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]\n",
    "    doc_embs = U_r * s_r\n",
    "    return doc_embs, Vt_r, s_r\n",
    "\n",
    "def solution_retrieve_lsa(query, doc_embs, Vt_r, vocab, t2i, idf, k=5):\n",
    "    q_tfidf = solution_encode_query(query, vocab, t2i, idf)\n",
    "    q_lsa = q_tfidf @ Vt_r.T\n",
    "    q_norm = np.linalg.norm(q_lsa)\n",
    "    if q_norm > 0:\n",
    "        q_lsa = q_lsa / q_norm\n",
    "    doc_norms = np.linalg.norm(doc_embs, axis=1, keepdims=True)\n",
    "    doc_norms[doc_norms == 0] = 1.0\n",
    "    doc_embs_normed = doc_embs / doc_norms\n",
    "    scores = doc_embs_normed @ q_lsa\n",
    "    return np.argsort(scores)[::-1][:k]"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# --- Stage 5 solution ---\n",
    "def solution_rrf_fuse(ranked_lists, k_rrf=60, n_docs=None):\n",
    "    if n_docs is None:\n",
    "        n_docs = len(RAW_DOCS)\n",
    "    scores = defaultdict(float)\n",
    "    for ranked in ranked_lists:\n",
    "        for rank, doc_id in enumerate(ranked, start=1):\n",
    "            scores[doc_id] += 1.0 / (k_rrf + rank)\n",
    "    all_docs = list(range(n_docs))\n",
    "    return sorted(all_docs, key=lambda d: scores[d], reverse=True)\n",
    "\n",
    "def solution_retrieve_hybrid(query, tfidf_mat, doc_embs, Vt_r, vocab, t2i, idf, k=5, k_rrf=60):\n",
    "    n = len(RAW_DOCS)\n",
    "    tfidf_ranked = solution_retrieve_tfidf(query, tfidf_mat, vocab, t2i, idf, k=2*k).tolist()\n",
    "    lsa_ranked = solution_retrieve_lsa(query, doc_embs, Vt_r, vocab, t2i, idf, k=2*k).tolist()\n",
    "    fused = solution_rrf_fuse([tfidf_ranked, lsa_ranked], k_rrf=k_rrf, n_docs=n)\n",
    "    return fused[:k]"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# --- Stage 6 solution ---\n",
    "def solution_chunk_corpus(docs, mode=\"sentence\"):\n",
    "    if mode == \"whole\":\n",
    "        return docs, list(range(len(docs)))\n",
    "    chunks, chunk_to_doc = [], []\n",
    "    for doc_idx, doc in enumerate(docs):\n",
    "        sentences = [s.strip() for s in doc.split('. ') if s.strip()]\n",
    "        if not sentences:\n",
    "            sentences = [doc]\n",
    "        for sent in sentences:\n",
    "            chunks.append(sent)\n",
    "            chunk_to_doc.append(doc_idx)\n",
    "    return chunks, chunk_to_doc\n",
    "\n",
    "def solution_recall_chunked(query, vocab, t2i, idf, chunk_mat, chunk_to_doc, relevant_docs, k=5):\n",
    "    q_vec = solution_encode_query(query, vocab, t2i, idf)\n",
    "    scores = chunk_mat @ q_vec\n",
    "    top_chunk_indices = np.argsort(scores)[::-1][:k]\n",
    "    retrieved_docs = list(dict.fromkeys(chunk_to_doc[ci] for ci in top_chunk_indices))\n",
    "    return solution_recall_at_k(retrieved_docs, relevant_docs)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}