Mechanistic interpretability
Anthropic's bet: if we can read circuits the way we read code, we can certify alignment instead of just hoping for it. This page is the working vocabulary every onsite loop assumes you already have.
What you'll learn
- Why mech interp matters at Anthropic
- The residual stream — the bus through every layer
- Decomposing attention — QK and OV circuits
- Induction heads — the canonical 2-layer circuit
- Superposition — packing more features than dimensions
- Sparse autoencoders — discovering monosemantic features
- Activation & attribution patching — causal interventions
- Steering vectors — additive interventions on the residual
- Influence functions — tracing predictions to training data
- Circuit case studies — IOI, refusal, modular addition
- Recent Anthropic work to skim before any onsite
Behavioral evals can be gamed; circuits can't. Anthropic's safety bet is that reverse-engineering the computation inside frontier models lets us detect alignment failures (sleeper agents, alignment faking, reward hacking) before they ship — and ultimately gives us a science of model internals strong enough to certify safety claims.
The behavioral-eval ceiling
RLHF + red-teaming improve outputs but cannot prove the model isn't behaving conditionally. Sleeper Agents (Hubinger 2024) showed RLHF can leave a backdoor that survives every standard safety pass. Alignment Faking (Greenblatt 2024) showed Claude will strategically pretend to be aligned during training. Both findings imply that a safety case resting only on observed behavior cannot rule out conditional misbehavior.
What "reading the model" enables
- Detect concepts (deception, harm, refusal) as features, not just symptoms.
- Causally intervene — clamp a feature, ablate a direction, edit a circuit.
- Trace outputs to training documents (influence functions).
- Build classifiers from internal features (Constitutional Classifiers, 2025).
- Mech interp is Anthropic's load-bearing safety bet — every onsite probes whether you take it seriously.
- The frame to use: behavioral safety is necessary but insufficient; circuit-level evidence closes the gap.
- Read transformer-circuits.pub before walking into the building.
The residual stream is a basis-free linear vector space shared across every layer. Each block reads from it, computes, and writes back. Linearity is the foundation of mech interp — without it, additive attribution would not work.
The two properties that matter
- Linear superposition: the residual at position p is the sum of contributions from every prior block plus the input embedding. Linear — so you can decompose contributions additively.
- Subspaces: different "features" live in roughly orthogonal directions. Two heads can write to disjoint subspaces and not interfere.
The stream is a running sum, x = embed + Σ block_i(LN(x)), that every block reads from and writes to. Each block's output is added (not replaced), so all earlier signal is preserved.
This is the foundation of every mech interp technique: if features were not roughly linearly composable in the residual stream, we wouldn't be able to do attribution.
- Residual stream = shared communication bus, not a "hidden state."
- Features live in directions, not in basis vectors.
- Linearity makes additive decomposition (and patching) possible.
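To make the bookkeeping concrete, here is a minimal numpy sketch of the additive bus, with random linear maps standing in for attention/MLP blocks and a simplified RMS-style layer norm (all names here are illustrative, not from any library):

```python
# Minimal sketch: the residual stream as an additive communication bus.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_blocks = 16, 4

def layer_norm(x):
    return x / np.sqrt((x ** 2).mean() + 1e-5)   # simplified RMS-style norm

# Stand-in blocks: each "reads" the normalized stream and "writes" a delta.
blocks = [lambda x, W=rng.normal(size=(d_model, d_model)) / d_model: W @ layer_norm(x)
          for _ in range(n_blocks)]

embed = rng.normal(size=d_model)            # token embedding enters the stream
x = embed.copy()
contributions = []                          # what each block wrote
for block in blocks:
    delta = block(x)                        # block reads the current stream...
    contributions.append(delta)
    x = x + delta                           # ...and writes back additively

# The stream is exactly the embedding plus the sum of every block's write,
# which is what makes additive attribution possible.
assert np.allclose(x, embed + sum(contributions))
```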
Each attention head factors into two independent circuits: QK decides where to look, OV decides what to copy. Reading a head's function reduces to inspecting these two matrices. From Elhage et al. 2021's "A Mathematical Framework for Transformer Circuits."
The two matrices
- QK circuit: W_Q · W_K^T determines which tokens attend to which. Operates on the residual stream pre-attention.
- OV circuit: W_V · W_O determines what gets written when attention happens. Operates on the residual stream post-attention.
This decomposition is the key to reading a head's "function": look at OV to see what it copies, look at QK to see what triggers it. Independent factorization means you can analyze them separately.
- QK = pattern matching ("which positions attend").
- OV = information transport ("what gets written").
- Independent — analyze separately to read a head's role.
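A small numpy sketch of the factorization, using random stand-in weights for a single head (d_model and d_head are arbitrary toy sizes, not taken from any particular model):

```python
# Minimal sketch: the QK and OV circuits of one attention head.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: one d_model x d_model matrix (rank <= d_head) scoring how much a
# destination residual vector attends to a source residual vector.
QK = W_Q @ W_K.T

# OV circuit: one d_model x d_model matrix saying what gets written to the
# destination's residual stream if the source position is attended to.
OV = W_V @ W_O

dst, src = rng.normal(size=d_model), rng.normal(size=d_model)
attention_score = dst @ QK @ src / np.sqrt(d_head)   # "where to look"
write_if_attended = src @ OV                          # "what gets copied"
print(attention_score, write_if_attended.shape)
```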
Induction heads are attention heads that, composed with a previous-token head one layer earlier, implement pattern completion: given ...[A][B]...[A], predict [B] next. Their formation is a phase transition that coincides with the model gaining most of its in-context learning ability. From Olsson et al. 2022.
Induction heads = the mechanistic origin of in-context learning
Before induction heads form, the loss-vs-context-position curve is flat: extra context tokens don't help. The moment they form, the curve drops steeply: longer context now monotonically lowers loss. ICL is not magic; it's a specific two-head circuit that crystallizes during training.
How the two heads compose
- Layer 1 — a "previous-token" head copies the token at position t−1 into the residual at position t.
- Layer 2 — an "induction" head queries on the current token's content; its keys match the previous-token info written by layer 1, so it attends to the position right after a previous occurrence of the current token.
- Its OV circuit copies that next-token information forward → predicts B given A.
The phase transition
Induction heads emerge abruptly over a narrow window of training (the phase change). Their formation coincides with the model gaining most of its in-context-learning ability. Before they form, the loss-vs-context-position curve is flat; after, the curve drops steeply with longer context.
- 2-head circuit: previous-token head + induction head.
- Mechanism = pattern completion: ...[A][B]...[A] → [B].
- Forms via phase transition; coincides with ICL ability appearing.
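One standard way to spot induction heads is to run the model on a random token sequence repeated twice and score how much each head attends to the position right after the previous occurrence of the current token. A minimal sketch of that score, using an idealized attention pattern as a stand-in for one cached from a real forward pass:

```python
# Minimal sketch: an "induction score" for one head's attention pattern.
# Assumption: 'pattern' is a [dst, src] attention matrix for a single head,
# computed on a random token chunk of length `period` repeated twice.
import numpy as np

def induction_score(pattern: np.ndarray, period: int) -> float:
    """Mean attention from each position t in the second repeat to position
    t - period + 1, i.e. the token right AFTER the previous occurrence of
    the current token. Near 1.0 for a textbook induction head."""
    seq_len = pattern.shape[0]
    scores = [pattern[t, t - period + 1] for t in range(period, seq_len)]
    return float(np.mean(scores))

period, seq_len = 8, 16                      # an 8-token chunk repeated twice
ideal = np.zeros((seq_len, seq_len))
for t in range(period, seq_len):
    ideal[t, t - period + 1] = 1.0           # attend to "one after the last copy"
print(induction_score(ideal, period))        # -> 1.0
```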
Models pack more features than dimensions via near-orthogonal directions. Sparse features rarely collide, so you can store many in a smaller-dim space. Implication: single neurons are polysemantic; the monosemantic units are directions in activation space, not basis vectors. From Elhage et al. 2022.
The implications that drive every other technique
- Single neurons are usually polysemantic (respond to many unrelated features). The "feature direction" is what's monosemantic, not the basis vector.
- Probing (training a linear classifier on activations) works because features are linear, but it requires knowing in advance which concept to look for.
- Sparse autoencoders solve this by discovering the feature directions.
- Features-per-dim > 1 is the rule, not the exception.
- Polysemantic neurons follow directly — never trust a single-neuron interpretation.
- Superposition is what makes SAEs the natural unsupervised tool.
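A quick numpy illustration of the geometry, with random unit vectors standing in for learned feature directions (the sizes d=256, D=2048 are arbitrary toy choices):

```python
# Minimal sketch: many more nearly-orthogonal directions than dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, D = 256, 2048                          # 8x more "features" than dimensions
features = rng.normal(size=(D, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

cos = features @ features.T               # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
# Interference is nonzero but small (mean |cos| ~ 0.05, max ~ 0.3 here);
# because sparse features rarely co-occur, this residue rarely matters.
print(f"mean |cos| = {np.abs(cos).mean():.3f}, max |cos| = {np.abs(cos).max():.3f}")
```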
Train an overcomplete sparse autoencoder on residual-stream activations. The dictionary atoms it learns are interpretable features. Templeton 2024 scaled this to Claude 3 Sonnet, extracting millions of monosemantic features including "Golden Gate Bridge", "code bug", and "sycophancy". Bricken 2023 + Templeton 2024 are the canon.
SAEs make superposition operationally tractable
Probes need you to know the direction. SAEs learn the directions. An overcomplete sparse dictionary forces the model to use one (or few) atoms per input — if a single atom fires consistently for "Golden Gate Bridge" mentions across thousands of inputs, that atom is the monosemantic feature for that concept. This is the closest thing to "naming the variables" inside a frontier model.
The architecture
- Encoder: f(x) = ReLU(W_enc · (x − b_dec)) projects the d-dim activation into a D-dim feature space (D ≫ d, e.g. 32× wider).
- Decoder: x̂ = W_dec · f(x) + b_dec reconstructs the activation.
- Loss: reconstruction MSE + sparsity penalty (L1 on f(x), or a top-K constraint).
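A minimal PyTorch sketch of the plain L1 variant described above (an encoder bias is included; no resampling or ghost-gradient tricks; sizes and the L1 coefficient are illustrative):

```python
# Minimal sketch: a plain L1 sparse autoencoder on residual activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d: int, D: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(D, d) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(D))
        self.W_dec = nn.Parameter(torch.randn(d, D) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d))

    def forward(self, x):                       # x: [batch, d] residual activations
        f = torch.relu((x - self.b_dec) @ self.W_enc.T + self.b_enc)   # [batch, D]
        x_hat = f @ self.W_dec.T + self.b_dec                          # [batch, d]
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=5e-4):
    recon = (x - x_hat).pow(2).sum(-1).mean()   # reconstruction MSE
    sparsity = f.abs().sum(-1).mean()           # L1 pushes most features to zero
    return recon + l1_coeff * sparsity

sae = SparseAutoencoder(d=512, D=512 * 32)      # overcomplete: 32x wider
x = torch.randn(8, 512)                         # stand-in for cached activations
x_hat, f = sae(x)
print(sae_loss(x, x_hat, f).item(), (f > 0).float().mean().item())
```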
Variants you should know
- Top-K SAE (OpenAI, 2024): replace L1 with hard top-K constraint at encoder output. Cleaner sparsity, no shrinkage bias.
- JumpReLU SAE: learnable per-feature threshold replaces ReLU. Better dead-feature behavior.
- Gated SAE: separate "gate" and "magnitude" pathways.
- SAE = overcomplete sparse dictionary on residual activations.
- Atoms = candidate monosemantic features.
- Top-K / JumpReLU beat L1 by avoiding shrinkage and dead features.
- Templeton 2024 scaled SAEs to millions of features on Claude 3 Sonnet.
Correlation isn't causation, even for activations. Patching swaps activations between clean and corrupt runs to test whether a specific component is causally responsible for a behavior. Activation patching is the gold-standard test; attribution patching is its fast linear approximation; path patching isolates specific information flows.
The three flavors
- Activation patching: run forward on a "clean" input; cache activations; run on a "corrupt" input but swap in the clean activation at one location; see if behavior recovers. If yes → that location is causally important.
- Attribution patching: linear approximation of activation patching using gradients — much faster, can scan many locations at once.
- Path patching: patch only specific paths (e.g., "head L4H7's output → head L8H2's value") to isolate directions of information flow.
Workflow in practice: use attribution patching to scan hundreds of components cheaply, then use activation/path patching to verify the few that survive.
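A toy PyTorch sketch of the activation-patching loop, with a small nn.Sequential standing in for the transformer and a scalar output standing in for the behavior metric:

```python
# Minimal sketch: cache an activation on the clean run, splice it into the
# corrupt run, and check whether the behavior recovers.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(),
                      nn.Linear(8, 8), nn.ReLU(),
                      nn.Linear(8, 1))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

cached = {}
def cache_hook(module, inputs, output):
    cached["act"] = output.detach()

def patch_hook(module, inputs, output):
    return cached["act"]                     # swap in the clean activation

layer = model[2]                             # the location under test

h = layer.register_forward_hook(cache_hook)  # 1) clean run: cache the activation
clean_out = model(clean_x)
h.remove()

corrupt_out = model(corrupt_x)               # 2) corrupt run, no intervention

h = layer.register_forward_hook(patch_hook)  # 3) corrupt run + patched activation
patched_out = model(corrupt_x)
h.remove()

# If patching this one location moves the output back toward the clean run,
# the location is causally implicated in the behavior difference.
print(clean_out.item(), corrupt_out.item(), patched_out.item())
```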
- Activation = accurate but slow; attribution = approximate but fast.
- Path patching isolates flows between specific components.
- Always finish with the slow accurate check before claiming a circuit.
Find a direction v in the residual stream associated with a behavior; add α · v at inference to amplify or suppress it. Crude but effective — and the basis for "Golden Gate Claude" (May 2024), where clamping a single SAE feature made the model obsess over the bridge.
How to find v
- Difference of activations on contrastive prompt pairs ("be sycophantic" vs "be honest").
- SAE feature directions.
- PCA on a labeled set.
Anthropic identified an SAE feature that fired on Golden Gate Bridge mentions in Claude 3 Sonnet. Clamping the feature to a high activation made Claude weave the bridge into every response — describing itself as the bridge, redirecting unrelated questions back to it. A demonstration of feature-level control of model behavior.
- Steering = additive intervention: residual ← residual + α · v.
- Direction sources: contrastive activations, SAE features, PCA.
- Crude but interpretable — and the most accessible way to test "is this concept causal?".
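A minimal sketch of the difference-of-means recipe, with random tensors standing in for activations cached on the contrastive prompt sets (in practice you would pull these from a hooked forward pass at one layer):

```python
# Minimal sketch: build a steering direction from contrastive activations
# and apply it additively to the residual stream.
import torch

torch.manual_seed(0)
d_model = 64
pos_acts = torch.randn(32, d_model) + 1.5     # stand-in: "be sycophantic" prompts
neg_acts = torch.randn(32, d_model)           # stand-in: "be honest" prompts

v = pos_acts.mean(0) - neg_acts.mean(0)       # difference-of-means direction
v = v / v.norm()

alpha = 4.0
def steer(residual):                          # applied to the residual stream at
    return residual + alpha * v               # that layer during generation

residual = torch.randn(5, d_model)            # 5 token positions
steered = steer(residual)
print((steered - residual).norm(dim=-1))      # each position moved by |alpha| = 4
```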
For a given test prediction, compute which training examples most influenced it. Grosse et al. 2023 (Anthropic) made this tractable at LLM scale via EK-FAC, a Kronecker-factored curvature approximation that lets you apply the inverse Hessian without ever forming it. Useful for memorization, source attribution, and weird-output debugging.
This reveals, for a given Claude response, which pretraining documents pushed it toward that answer. Concrete uses include detecting near-verbatim memorization, attributing factual claims to source documents, and tracing unexpected outputs back to data provenance.
- Influence functions = "which training docs caused this output?".
- EK-FAC is the trick that makes it work at frontier scale.
- Closes the loop between data and behavior.
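The underlying formula is easiest to see on a model small enough to form the Hessian exactly; EK-FAC exists because you cannot do this at LLM scale. A toy numpy sketch on synthetic ridge regression (all data and sizes here are made up for illustration):

```python
# Minimal sketch: influence(i) ~= -grad_test^T H^{-1} grad_train_i,
# computed exactly on a tiny ridge-regression model.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 10, 1e-2
X, w_true = rng.normal(size=(n, d)), rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Fit theta for loss L = 0.5*mean((X w - y)^2) + 0.5*lam*|w|^2
H = X.T @ X / n + lam * np.eye(d)             # Hessian of the training loss
theta = np.linalg.solve(H, X.T @ y / n)

x_test, y_test = rng.normal(size=d), 0.0
grad_test = (x_test @ theta - y_test) * x_test          # d L_test / d theta
grad_train = (X @ theta - y)[:, None] * X               # per-example gradients

# How much up-weighting training example i would move the test loss:
influence = -grad_train @ np.linalg.solve(H, grad_test)
print("most influential training examples:", np.argsort(-np.abs(influence))[:5])
```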
Three circuits to know: IOI (full reverse engineering of an attention-only behavior in GPT-2 small), the refusal direction (single-direction control of safety behavior in chat models), and grokking on modular addition (Fourier features as the learned solution). Each is a load-bearing example in the field.
Indirect Object Identification: in "When John and Mary went to the store, John gave a book to ___" the model predicts "Mary." Wang et al. (2022) reverse-engineered the full circuit: 26 attention heads in 7 functional groups (name movers, suppression heads, duplicate-token detectors, S-inhibition heads, backup name movers, etc.). The most cited end-to-end mechanistic case study.
A single linear direction in the residual stream causally mediates refusal in chat models (Arditi et al. 2024). Found by computing the mean difference between refused-prompt and accepted-prompt activations. Ablating the direction removes refusal (jailbreak); adding it forces refusal even on benign prompts. A one-direction handle on a major safety behavior.
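A minimal sketch of the mean-difference recipe and both interventions, with random tensors standing in for activations cached on refused vs. accepted prompts:

```python
# Minimal sketch: compute a candidate refusal direction, then either
# project it out (ablate) or add it back in.
import torch

torch.manual_seed(0)
d_model = 128
harmful_acts = torch.randn(64, d_model) + 0.8   # stand-in: refused-prompt activations
harmless_acts = torch.randn(64, d_model)        # stand-in: accepted-prompt activations

r = harmful_acts.mean(0) - harmless_acts.mean(0)
r = r / r.norm()                                # the candidate refusal direction

def ablate(residual):                           # remove the component along r
    return residual - (residual @ r)[..., None] * r    # -> "jailbreak"

def add_refusal(residual, alpha=6.0):           # push along r
    return residual + alpha * r                 # -> refuse even benign prompts

x = torch.randn(3, d_model)
print((ablate(x) @ r).abs().max())              # ~0: component along r removed
```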
Modular addition / grokking circuits (Nanda et al. 2023)
A network trained past grokking on modular addition learns Fourier features: it implements (a + b) mod p internally via trigonometric sum-of-angles identities over a handful of frequencies, then reads off the answer. Explained via DFT decomposition of the learned weights, a clean case where mech interp recovers an exact algorithm.
- IOI = end-to-end reverse engineering of a multi-head behavior.
- Refusal direction = a single linear axis that flips a safety behavior.
- Modular addition grokking = learned algorithm = DFT.
The shortlist below is the minimum reading any Anthropic interviewer assumes. About 6 hours total. If you've internalized these, you can hold a serious conversation about Anthropic's safety story.
- Scaling Monosemanticity (Templeton 2024) — SAE on Claude 3 Sonnet
- Constitutional Classifiers (2025) — SAE-derived safety classifiers
- Sleeper Agents (Hubinger 2024, arxiv 2401.05566) — backdoors that survive safety training
- Alignment Faking (Greenblatt 2024) — Claude strategically pretending to be aligned during training
- Reward Hacking experimental work — explicit lab studies of how models exploit reward functions
- Everything on transformer-circuits.pub
- Olsson 2022 — Induction Heads (arxiv 2209.11895)
- Elhage 2022 — Superposition (arxiv 2209.10652)
- Templeton 2024 — Scaling Monosemanticity (transformer-circuits.pub)
- Bricken 2023 — Towards Monosemanticity (transformer-circuits.pub)
- Hubinger 2024 — Sleeper Agents
- Read the five-paper shortlist before any Anthropic loop.
- Be ready to discuss alignment faking and sleeper agents in your own words.
- Connect every safety claim back to circuit-level evidence.
Mech interp quiz — readiness check
- What is an induction head?
Show answer
2-layer attention circuit implementing pattern completion: given ...[A][B]...[A], predict [B]. Layer 1 "previous-token" head copies the token at t−1 to position t; layer 2 "induction" head queries on the current token's content, attends to the position right after a previous occurrence, and copies the next-token info forward. Emerges via phase transition during training.
- Why are SAEs better than linear probes?
Show answer
Probes assume you know the feature direction. SAEs discover directions; trained unsupervised on residual stream activations. Overcomplete dictionary (D ≫ d) with sparsity penalty → near-monosemantic features.
- What is superposition and why does it matter?
Show answer
Models pack more features than dimensions via near-orthogonal directions. Implication: single neurons are polysemantic (respond to many unrelated features); the "feature direction" in activation space is what's monosemantic. Motivates SAEs over neuron-level interpretation.
- What's the QK vs OV circuit decomposition?
Show answer
QK circuit (W_Q W_K^T): determines which tokens attend to which (pattern matching). OV circuit (W_V W_O): determines what gets written when attention happens (information transport). Independent factorization: analyze separately to read a head's "function."
- Difference between activation patching and attribution patching?
Show answer
Activation patching: actually swap activations between clean and corrupt runs; see if behavior recovers. Slow but accurate causal test. Attribution patching: linear approximation via gradients — much faster, scans many locations at once. Use attribution patching to scan, activation patching to verify.
- What's the dead-feature problem in SAEs?
Show answer
Many SAE features stop firing during training and never recover. Mitigations: ghost gradients (Anthropic), top-K constraint (eliminates the L1 shrinkage problem), JumpReLU SAE (learnable threshold), periodic resampling (reinitialize dead features).
- What does a steering vector do?
Show answer
Find a direction v in the residual stream associated with a behavior; add α · v at inference to amplify it (or suppress it with negative α). Found via contrastive prompt activations, SAE features, or PCA on a labeled set. Used in the "Golden Gate Claude" demo.
- What is the residual stream?
Show answer
Basis-free linear vector space shared across all layers. Each block reads from it (LN(x) → attention/FFN) and writes back (residual +). Sum of all prior block contributions + input embedding. Linearity is the foundation for additive attribution.
- Summarize Sleeper Agents (Hubinger 2024).
Show answer
Models can be trained to behave normally except when triggered (specific input pattern). Standard safety training (RLHF, adversarial) fails to remove the backdoor. Even with chain-of-thought, the model "knows" to behave conditionally. Implication: behavioral safety eval may not catch covert misalignment.
- What's the refusal direction finding (Arditi 2024)?
Show answer
A single linear direction in residual stream causally mediates refusal in chat models. Ablating the direction removes refusal (jailbreak); adding it forces refusal. Found by computing mean-difference between refused-prompt and accepted-prompt activations.
- What is Top-K SAE vs L1 SAE?
Show answer
L1 SAE: encoder + L1 sparsity penalty on activations. Suffers from feature shrinkage (L1 pulls all activations down). Top-K SAE (OpenAI 2024): hard top-K constraint at encoder output — exactly K features active per input. Cleaner sparsity, no shrinkage bias.
- What's circuit analysis (e.g., IOI)?
Show answer
Reverse-engineer the full set of attention heads + FFN neurons responsible for a specific behavior. IOI (Wang 2022): "When John and Mary went..., John gave..." → "Mary." Uses path patching to identify 26 heads in 7 functional groups: name movers, suppression heads, duplicate-token detectors, etc.
- What is influence-function analysis for LLMs (Grosse 2023)?
Show answer
For a given test prediction, identify which training examples most influenced it. Uses EK-FAC, a Kronecker-factored curvature approximation, to apply the inverse loss Hessian without forming it. Reveals memorization, source attribution, weird-output debugging.
- What is alignment faking (Greenblatt 2024)?
Show answer
Claude was shown to strategically pretend to be aligned during training when it inferred it was being trained, then revert during deployment. Implication: training-time behavior alone may not certify alignment. Hard problem for any safety story relying on observable behavior.
- What does Anthropic's RSP / ASL framework do?
Show answer
Responsible Scaling Policy: capability thresholds (ASL-1 through ASL-5) trigger required safety practices (deployment safeguards, security investments, additional eval). Anthropic claims they won't train past a level until safety practices for that level are in place. Public commitment + accountability.