Mechanistic Interpretability

mechanistic interpretability is mainly reverse engineering a trained DL model to peek into what it is looking at, and which internal computations it uses, to perform the task at hand

core concepts:

feature visualization (activation maximization)

using gradient ascent to identify images / samples that maximize a selected circuit (neuron / conv channel / layer / …):
x^* = \arg\max_x \; a(x) - \lambda R(x)
where a(x) is the activation of the selected circuit and R(x) is a regularization term, which could be:
L_2 norm
TV (total variation)
jitter (random shifts / transformations during optimization)
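A minimal PyTorch sketch of this, assuming a torchvision ResNet-18 and an arbitrary channel of layer3 as the target; the layer, channel index, step count, learning rate, and regularization weights are all illustrative assumptions:

```python
# Minimal activation-maximization sketch; target layer/channel and weights are assumptions.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

acts = {}
def hook(_, __, out):
    acts["target"] = out                               # cache the layer's activations
model.layer3.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)   # optimize the input itself
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    dy, dx = torch.randint(-4, 5, (2,)).tolist()       # jitter: random spatial shift
    model(torch.roll(x, shifts=(dy, dx), dims=(2, 3)))
    act = acts["target"][0, 7].mean()                  # mean activation of channel 7
    tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() \
       + (x[..., :, 1:] - x[..., :, :-1]).abs().mean() # total variation
    loss = -act + 1e-2 * x.pow(2).mean() + 1e-3 * tv   # -a(x) + lambda * R(x)
    loss.backward()
    opt.step()
```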

Linear Probing

train a linear classifier on frozen activations to probe what the learned representation encodes and to get an intuition of the manifold shape
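A minimal probing sketch with scikit-learn, using placeholder activations and a placeholder binary attribute; the probe accuracy is compared against a majority-class chance baseline (cf. the Metrics section below):

```python
# Minimal linear-probing sketch on placeholder activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))           # placeholder frozen activations
labels = rng.integers(0, 2, size=1000)        # placeholder binary attribute

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc = probe.score(X_te, y_te)
chance = np.bincount(y_te).max() / len(y_te)  # majority-class chance baseline
print(f"probe accuracy {acc:.3f} vs chance {chance:.3f}")
```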

SAE decomposition

learning a disentangled latent representation by encouraging sparsity, so that each sample only activates a low-dimensional slice of the latent space
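A minimal SAE sketch in PyTorch, assuming an L1 sparsity penalty on a ReLU latent code; the dimensions and penalty weight are illustrative:

```python
# Minimal sparse-autoencoder sketch over a batch of placeholder activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in, d_latent, l1=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)
        self.l1 = l1

    def forward(self, x):
        z = torch.relu(self.enc(x))            # sparse code
        x_hat = self.dec(z)
        recon = ((x_hat - x) ** 2).mean()      # reconstruction loss (MSE)
        sparsity = z.abs().mean()              # L1 sparsity penalty
        return x_hat, z, recon + self.l1 * sparsity

sae = SparseAutoencoder(d_in=512, d_latent=2048)
acts = torch.randn(64, 512)                    # placeholder activations
_, code, loss = sae(acts)
loss.backward()
```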

Feature Manifold & Geometry

manifold cluster based on feature F: subset of points sharing F
local dimension via PCA on a local neighborhood: the number of principal components needed to explain most of the variance
curvature: flat manifold → smooth travel on surface
curved manifold → bumpy ride (a short step might jump to a semantically different region)
d_G components (geodesic distance d_G(x, y) = \min_{\gamma} \int_0^1 \sqrt{\dot\gamma(t)^\top g_{\gamma(t)}\, \dot\gamma(t)}\; dt):
g_{\gamma(t)}: the metric along the path, i.e., how to measure lengths & angles on a curved manifold
geodesic distance: walking on the manifold
Euclidean distance: walking in a straight path (cutting through the manifold)
connectedness: all x containing F reachable through a smooth shift / interpolation
test: interpolate between samples sharing F and check the effect on classification / feature identification & activation (a small numerical sketch follows this list)
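A small numerical sketch, assuming placeholder activations, k-NN neighborhoods for the local PCA, and a k-NN graph shortest path as the geodesic approximation; k and the variance threshold are arbitrary choices:

```python
# Local intrinsic dimension via PCA on a k-NN neighborhood, plus a graph-based
# geodesic distance compared to the straight-line Euclidean distance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X = np.random.default_rng(0).normal(size=(500, 64))   # placeholder activations

def local_dim(X, idx, k=20, var_threshold=0.95):
    # PCA on the k nearest neighbors of point `idx`; count components
    # needed to explain `var_threshold` of the local variance.
    d = np.linalg.norm(X - X[idx], axis=1)
    nbrs = X[np.argsort(d)[:k]]
    ratios = PCA().fit(nbrs).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), var_threshold) + 1)

# Geodesic (graph shortest-path) distance vs Euclidean distance.
# Note: the k-NN graph may be disconnected for random placeholder data.
G = kneighbors_graph(X, n_neighbors=10, mode="distance")
geo = shortest_path(G, directed=False)
print("local dim at point 0:", local_dim(X, 0))
print("geodesic vs euclidean (0,1):", geo[0, 1], np.linalg.norm(X[0] - X[1]))
```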

Adversarial Examples (FGSM & PGD) & mechanistic view

adversarial perturbation introduces a change in activation space: \Delta a = a(x^\prime) - a(x), which can be decomposed using an SAE ⇒ determine hijacked circuits by learning the sparse representation
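A minimal FGSM sketch; epsilon, the target layer, and the placeholder input/label are assumptions. The resulting activation shift is exactly the \Delta a an SAE could then decompose:

```python
# Single-step FGSM attack plus the induced activation shift at an assumed layer.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                  # placeholder input
y = torch.tensor([0])                           # placeholder label
eps = 8 / 255

x_req = x.clone().requires_grad_(True)
F.cross_entropy(model(x_req), y).backward()
x_adv = (x + eps * x_req.grad.sign()).clamp(0, 1)      # FGSM step

# Capture layer3 activations for both inputs and take the difference.
acts = []
h = model.layer3.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
with torch.no_grad():
    model(x)
    model(x_adv)
h.remove()
delta_a = acts[1] - acts[0]                     # activation shift to feed an SAE
```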

Adversarial Path:

An adversarial path is a continuous trajectory through input space (or representation space) that starts at one label and ends at another, while staying imperceptible or minimally different to a human observer.
formal definition: a continuous path \gamma: [0, 1] \to X with \gamma(0) = x, where the model's prediction at \gamma(0) is the source label and at \gamma(1) the target label, subject to a perceptual constraint such as \|\gamma(t) - x\| \le \epsilon for all t
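A quick sketch of scanning predictions along a straight-line path in input space; the model, inputs, and step count are placeholders. In practice the endpoints would be a clean image and its adversarial counterpart (or two semantically different images), and the first index where the argmax flips approximates where the path crosses the decision boundary:

```python
# Scan predicted labels along a linear interpolation path between two inputs.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x_src = torch.rand(1, 3, 224, 224)             # placeholder source input
x_dst = torch.rand(1, 3, 224, 224)             # placeholder destination input

with torch.no_grad():
    labels = []
    for alpha in torch.linspace(0, 1, 50):
        x_t = (1 - alpha) * x_src + alpha * x_dst
        labels.append(model(x_t).argmax(dim=1).item())
print("predicted label along the path:", labels)
```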

clustering Activation atlases:

looking for fractures: concepts whose examples split across multiple clusters in the atlas (see the sketch below)
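A small sketch of building an activation atlas with UMAP + k-means and checking whether a feature's examples fracture across clusters; umap-learn is an assumed extra dependency and all data here is placeholder:

```python
# Project activations, cluster them, and count the clusters a feature F spans.
import numpy as np
from sklearn.cluster import KMeans
import umap

acts = np.random.default_rng(0).normal(size=(2000, 512))   # placeholder activations

emb = umap.UMAP(n_components=2, random_state=0).fit_transform(acts)
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)

# `has_F` is a placeholder boolean mask marking samples that contain feature F;
# landing in more than one cluster suggests a fractured / multi-cluster concept.
has_F = np.random.default_rng(1).random(2000) < 0.1
print("clusters containing F:", np.unique(clusters[has_F]))
```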

Circuit analysis techniques:

ablation:

silence a circuit (zero / mean-ablate its activations) and measure the downstream effect

patching:

replacing a subset of activations for a target image with the corresponding activations from a donor image and observing the downstream effect (a minimal patching sketch follows this list)
feature visualization in the learned sparse representation
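A minimal activation-patching sketch under assumptions similar to the earlier ones (torchvision ResNet-18, layer3 as the patched site, channels 0..63 as the "circuit"); zeroing the same slice instead of copying from the donor gives the ablation variant:

```python
# Cache donor activations, patch them into the target's forward pass, compare logits.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
target = torch.rand(1, 3, 224, 224)             # placeholder target input
donor = torch.rand(1, 3, 224, 224)              # placeholder donor input

cache = {}
def save_hook(_, __, out):
    cache["donor"] = out.detach()               # remember donor activations

def patch_hook(_, __, out):
    patched = out.clone()
    patched[:, :64] = cache["donor"][:, :64]    # patch channels 0..63 (ablation: zero them)
    return patched                              # returned value replaces the layer output

layer = model.layer3
h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(donor)                                # fills the cache
h.remove()

h = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(target)
h.remove()

with torch.no_grad():
    clean_logits = model(target)
print("effect size:", (patched_logits - clean_logits).abs().max().item())
```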

Metrics:

probe accuracy vs chance baseline
reconstruction loss: MSE on SAE
adversarial success rate
effect size in ablation/patching
clustering purity

to check

GIG (PFVs, ERFs)
causal tracing (BLIP)

TCAV: Testing with Concept Activation Vectors

drawing a linear decision boundary in the latent space of a selected layer of the model / network, separating samples with feature F from samples without it; the CAV is the normal to this boundary
CAV formally:
given f_i(x): the logit for class i, c: concept, l: layer with activations a_l(x); the CAV v_c^l is the boundary normal, and the conceptual sensitivity is S_{c,i,l}(x) = \nabla_{a_l} f_i(x) \cdot v_c^l
how v_c^l aligns with this gradient reflects the concept's contribution / correspondence to that class (the TCAV score is the fraction of samples with S_{c,i,l}(x) > 0)
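A minimal CAV/TCAV sketch with placeholder data; concept_acts, random_acts, and grads stand in for layer-l activations of concept vs. random examples and for gradients of the class-i logit w.r.t. those activations:

```python
# Fit a linear boundary between concept and random activations, then score
# the alignment of class gradients with the boundary normal (the CAV).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_acts = rng.normal(0.5, 1.0, size=(200, 512))   # placeholder concept examples
random_acts = rng.normal(0.0, 1.0, size=(200, 512))    # placeholder random examples
grads = rng.normal(size=(100, 512))                    # placeholder d f_i / d a_l

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 200 + [0] * 200)
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])      # CAV = unit boundary normal

sensitivities = grads @ cav                            # S_{c,i,l}(x) per sample
tcav_score = (sensitivities > 0).mean()                # fraction with positive sensitivity
print("TCAV score:", tcav_score)
```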

Integrated Gradient

let x^\prime be a baseline input (mean / zero image); IG attributes along the straight line from x^\prime to x:
\mathrm{IG}_i(x) = (x_i - x^\prime_i) \int_0^1 \frac{\partial f(x^\prime + \alpha (x - x^\prime))}{\partial x_i}\, d\alpha
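A minimal Riemann-sum approximation of IG, assuming a torchvision ResNet-18, a zero-image baseline, and an arbitrary target class; the step count is illustrative:

```python
# Approximate the IG path integral with a discrete average of gradients.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                  # placeholder input
baseline = torch.zeros_like(x)                  # zero-image baseline
target_class, steps = 0, 20

alphas = torch.linspace(0, 1, steps).view(-1, 1, 1, 1)
path = baseline + alphas * (x - baseline)       # interpolated inputs along the line
path.requires_grad_(True)

model(path)[:, target_class].sum().backward()

avg_grads = path.grad.mean(dim=0, keepdim=True) # average gradient along the path
ig = (x - baseline) * avg_grads                 # integrated-gradients attribution
```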

Layer Integrated Gradient

let z = f_\ell(x), \quad z^\prime = f_\ell(x^\prime); apply IG in the activation space of layer \ell, treating the rest of the network g_\ell as the function:
\mathrm{LIG}_i(x) = (z_i - z^\prime_i) \int_0^1 \frac{\partial g_\ell(z^\prime + \alpha (z - z^\prime))}{\partial z_i}\, d\alpha

Generalized Integrated Gradient

let \gamma(\alpha) be a smooth path from x^\prime to x (\gamma(0) = x^\prime, \gamma(1) = x); the path-integral form generalizes IG beyond the straight line:
\mathrm{GIG}_i(x) = \int_0^1 \frac{\partial f(\gamma(\alpha))}{\partial \gamma_i(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha}\, d\alpha

Discretized Integrated Gradient:

GIG computed along a discrete path of interpolation points rather than a continuous straight line; common in NLP, where interpolation steps can be anchored to real token embeddings

Concept localization in hidden layers

CAV-projected IG score per layer
TCAV: sign consistency over a batch of samples
concepts vs layers heatmap / matrix (a placeholder sketch follows this list)
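A placeholder sketch of such a matrix; every concept name and score here is illustrative, not a measured result:

```python
# Concepts-vs-layers matrix: rows are concepts, columns are layers,
# values are assumed TCAV scores (all numbers are placeholders).
import numpy as np
import matplotlib.pyplot as plt

layers = ["layer1", "layer2", "layer3", "layer4"]
concepts = ["stripes", "dotted"]                       # hypothetical concepts
scores = np.array([[0.52, 0.61, 0.83, 0.74],           # placeholder TCAV scores
                   [0.50, 0.55, 0.58, 0.71]])

plt.imshow(scores, cmap="viridis", vmin=0, vmax=1)
plt.xticks(range(len(layers)), layers)
plt.yticks(range(len(concepts)), concepts)
plt.colorbar(label="TCAV score")
plt.title("concept localization across layers")
plt.show()
```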

Concept Flow throughout the network

concept score through layers: track the TCAV / CAV-projected score for a concept at each layer to see where it emerges, peaks, and fades

PFV: Pointwise Feature Vector

a CAV applied per spatial position of the feature map (no GAP), giving a spatial map of concept presence

Dimensionality Reduction & High-Dimensional Projection (intrinsic dimension extraction):

(to elaborate on / note; a small comparison sketch follows the list)
PCA
ICA
KPCA
t-SNE
UMAP
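A quick comparison sketch that runs each listed method on placeholder activations; umap-learn is an assumed extra dependency and the 2-D target dimension is illustrative:

```python
# Project the same placeholder activations with each dimensionality-reduction method.
import numpy as np
from sklearn.decomposition import PCA, FastICA, KernelPCA
from sklearn.manifold import TSNE
import umap

acts = np.random.default_rng(0).normal(size=(1000, 256))   # placeholder activations

proj = {
    "PCA": PCA(n_components=2).fit_transform(acts),
    "ICA": FastICA(n_components=2, random_state=0).fit_transform(acts),
    "KPCA": KernelPCA(n_components=2, kernel="rbf").fit_transform(acts),
    "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(acts),
    "UMAP": umap.UMAP(n_components=2, random_state=0).fit_transform(acts),
}
for name, emb in proj.items():
    print(name, emb.shape)
```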