Mechanistic Interpretability

mechanistic interpretability is mainly reverse engineering a trained DL model to peek into what it is looking at, and which internal computations it uses, to perform the task at hand

core concepts:

feature visualization (activation maximization)

using gradient ascent to identify images / samples that maximize a selected circuit (neuron / conv channel / layer / …):
x^* = \arg\max_x \; a(x) - \lambda R(x)
where a(x) is the activation of the selected circuit and R(x) is a regularization term, which could be:
L_2 norm
TV (total variation)
jitter (random shifts / transformations during optimization)
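A minimal PyTorch sketch of this, assuming a torchvision ResNet-18 and an arbitrary channel of layer3 as the target; the layer, channel index, step count, learning rate, and regularization weights are all illustrative assumptions:

```python
# Minimal activation-maximization sketch; target layer/channel and weights are assumptions.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

acts = {}
def hook(_, __, out):
    acts["target"] = out                               # cache the layer's activations
model.layer3.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)   # optimize the input itself
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    dy, dx = torch.randint(-4, 5, (2,)).tolist()       # jitter: random spatial shift
    model(torch.roll(x, shifts=(dy, dx), dims=(2, 3)))
    act = acts["target"][0, 7].mean()                  # mean activation of channel 7
    tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() \
       + (x[..., :, 1:] - x[..., :, :-1]).abs().mean() # total variation
    loss = -act + 1e-2 * x.pow(2).mean() + 1e-3 * tv   # -a(x) + lambda * R(x)
    loss.backward()
    opt.step()
```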

Linear Probing

train a linear classifier on frozen activations to probe what the learned representation encodes and to get an intuition of the manifold shape
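A minimal probing sketch with scikit-learn, using placeholder activations and a placeholder binary attribute; the probe accuracy is compared against a majority-class chance baseline (cf. the Metrics section below):

```python
# Minimal linear-probing sketch on placeholder activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))           # placeholder frozen activations
labels = rng.integers(0, 2, size=1000)        # placeholder binary attribute

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc = probe.score(X_te, y_te)
chance = np.bincount(y_te).max() / len(y_te)  # majority-class chance baseline
print(f"probe accuracy {acc:.3f} vs chance {chance:.3f}")
```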

SAE decomposition

learning a disentangled latent representation by encouraging sparsity, so that each sample only activates a low-dimensional slice of the latent space
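A minimal SAE sketch in PyTorch, assuming an L1 sparsity penalty on a ReLU latent code; the dimensions and penalty weight are illustrative:

```python
# Minimal sparse-autoencoder sketch over a batch of placeholder activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in, d_latent, l1=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)
        self.l1 = l1

    def forward(self, x):
        z = torch.relu(self.enc(x))            # sparse code
        x_hat = self.dec(z)
        recon = ((x_hat - x) ** 2).mean()      # reconstruction loss (MSE)
        sparsity = z.abs().mean()              # L1 sparsity penalty
        return x_hat, z, recon + self.l1 * sparsity

sae = SparseAutoencoder(d_in=512, d_latent=2048)
acts = torch.randn(64, 512)                    # placeholder activations
_, code, loss = sae(acts)
loss.backward()
```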

Feature Manifold & Geometry

manifold cluster based on feature F: subset of points sharing F
local dimension via PCA on a local neighborhood: the number of principal components needed to explain most of the variance
curvature: flat manifold → smooth travel on surface
curved manifold → bumpy ride (a short step might jump to a semantically different region)
d_G components (geodesic distance d_G(x, y) = \min_{\gamma} \int_0^1 \sqrt{\dot\gamma(t)^\top g_{\gamma(t)}\, \dot\gamma(t)}\; dt):
g_{\gamma(t)}: the metric along the path, i.e., how to measure lengths & angles on a curved manifold
geodesic distance: walking on the manifold
Euclidean distance: walking in a straight path (cutting through the manifold)
connectedness: all x containing F reachable through a smooth shift / interpolation
test: interpolate between samples sharing F and check the effect on classification / feature identification & activation (a small numerical sketch follows this list)
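A small numerical sketch, assuming placeholder activations, k-NN neighborhoods for the local PCA, and a k-NN graph shortest path as the geodesic approximation; k and the variance threshold are arbitrary choices:

```python
# Local intrinsic dimension via PCA on a k-NN neighborhood, plus a graph-based
# geodesic distance compared to the straight-line Euclidean distance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X = np.random.default_rng(0).normal(size=(500, 64))   # placeholder activations

def local_dim(X, idx, k=20, var_threshold=0.95):
    # PCA on the k nearest neighbors of point `idx`; count components
    # needed to explain `var_threshold` of the local variance.
    d = np.linalg.norm(X - X[idx], axis=1)
    nbrs = X[np.argsort(d)[:k]]
    ratios = PCA().fit(nbrs).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), var_threshold) + 1)

# Geodesic (graph shortest-path) distance vs Euclidean distance.
# Note: the k-NN graph may be disconnected for random placeholder data.
G = kneighbors_graph(X, n_neighbors=10, mode="distance")
geo = shortest_path(G, directed=False)
print("local dim at point 0:", local_dim(X, 0))
print("geodesic vs euclidean (0,1):", geo[0, 1], np.linalg.norm(X[0] - X[1]))
```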

Adversarial Examples (FGSM & PGD) & mechanistic view

adversarial perturbation introduces a change in activation space: \Delta a = a(x^\prime) - a(x), which can be decomposed using an SAE ⇒ determine hijacked circuits by learning the sparse representation
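A minimal FGSM sketch; epsilon, the target layer, and the placeholder input/label are assumptions. The resulting activation shift is exactly the \Delta a an SAE could then decompose:

```python
# Single-step FGSM attack plus the induced activation shift at an assumed layer.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                  # placeholder input
y = torch.tensor([0])                           # placeholder label
eps = 8 / 255

x_req = x.clone().requires_grad_(True)
F.cross_entropy(model(x_req), y).backward()
x_adv = (x + eps * x_req.grad.sign()).clamp(0, 1)      # FGSM step

# Capture layer3 activations for both inputs and take the difference.
acts = []
h = model.layer3.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
with torch.no_grad():
    model(x)
    model(x_adv)
h.remove()
delta_a = acts[1] - acts[0]                     # activation shift to feed an SAE
```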

Adversarial Path:

An adversarial path is a continuous trajectory through input space (or representation space) that starts at one label and ends at another, while staying imperceptible or minimally different to a human observer.
formal definition: a continuous path \gamma: [0, 1] \to X with \gamma(0) = x, where the model's prediction at \gamma(0) is the source label and at \gamma(1) the target label, subject to a perceptual constraint such as \|\gamma(t) - x\| \le \epsilon for all t
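A quick sketch of scanning predictions along a straight-line path in input space; the model, inputs, and step count are placeholders. In practice the endpoints would be a clean image and its adversarial counterpart (or two semantically different images), and the first index where the argmax flips approximates where the path crosses the decision boundary:

```python
# Scan predicted labels along a linear interpolation path between two inputs.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x_src = torch.rand(1, 3, 224, 224)             # placeholder source input
x_dst = torch.rand(1, 3, 224, 224)             # placeholder destination input

with torch.no_grad():
    labels = []
    for alpha in torch.linspace(0, 1, 50):
        x_t = (1 - alpha) * x_src + alpha * x_dst
        labels.append(model(x_t).argmax(dim=1).item())
print("predicted label along the path:", labels)
```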

clustering Activation atlases:

looking for fractures: concepts whose examples split across multiple clusters in the atlas (see the sketch below)
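A small sketch of building an activation atlas with UMAP + k-means and checking whether a feature's examples fracture across clusters; umap-learn is an assumed extra dependency and all data here is placeholder:

```python
# Project activations, cluster them, and count the clusters a feature F spans.
import numpy as np
from sklearn.cluster import KMeans
import umap

acts = np.random.default_rng(0).normal(size=(2000, 512))   # placeholder activations

emb = umap.UMAP(n_components=2, random_state=0).fit_transform(acts)
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)

# `has_F` is a placeholder boolean mask marking samples that contain feature F;
# landing in more than one cluster suggests a fractured / multi-cluster concept.
has_F = np.random.default_rng(1).random(2000) < 0.1
print("clusters containing F:", np.unique(clusters[has_F]))
```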

Circuit analysis techniques:

ablation:

silence a circuit (zero / mean-ablate its activations) and measure the downstream effect

patching:

replacing a subset of activations for a target image with the corresponding activations from a donor image and observing the downstream effect (a minimal patching sketch follows this list)
feature visualization in the learned sparse representation
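A minimal activation-patching sketch under assumptions similar to the earlier ones (torchvision ResNet-18, layer3 as the patched site, channels 0..63 as the "circuit"); zeroing the same slice instead of copying from the donor gives the ablation variant:

```python
# Cache donor activations, patch them into the target's forward pass, compare logits.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
target = torch.rand(1, 3, 224, 224)             # placeholder target input
donor = torch.rand(1, 3, 224, 224)              # placeholder donor input

cache = {}
def save_hook(_, __, out):
    cache["donor"] = out.detach()               # remember donor activations

def patch_hook(_, __, out):
    patched = out.clone()
    patched[:, :64] = cache["donor"][:, :64]    # patch channels 0..63 (ablation: zero them)
    return patched                              # returned value replaces the layer output

layer = model.layer3
h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(donor)                                # fills the cache
h.remove()

h = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(target)
h.remove()

with torch.no_grad():
    clean_logits = model(target)
print("effect size:", (patched_logits - clean_logits).abs().max().item())
```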

Metrics:

probe accuracy vs chance baseline
reconstruction loss: MSE on SAE
adversarial success rate
effect size in ablation/patching
clustering purity

to check

GIG (PFVs, ERFs)
causal tracing (BLIP)

TCAV: Testing with Concept Activation Vectors

drawing a linear decision boundary in the latent space of a selected layer of the model / network, separating samples with feature F from samples without it; the CAV is the normal to this boundary
CAV formally:
given f_i(x): the logit for class i, c: concept, l: layer with activations a_l(x); the CAV v_c^l is the boundary normal, and the conceptual sensitivity is S_{c,i,l}(x) = \nabla_{a_l} f_i(x) \cdot v_c^l
how v_c^l aligns with this gradient reflects the concept's contribution / correspondence to that class (the TCAV score is the fraction of samples with S_{c,i,l}(x) > 0)
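A minimal CAV/TCAV sketch with placeholder data; concept_acts, random_acts, and grads stand in for layer-l activations of concept vs. random examples and for gradients of the class-i logit w.r.t. those activations:

```python
# Fit a linear boundary between concept and random activations, then score
# the alignment of class gradients with the boundary normal (the CAV).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_acts = rng.normal(0.5, 1.0, size=(200, 512))   # placeholder concept examples
random_acts = rng.normal(0.0, 1.0, size=(200, 512))    # placeholder random examples
grads = rng.normal(size=(100, 512))                    # placeholder d f_i / d a_l

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 200 + [0] * 200)
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])      # CAV = unit boundary normal

sensitivities = grads @ cav                            # S_{c,i,l}(x) per sample
tcav_score = (sensitivities > 0).mean()                # fraction with positive sensitivity
print("TCAV score:", tcav_score)
```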

Integrated Gradient

let x^\prime be a baseline input (mean / zero image); IG attributes along the straight line from x^\prime to x:
\mathrm{IG}_i(x) = (x_i - x^\prime_i) \int_0^1 \frac{\partial f(x^\prime + \alpha (x - x^\prime))}{\partial x_i}\, d\alpha
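A minimal Riemann-sum approximation of IG, assuming a torchvision ResNet-18, a zero-image baseline, and an arbitrary target class; the step count is illustrative:

```python
# Approximate the IG path integral with a discrete average of gradients.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                  # placeholder input
baseline = torch.zeros_like(x)                  # zero-image baseline
target_class, steps = 0, 20

alphas = torch.linspace(0, 1, steps).view(-1, 1, 1, 1)
path = baseline + alphas * (x - baseline)       # interpolated inputs along the line
path.requires_grad_(True)

model(path)[:, target_class].sum().backward()

avg_grads = path.grad.mean(dim=0, keepdim=True) # average gradient along the path
ig = (x - baseline) * avg_grads                 # integrated-gradients attribution
```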

Layer Integrated Gradient

let z = f_\ell(x), \quad z^\prime = f_\ell(x^\prime); apply IG in the activation space of layer \ell, treating the rest of the network g_\ell as the function:
\mathrm{LIG}_i(x) = (z_i - z^\prime_i) \int_0^1 \frac{\partial g_\ell(z^\prime + \alpha (z - z^\prime))}{\partial z_i}\, d\alpha

Generalized Integrated Gradient

let \gamma(\alpha) be a smooth path from x^\prime to x (\gamma(0) = x^\prime, \gamma(1) = x); the path-integral form generalizes IG beyond the straight line:
\mathrm{GIG}_i(x) = \int_0^1 \frac{\partial f(\gamma(\alpha))}{\partial \gamma_i(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha}\, d\alpha

Discretized Integrated Gradient:

GIG computed along a discrete path of interpolation points rather than a continuous straight line; common in NLP, where interpolation steps can be anchored to real token embeddings

Concept localization in hidden layers

CAV-projected IG score per layer
TCAV: sign consistency over a batch of samples
concepts vs layers heatmap / matrix (a placeholder sketch follows this list)
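A placeholder sketch of such a matrix; every concept name and score here is illustrative, not a measured result:

```python
# Concepts-vs-layers matrix: rows are concepts, columns are layers,
# values are assumed TCAV scores (all numbers are placeholders).
import numpy as np
import matplotlib.pyplot as plt

layers = ["layer1", "layer2", "layer3", "layer4"]
concepts = ["stripes", "dotted"]                       # hypothetical concepts
scores = np.array([[0.52, 0.61, 0.83, 0.74],           # placeholder TCAV scores
                   [0.50, 0.55, 0.58, 0.71]])

plt.imshow(scores, cmap="viridis", vmin=0, vmax=1)
plt.xticks(range(len(layers)), layers)
plt.yticks(range(len(concepts)), concepts)
plt.colorbar(label="TCAV score")
plt.title("concept localization across layers")
plt.show()
```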

Concept Flow throughout the network

concept score through layers: track the TCAV / CAV-projected score for a concept at each layer to see where it emerges, peaks, and fades

PFV: Pointwise Feature Vector

a CAV applied per spatial position of the feature map (no GAP), giving a spatial map of concept presence

Dimensionality Reduction & High-Dimensional Projection (intrinsic dimension extraction):

(to elaborate on / note; a small comparison sketch follows the list)
PCA
ICA
KPCA
t-SNE
UMAP
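A quick comparison sketch that runs each listed method on placeholder activations; umap-learn is an assumed extra dependency and the 2-D target dimension is illustrative:

```python
# Project the same placeholder activations with each dimensionality-reduction method.
import numpy as np
from sklearn.decomposition import PCA, FastICA, KernelPCA
from sklearn.manifold import TSNE
import umap

acts = np.random.default_rng(0).normal(size=(1000, 256))   # placeholder activations

proj = {
    "PCA": PCA(n_components=2).fit_transform(acts),
    "ICA": FastICA(n_components=2, random_state=0).fit_transform(acts),
    "KPCA": KernelPCA(n_components=2, kernel="rbf").fit_transform(acts),
    "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(acts),
    "UMAP": umap.UMAP(n_components=2, random_state=0).fit_transform(acts),
}
for name, emb in proj.items():
    print(name, emb.shape)
```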