What is mechanistic interpretability?
Mechanistic interpretability aims to “reverse engineer neural networks into human understandable algorithms”. It is a subfield of interpretability that emphasizes understanding the “actual mechanisms and algorithms that compose the network”, in contrast to other forms of interpretability research, which try to explain how a network's outputs relate to human concepts without explaining the network's internal functioning.
The field rests on three core claims (a short code sketch after the list makes the first two concrete):
Claim 1: Features
Features are the fundamental unit of neural networks. They correspond to directions [in the vector space of possible neuron activations]. These features can be rigorously studied and understood.
Claim 2: Circuits
Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
Claim 3: Universality
Analogous features and circuits form across models and tasks.
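To make Claims 1 and 2 concrete, here is a minimal numerical sketch. The weight matrix, activations, and feature directions below are random stand-ins rather than quantities taken from a real model; in practice feature directions are found with techniques such as probing or dictionary learning. The geometry, however, is the same: a feature's activation on an input is the projection of the activation vector onto the feature's direction (Claim 1), and the connection between features in adjacent layers is read off through the weights (Claim 2).

```python
import numpy as np

# Toy sketch of Claims 1 and 2. All quantities are random stand-ins,
# not values from a trained model.
rng = np.random.default_rng(0)
d_in, d_out = 512, 512

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # weights of one layer
activations = rng.normal(size=(16, d_in))           # stand-in input activations

def unit(v):
    """Normalize a vector so it represents a direction."""
    return v / np.linalg.norm(v)

# Claim 1: a feature is a direction in activation space; its "activation"
# on a given input is the projection of that input's activation vector onto it.
feature_in = unit(rng.normal(size=d_in))
feature_in_activation = activations @ feature_in    # one scalar per input

# Claim 2: features in adjacent layers are connected through the weights.
# The effective scalar connection between an input feature and an output
# feature is the bilinear form  feature_out · W · feature_in.
feature_out = unit(rng.normal(size=d_out))
connection_strength = feature_out @ W @ feature_in

print("feature activations (first 3 inputs):", feature_in_activation[:3])
print("feature_in -> feature_out connection strength:", connection_strength)
```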
Current mechanistic interpretability research focuses on identifying the circuits within a model that are responsible for particular behaviors, and on understanding phenomena such as grokking, superposition, and phase changes.
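Of these phenomena, superposition refers to a model representing more features than it has dimensions by assigning them non-orthogonal directions that interfere with each other only slightly. The sketch below (with arbitrary, purely illustrative sizes) shows the geometric fact this relies on: an activation space of dimension d can host far more than d directions whose pairwise overlap stays small.

```python
import numpy as np

# Toy illustration of the geometry behind superposition: many more
# nearly-orthogonal directions than dimensions. Sizes are arbitrary.
rng = np.random.default_rng(0)

d = 100            # dimensionality of the activation space
n_features = 1000  # far more candidate feature directions than dimensions

directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference = cosine similarity between distinct feature directions.
cosines = directions @ directions.T
np.fill_diagonal(cosines, 0.0)

# The maximum pairwise overlap stays well below 1: a thousand directions
# coexist in a hundred dimensions with only modest interference.
print(f"max |cos| between distinct directions: {np.abs(cosines).max():.3f}")
```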