What is mechanistic interpretability?

Mechanistic interpretability aims to “reverse engineer neural networks into human understandable algorithms”. It is a subfield of interpretability that emphasizes trying to understand the “actual mechanisms and algorithms that compose the network”, in contrast to other forms of interpretability research that try to explain how a network's outputs relate to human concepts without explaining the internal functioning of the network.

Three core hypotheses of the field are:¹

> Claim 1: Features
>
> Features are the fundamental unit of neural networks. They correspond to directions [in the vector space of possible neuron activations]. These features can be rigorously studied and understood.
>
> Claim 2: Circuits
>
> Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
>
> Claim 3: Universality
>
> Analogous features and circuits form across models and tasks.
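
These claims suggest a simple way to picture features and circuits. The toy NumPy sketch below treats a feature as a unit direction in activation space (Claim 1) and reads off a "circuit" weight by projecting the weight matrix between two layers onto a pair of feature directions (Claim 2). All vectors, matrices, and names here (e.g. `curve_feature`) are invented for illustration and do not come from a real model.

```python
import numpy as np

# Claim 1 (illustrative only): a "feature" as a direction in activation space.
activations = np.array([0.2, -1.3, 0.7, 2.1])    # hidden-layer activations for one input (made up)
curve_feature = np.array([0.0, 0.5, 0.5, 0.7])   # hypothetical "curve detector" direction
curve_feature /= np.linalg.norm(curve_feature)   # normalize so projections are comparable

# How strongly this input expresses the feature: project activations onto the direction.
feature_strength = activations @ curve_feature
print(f"feature activation: {feature_strength:.3f}")

# Claim 2 (illustrative only): a "circuit" connects features in adjacent layers via weights.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))                # weight matrix between the two layers (random here)
next_layer_feature = np.array([1.0, 0.0, 0.0, 0.0])  # a feature direction in the next layer

# The effective weight from the curve feature to the next-layer feature
# is the weight matrix projected onto both feature directions.
circuit_weight = next_layer_feature @ W @ curve_feature
print(f"effective feature-to-feature weight: {circuit_weight:.3f}")
```

In a real analysis, feature directions are found empirically (for example with probes or dictionary-learning methods) rather than written down by hand.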

Current mechanistic interpretability research focuses on identifying the circuits within a model that are responsible for particular behaviors, and on understanding phenomena such as grokking, superposition, and phase changes.

  1. These claims can be seen as analogous to Theodor Schwann’s cell theory. The first two parts of that theory propose that cells are the basic units of life and make up all organisms, just as features and circuits are proposed as the basic units of neural networks. The last part proposes that cells arise from pre-existing cells; the analogy with “universality” is that certain functional motifs are preserved across species, much as analogous features and circuits recur across models. ↩︎