What are circuits in interpretability?

Circuits are a concept in mechanistic interpretability, which examines individual neurons or even the specific numbers that make up individual weights in a neural network. The objective is to learn which neurons fire together, and thereby to recover the underlying learned algorithms, by discovering patterns within the weights/parameters of the network rather than looking only at the outputs of the network as a whole.

An intuitive example of a circuit might be a ‘dog circuit’ that is composed of the individual features[1] of eyes, snout, fur, and tongue. The image below shows an example of a circuit that can identify complex curves in an image by breaking the identification down into smaller identifications of basic curve features.


The following is a slightly more technical explanation that builds up the concept of circuits piece by piece:

  • Features: Neural networks are essentially made up of a lot of numbers. Sets of these numbers somehow represent different aspects of the world. We call these represented aspects of the world features. Some researchers claim that features are the fundamental units of neural networks. We can study which sets of numbers/parameters in the network represent which aspects of the world. Overall, features can (and should) be rigorously studied and understood.

Examples of features in this context include curve detectors and high-low frequency edge detectors. These correspond to the low-level features in the image below and are found in almost every non-trivial vision model. There are claims that neural networks encode higher-level, human-understandable features (e.g., ears and faces) in the later layers; however, there is significant skepticism in the community about the existence of such higher-level features.


  • Circuits: Once we accept the claim that certain sets of numbers/parameters in a neural network encode features, we can begin to look at sets of sets of parameters. In other words, we can start looking at sets of features. These features are connected by weights, forming circuits. Just as individual features can be understood, researchers claim that circuits can also be rigorously studied and understood. In studying them, researchers have already found beautiful, rich structures in circuits. This means that we can literally read meaningful algorithms off the weights of a network! This is very promising for alignment research, as it helps us understand the underlying learned algorithm that the neural network encodes, which was previously a complete black box.
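
The idea of features connected by weights can be sketched with a toy version of the ‘dog circuit’ mentioned earlier. This is only an illustration under stated assumptions: the feature names, the weight values, the bias, and the helper functions dog_unit and read_circuit are all invented for this example and do not come from any real model. The point it tries to show is that the weights feeding a later unit can be read directly to see which earlier features excite or inhibit it.

```python
# Toy sketch only: every name and number here is hypothetical,
# not taken from a real network.

# Weights from earlier-layer features to one later-layer "dog" unit.
# Positive weights excite the unit, negative weights inhibit it.
dog_weights = {
    "eyes": 0.9,
    "snout": 1.2,
    "fur": 0.7,
    "tongue": 0.4,
    "car_wheel": -1.1,
}
bias = -1.5


def relu(x):
    """Standard rectified linear activation."""
    return max(0.0, x)


def dog_unit(feature_activations):
    """Activation of the dog unit given earlier feature activations.

    Features missing from the input are treated as inactive (0.0).
    """
    weighted_sum = sum(
        weight * feature_activations.get(name, 0.0)
        for name, weight in dog_weights.items()
    )
    return relu(bias + weighted_sum)


def read_circuit(weights, top_k=3):
    """Return the top_k connections with the largest absolute weight."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


dog_image = {"eyes": 1.0, "snout": 1.0, "fur": 1.0, "tongue": 0.5}
car_image = {"car_wheel": 1.0}

print("dog image activation:", dog_unit(dog_image))
print("car image activation:", dog_unit(car_image))
print("strongest connections:", read_circuit(dog_weights))
```

In this sketch the strongest connections are the snout feature (excitatory) and the car-wheel feature (inhibitory), which is the kind of statement circuits research aims to make about real weights. Real circuits involve many more units and layers, so this should be read as a cartoon of the idea rather than a method.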

Applicability to alignment

The final claim within the circuits agenda is that analogous features and circuits form across models and tasks, and that circuits might display a kind of universality or “convergent learning”. The claim is that networks which learn similar features might have similar patterns in their weights. In other words, neural networks might be trained on different data, but if they learn to recognize similar concepts, then the encoding of these concepts through the underlying circuits will be similar across networks. If this is true, the authors envision a kind of “periodic table of visual features” that would be applicable across models.


While the field looks promising, several limitations and roadblocks need to be overcome. One major challenge for circuits is polysemantic neurons, which arise due to superposition. When a network has to recognize more features than it has neurons, it packs the extra features into the existing architecture by using one neuron to encode multiple features, which means that we cannot study such neurons in isolation. Another roadblock would be if the universality claim were mostly false, since interpretability would then have to focus on a handful of models of particular societal importance.
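
A minimal sketch of why superposition makes single neurons hard to interpret, assuming a deliberately simplified linear setup: three hypothetical features ("curve", "fur", "wheel") are packed into only two neurons by giving each feature its own direction, 120 degrees apart. The feature names, the geometry, and the helper functions are assumptions made for illustration; real networks are nonlinear and much larger.

```python
import math

# Assumption: three hypothetical features share two neurons.
# Each feature is a unit vector in the 2D neuron space, 120 degrees apart.
feature_names = ["curve", "fur", "wheel"]
directions = {
    name: (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i, name in enumerate(feature_names)
}


def neuron_activations(active_features):
    """Add up the directions of the active features.

    Returns the values of the two neurons as a (neuron0, neuron1) tuple.
    """
    n0 = sum(directions[f][0] for f in active_features)
    n1 = sum(directions[f][1] for f in active_features)
    return (n0, n1)


def features_driving(neuron_index, threshold=0.1):
    """List the features whose direction has a sizeable component on a neuron."""
    return [
        name
        for name in feature_names
        if abs(directions[name][neuron_index]) > threshold
    ]


# Each neuron is moved by more than one feature, so it is polysemantic.
print("neuron 0 responds to:", features_driving(0))
print("neuron 1 responds to:", features_driving(1))

# Looking at neuron 0 alone cannot tell "fur" apart from "wheel".
print("fur only:  ", neuron_activations(["fur"]))
print("wheel only:", neuron_activations(["wheel"]))
```

In this toy setup neuron 0 takes essentially the same value whether only "fur" or only "wheel" is present, so the two features can only be told apart by looking at both neurons together. That is the sense in which polysemantic neurons cannot be studied in isolation.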

Overall, the circuits agenda seems very promising in helping us open up the black box of neural networks. Circuits are well poised to be a tool in the interpretability toolkit for mitigating the risks of inner misalignment and deceptive alignment, since we might now be able to study the underlying learned algorithm just from observations of the network parameters.

  1. The word feature is used in at least four different ways in ML: 1) properties of input data, 2) whatever the network dedicates one entire neuron towards understanding, 3) an arbitrary function composed of neuron activations, 4) an arbitrary function composed of neuron activations that leads to human-understandable representations. This article uses the word in the fourth sense. Papers in interpretability research often use 2), 3), and 4) as a combined working definition. ↩︎

  2. image source ↩︎