What are polysemantic neurons?

Neural networks often contain “polysemantic neurons” that respond to multiple unrelated inputs, unlike “monosemantic” neurons, which fire only in response to a single human-understandable feature or concept.

As an example, this image shows feature visualizations of one particular neuron, which activates when it sees a cat face, cat legs, or the front of a car. Note that this neuron is not responding to both cats and cars because they share some subtle feature.

How do we know that the neurons are not encoding some sort of shared similarity?

We know the network is not picking up on some underlying similarity because of work researchers have done on creating feature visualizations with diversity.[1] The authors found that the neuron activates strongly for maximally "cat-face-ish", "car-ish", and "furry-leg-ish" inputs individually, rather than for some strange visualization of "whatever it is that cats and cars share".

Why do polysemantic neurons form?

They seem to result from a phenomenon known as “superposition”, in which a circuit spreads multiple features across its neurons so as to pack more features into the limited number of neurons it has available. This lets the model represent a given set of features using fewer neurons, conserving capacity for more important tasks.
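To make this concrete, here is a minimal NumPy sketch of the idea (a toy illustration with made-up numbers, not taken from any particular paper): eight features are packed into only three neurons by giving each feature its own direction in neuron space.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_neurons = 8, 3          # more features than neurons
# Each feature gets a (unit-length) direction in neuron space.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 2 is active.
x = np.zeros(n_features)
x[2] = 1.0

h = W.T @ x                            # neuron activations (superposed code)
x_hat = W @ h                          # linear readout score for each feature

print("true feature:", np.argmax(x))
print("recovered scores:", np.round(x_hat, 2))
# Feature 2 should get the largest score; the other entries are nonzero
# "interference" from squeezing 8 features into just 3 neurons.
```

Because eight directions cannot all be orthogonal in three dimensions, reading out one feature also picks up small contributions from the others; that interference is the price the network pays for superposition.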

What are the consequences of polysemantic neurons arising in networks?

Polysemantic neurons are a major challenge for the “circuits” agenda because they significantly limit our ability to reason about neural networks. Because each one encodes multiple features, it is much harder to think about these neurons individually when analyzing circuits. As an example: if we have just two polysemantic neurons, each encoding five different features, then the single weight connecting them effectively governs 25 different feature-to-feature interactions.
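As a rough illustration of that counting argument, consider the hypothetical loadings below (the coefficients are made up for illustration): five features load onto the first polysemantic neuron, the second neuron writes to five downstream features, and the one scalar weight between the neurons scales all 25 resulting couplings at once.

```python
import numpy as np

# Hypothetical loadings: how strongly each of 5 upstream features drives
# polysemantic neuron A, and how strongly neuron B writes to each of 5
# downstream features.
a = np.array([0.9, -0.4, 0.7, 0.2, -0.8])   # features -> neuron A
b = np.array([0.5, 0.6, -0.3, 0.9, 0.1])    # neuron B -> features

w = 1.3                                      # the single learned weight A -> B

# Effective feature-to-feature couplings: all 25 of them are the outer
# product of the loadings, scaled by the same scalar weight.
coupling = w * np.outer(a, b)
print(coupling.shape)   # (5, 5): 25 interactions governed by one weight
```

Changing that one weight moves all 25 couplings together, which is exactly what makes it hard to reason about the circuit one feature at a time.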

Polysemanticity and superposition are both active fields of research. Even though they pose a challenge to interpretability, there is hope that further research into disentangling representations will help us overcome these challenges.


  1. Diversity here refers to variety in the generated feature visualizations. Usually, similar-looking images activate a neuron similarly, so optimizing for a single visualization tends to produce very similar-looking examples of what a given feature encodes. This is addressed by adding a diversity term to the loss function: a whole batch of visualizations is generated at once, and visualizations that are too similar to each other are penalized. The result is a diverse set of images representing the same feature. ↩︎
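One simple way to implement a diversity term like this is to penalize the average pairwise similarity between the visualizations in the batch. The sketch below uses cosine similarity between flattened images; treat it as an illustrative assumption rather than the exact objective used in the cited work, which may compare internal activations rather than raw pixels.

```python
import numpy as np

def diversity_penalty(batch):
    """Penalty that is high when the visualizations in a batch look alike.

    batch: array of shape (B, H, W, C), a batch of candidate
    feature-visualization images. Returns the mean pairwise cosine
    similarity between the flattened images; adding this (times a
    coefficient) to the optimization objective pushes them apart.
    """
    flat = batch.reshape(len(batch), -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T                               # (B, B) cosine similarities
    off_diag = sim[~np.eye(len(batch), dtype=bool)]   # ignore self-similarity
    return off_diag.mean()

rng = np.random.default_rng(0)
print(diversity_penalty(rng.normal(size=(4, 32, 32, 3))))
```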