What are polysemantic neurons?
For a “monosemantic” neuron, there’s a single feature that determines whether or not it activates strongly. If a neuron responds only to grandmothers, we might call it a grandmother neuron. For a “polysemantic” neuron, in contrast, there are multiple features that can cause it to activate strongly.
As an example, this image shows feature visualizations of a neuron that activates when it sees any of a cat face, cat legs, or the front of a car. As far as anyone can tell, this neuron is not responding to both cats and cars because cars and cats share some underlying feature. Rather, the neuron just happened to get two unrelated features attached to it.
Why do we think that the neurons are not encoding a shared similarity?
Suppose a neuron is picking out some feature shared by cars and cats. Say the neuron represents “sleekness”. Then we’d expect images of other “sleek” things, like a snake or a ferret, to activate the neuron. So if we generate many different images that highly activate the neuron and find that they do contain snakes and ferrets, that’s evidence that the neuron is picking up on a unified concept of sleekness. Researchers have run this kind of experiment on neurons like the one above and found that, no, they just activate on cats and cars, just as the “polysemantic” hypothesis would lead us to expect.
Why do polysemantic neurons form?
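Polysemantic neurons seem to result from a phenomenon known as “superposition”. In superposition, a network represents more features than it has neurons by encoding each feature as a combination of neuron activations, so each neuron ends up participating in several features.
This is possible most of the time because most sets of features have a property called "sparsity". "Sparsity" here means that, although each feature takes a value far from zero for some plausible network input, for most inputs most features are very small or zero. Because only a few features are active at once, features that share neurons rarely interfere with each other.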
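To make the role of sparsity concrete, here is a minimal sketch in Python with NumPy. It is a toy model, not how any real network was trained: the layer width, the number of features, and the use of random directions are all assumptions chosen for illustration. It stores 200 features in 50 neurons and checks how often the active features can be read back when few or many of them are active at once.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 50     # hypothetical layer width
n_features = 200   # more features than neurons

# Assumption: each feature gets a random unit direction in neuron space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)


def encode(feature_values):
    """Superpose features: the activation vector is a weighted sum of directions."""
    return feature_values @ directions


def decode(activations):
    """Score each feature by projecting the activations onto its direction."""
    return activations @ directions.T


def recovery_rate(n_active, trials=200):
    """Fraction of trials where the top-scoring features are exactly the active ones."""
    hits = 0
    for _ in range(trials):
        active = rng.choice(n_features, size=n_active, replace=False)
        x = np.zeros(n_features)
        x[active] = 1.0
        scores = decode(encode(x))
        top = np.argsort(scores)[-n_active:]
        hits += int(set(top.tolist()) == set(active.tolist()))
    return hits / trials


sparse_rate = recovery_rate(n_active=2)    # few features active at once
dense_rate = recovery_rate(n_active=40)    # many features active at once
print(f"recovered when sparse: {sparse_rate:.2f}, when dense: {dense_rate:.2f}")
```

In this toy setup, readout should work far more reliably in the sparse case than in the dense one, which is the sense in which sparsity makes superposition workable.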
In fact, if we only care about packing as many features into n neurons as we can, then using polysemantic neurons lets us pack roughly exp(C * n) features, where C is a constant depending on how much overlap between concepts we allow.[1]
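One way to see why so many features can fit is to look at how much random directions overlap as the number of neurons grows. The sketch below is only an illustration: it assumes features are random unit vectors, which the text does not claim about real networks, and the helper name `max_interference` is made up for this example. It measures the worst-case overlap (largest absolute cosine similarity) among 1000 feature directions at several layer widths.

```python
import numpy as np

rng = np.random.default_rng(0)


def max_interference(n_neurons, n_features):
    """Largest |cosine similarity| between any two of n_features random directions."""
    v = rng.normal(size=(n_features, n_neurons))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)  # ignore each direction's overlap with itself
    return float(np.abs(cos).max())


# Same number of features, increasing number of neurons.
overlaps = {n: max_interference(n, n_features=1000) for n in (16, 64, 256)}
for n, overlap in overlaps.items():
    print(f"n = {n:4d} neurons: worst overlap among 1000 features = {overlap:.2f}")
```

The worst overlap shrinks as n grows even though the number of features stays well above n. Read the other way around, for a fixed tolerated overlap, the number of directions that fit grows very quickly with n.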
What are the consequences of polysemantic neurons arising in networks?
Polysemantic neurons are a major challenge for the “circuits” research agenda for neural network interpretability, because they limit our ability to reason about neural networks. It’s harder to interpret what computation is being done by a circuit made out of neurons if those neurons' activations have multiple meanings. For example, in a circuit with just two polysemantic neurons that each encode five different features, and one weight governing the connection between them, there are effectively 25 different connections between features, all governed by that single weight.
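The count of 25 can be sketched with a deliberately simplified linear model. Everything here is hypothetical: the strengths in `a_write` and `b_read` are made-up numbers, and the nonlinearity a real neuron would apply is ignored. Five upstream features drive neuron A, neuron B feeds five downstream features, and a single weight `w` connects A to B.

```python
import numpy as np

# Hypothetical strengths with which five upstream features drive neuron A.
a_write = np.array([1.0, 0.8, 0.6, 0.9, 0.7])
# Hypothetical strengths with which neuron B feeds five downstream features.
b_read = np.array([0.5, 1.0, 0.7, 0.9, 0.6])
# The single weight connecting neuron A to neuron B.
w = 2.0


def effective_connections(weight):
    """Feature-to-feature connection strengths implied by one neuron-to-neuron weight."""
    return weight * np.outer(a_write, b_read)


conn = effective_connections(w)
print(conn.shape, conn.size)  # a 5 x 5 table of feature-to-feature connections
```

Every entry of `conn` scales with `w`, so changing that one weight changes all 25 feature-to-feature connections together. That coupling is what makes the circuit hard to read at the level of individual neurons.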
However, there has been some recent progress. In 2023, Anthropic claimed to achieve a breakthrough on this problem in their paper “Towards Monosemanticity”. Anthropic trained large "sparse autoencoder" networks on the non-sparse ("dense") activations of other, more polysemantic neural networks, decomposing those activations into sparse activations spread over a larger number of neurons. These sparse activations were (reported to be) more monosemantic, corresponding to more interpretable features.
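The general shape of a sparse autoencoder can be sketched as follows. This is a generic ReLU autoencoder with an L1 sparsity penalty, offered as an assumption about the overall recipe and not as the paper's exact architecture or training setup. The sizes and the coefficient are hypothetical, and the weights are random and untrained, so only the forward pass and the loss are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8       # width of the dense activations being decomposed (hypothetical)
d_hidden = 32     # larger number of candidate features (hypothetical)
l1_coeff = 1e-3   # weight of the sparsity penalty (hypothetical)

# Untrained parameters, randomly initialized for illustration only.
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)


def sae_forward(x):
    """Encode dense activations into nonnegative codes, then reconstruct them."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU keeps codes >= 0
    x_hat = f @ W_dec + b_dec
    return f, x_hat


def sae_loss(x):
    """Reconstruction error plus an L1 penalty that pushes codes toward zero."""
    f, x_hat = sae_forward(x)
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return reconstruction + l1_coeff * sparsity


x = rng.normal(size=(4, d_model))  # stand-in for a batch of recorded activations
f, x_hat = sae_forward(x)
loss = sae_loss(x)
print(f.shape, x_hat.shape, float(loss))
```

Training would minimize this loss over many recorded activations. The hope is that, after training, each of the `d_hidden` code units fires for one interpretable feature.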
Christopher Olah of Anthropic stated he is “now very optimistic [about superposition]”, and would “go as far as saying it’s now primarily an engineering problem — hard, but less fundamental risk.”
Why did we caveat Anthropic’s claims? Because some researchers, like Ryan Greenblatt, are more skeptical about the utility of sparse autoencoders as a solution to polysemanticity.
[1] This estimate is a consequence of the Johnson-Lindenstrauss lemma. As it doesn’t account for using the exponential number of features for useful computations, it is unclear if neural networks actually achieve this bound in practice. (The use of polysemanticity in computations is an active research area. For a model of how polysemanticity aids computations, see “Towards a Mathematical Framework for Computation in Superposition”.) What about lower bounds on how much computation can be done with polysemantic neurons? Well, these estimates depend on assumptions about the training data, number of concepts, initialization of weights, etc. So it is hard to give a good lower bound in general. But for some cases, we do have estimates: e.g., “Incidental polysemanticity” notes that, depending on the ratio of concepts to neurons, the initialization process can lead to a constant fraction of polysemantic neurons.