What is feature visualization?
A feature is a part of a neural network that responds to a useful pattern in the input. For example, in an image classifier, a set of neurons that detects cars might be a feature.
For example, after training an image classifier to classify animals, we can ask the network to produce the picture that it would consider, with the highest possible probability, to be "dog-ish". This picture can then be considered a visualization of the network's dog-detecting "feature".1
(Image source: distill.pub.)
Feature visualization attempts to create images that represent what any given part of a neural network is "looking for". Here, a neural network is a simulated network of nodes ("neurons") and their connections (weights); neural networks are the core component of deep learning, the leading AI paradigm.
Interpretability tools like feature visualization are useful for alignment research because they allow us, in some sense, to see a neural network's "internal thoughts". The hope is that many interpretability tools, taken together, will shed light on the internal algorithm that a neural network has implemented.
Now that we understand the intuitions behind the concept, we can dive a little deeper into the technical details. Feature visualization is mainly implemented using a technique called visualization by optimization, which lets us create an image of what the network "imagines" a particular feature to look like. It works as follows:
Visualization by optimization: Regular optimization, as used in training the image classifier, involves updating the network's weights (the parameters of the network, which are tuned during training and are mostly sufficient to implement the AI model) so that its outputs better match the training data. Visualization by optimization turns this around: the weights are held fixed, and we instead update the input itself so as to maximize the activation of whichever part of the network we want to visualize.
Let's walk through an example of visualization by optimization. We begin with a trained image classifier network. We input an image made from random pixels. We then progressively alter the pixel values (using gradient descent) to increase the network's prediction of the class "dog", until the network is maximally sure that the set of pixels it sees depicts a dog. The resulting image is, in a sense, the "doggiest" picture that the network can conceive of, and therefore gives us a good understanding of what the network is "looking for" when determining to what extent an image depicts a dog.
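To make this concrete, here is a minimal sketch of that loop, assuming PyTorch and a pretrained torchvision classifier; the model choice, ImageNet class index, learning rate, and step count are illustrative assumptions, not taken from the sources above. Practical implementations also add input normalization and regularization tricks (jitter, blurring, frequency penalties) to get recognizable images; those are omitted here.

```python
# Minimal sketch of visualization by optimization for a class label,
# assuming PyTorch + torchvision. The weights stay frozen; only the
# input pixels are updated by gradient ascent on the "dog" logit.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)          # never update the network's weights

target_class = 207                   # an ImageNet dog class (illustrative choice)
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from random pixels
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(image)
    loss = -logits[0, target_class]  # minimizing -logit == maximizing the dog logit
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)       # keep pixel values in a valid range

# `image` now approximates the "doggiest" picture this network can conceive of.
```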
So far, our examples have only optimized with respect to a single class label. However, instead of optimizing for the activation of a class label, we can also optimize for the activation of individual neurons, convolutional channels, entire layers, and so on (see the sketch below). Combining these views helps to isolate the causes of a model's classifications from mere correlations in the input data, and also lets us visualize how the network's understanding of a feature evolves over the course of training.
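For instance, targeting a convolutional channel instead of a class logit only changes the objective. A hedged sketch, assuming the same PyTorch setup as above; the layer and channel indices are arbitrary examples:

```python
# Sketch: visualize a single convolutional channel rather than a class logit.
# A forward hook captures the intermediate activation we want to maximize.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

acts = {}
hook = model.layer3.register_forward_hook(lambda mod, inp, out: acts.update(target=out))

channel = 42                          # arbitrary channel to visualize
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    model(image)                      # forward pass fills acts["target"]
    loss = -acts["target"][0, channel].mean()   # maximize this channel's mean activation
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)

hook.remove()
```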
This also opens up some exciting opportunities for feature generation. Some researchers have experimented with interpolating between neurons: for example, adding the objective for a "black and white" neuron to that of a "mosaic" neuron yields a black and white version of the mosaic (sketched below). This is reminiscent of the semantic arithmetic of word embeddings seen in Word2Vec.
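A minimal sketch of this kind of objective arithmetic, again assuming the PyTorch setup above; the two channel indices and the equal weighting are purely illustrative stand-ins for a "black and white" unit and a "mosaic" unit:

```python
# Sketch of feature arithmetic: optimize one image for the sum of two
# channel objectives, so the result excites both units at once.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

acts = {}
hook = model.layer3.register_forward_hook(lambda mod, inp, out: acts.update(target=out))

chan_a, chan_b = 17, 42               # stand-ins for the two units being combined
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    model(image)
    combined = acts["target"][0, chan_a].mean() + acts["target"][0, chan_b].mean()
    (-combined).backward()            # ascend the summed objective
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)

hook.remove()
```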
Feature visualizations are yet another tool in the toolkit of interpretability researchers who are seeking to move our understanding of neural networks beyond black boxes.
The word "feature" is used in at least four different ways in ML: 1) a property of the input data; 2) whatever concept the network dedicates one entire neuron to representing; 3) an arbitrary function of neuron activations; 4) an arbitrary function of neuron activations that yields a human-understandable representation. This article uses the word in the 4th sense. Papers in interpretability research often use 2), 3), and 4) as a combined working definition.