What is the difference between goal misgeneralization and capabilities misgeneralization?

Misgeneralization refers to the phenomenon where a model performs well during training but poorly at test time. At a high level, this can happen because the model’s internal representations are too specific to the training dataset or environment, and so fail to generalize under the distribution shift between training and testing.

Misgeneralization failures fall into two distinct categories. Capabilities misgeneralization occurs when the model is not capable of pursuing the intended goal at test time; this includes scenarios where the model appears to “break” or behave randomly in the test environment. Goal misgeneralization occurs when the model remains capable of pursuing the intended goal, but instead pursues a different goal.

Importantly, this distinction is made purely by evaluating the model’s behavior. For a failure to count as goal misgeneralization, the model’s behavior must exhibit capabilities that could be used to pursue the intended goal, while some alternative goal better explains the behavior actually exhibited.

As a concrete example, consider the CoinRun agent studied by Langosco et al. (2022) and discussed by Shah et al. During training, the coin that provides reward always sits at the right end of the level, so an agent pursuing “reach the coin” and an agent pursuing “move right” behave identically. When the coin is moved elsewhere at test time, the agent still competently navigates obstacles all the way to the right end of the level and ignores the coin: its capabilities are intact, but it is pursuing the wrong goal.
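The behavioral distinction between the two failure modes can be illustrated with a minimal, hypothetical sketch. This is a 1-D corridor invented for illustration, not code from the paper; `run_episode`, `always_right`, and the coin positions are all assumptions:

```python
# Toy 1-D corridor, hypothetical illustration.
# The agent starts in the middle; success means reaching the coin.

def run_episode(policy, coin_pos, start=5, length=11, max_steps=20):
    """Run one episode; return True if the agent reaches the coin."""
    pos = start
    for _ in range(max_steps):
        pos = max(0, min(length - 1, pos + policy(pos)))
        if pos == coin_pos:
            return True
    return False

# A policy that learned the proxy goal "always move right" -- it matches
# the intended goal during training, where the coin is always on the right.
always_right = lambda pos: +1

# Training distribution: coin at the right end.
print(run_episode(always_right, coin_pos=10))  # True: looks aligned

# Test distribution: coin moved to the left end.
print(run_episode(always_right, coin_pos=0))   # False: capably pursues the wrong goal
```

The test-time failure here is goal misgeneralization rather than a capability failure: the policy still navigates the corridor competently, but its competent behavior is better explained by the goal “move right” than by “reach the coin.”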

The following table contains more examples of goal misgeneralization:

Source - Shah et al., Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals (2022)

Goal misgeneralization can occur when multiple functions the model could learn perform comparably well during training but behave differently at test time. When this happens, which function is learned depends on the inductive biases of the training algorithm and on random effects, such as how the model’s parameters were initialized.
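One way to see this mechanism is a toy regression sketch (hypothetical; all names are invented for illustration, assuming plain NumPy gradient descent rather than any setup from the paper). Two features are perfectly correlated in training, so many weight vectors fit the training data equally well, and which one gradient descent finds depends on the random initialization:

```python
import numpy as np

# Training inputs where the two features are perfectly correlated (x2 == x1),
# so w = (1, 0), (0, 1), (0.5, 0.5), ... all fit y = x1 exactly.
X_train = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y_train = X_train[:, 0]  # intended target depends only on the first feature

def fit(w_init, lr=0.01, steps=5000):
    """Plain gradient descent on mean squared error from a given initialization."""
    w = w_init.astype(float).copy()
    for _ in range(steps):
        grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
w_a = fit(rng.normal(size=2))
w_b = fit(rng.normal(size=2))

# Both learned functions fit the training data (near-)perfectly...
print(np.allclose(X_train @ w_a, y_train, atol=1e-3))  # True
print(np.allclose(X_train @ w_b, y_train, atol=1e-3))  # True

# ...but they disagree once the features decorrelate at test time.
x_test = np.array([1.0, 0.0])
print(x_test @ w_a, x_test @ w_b)  # different predictions
```

Gradient descent never changes the weight component along directions the training data cannot distinguish, so each run converges to a different zero-training-error solution determined by its initialization; the disagreement only surfaces under the train-to-test distribution shift.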