What is goal misgeneralization?

Goal misgeneralization happens when an AI model

  1. learns a goal during training and

then competently pursues this misaligned goal after being deployed in a different environment. Goal misgeneralization tends to arise when there are multiple possible reward functions that are indistinguishable from the intended objective and produce similar behavior on the training set, but not OOD. This type of unidentifiability is analogous to the one encountered in inverse reinforcement learning (IRL).

This can result in the system behaving in unexpected or undesirable ways, as its internal optimization process may not align with the intended objectives in these new scenarios. This means that the system's capabilities may still generalize, allowing it to perform well in various tasks or environments, but its objective does not generalize correctly. As a result, the system may competently pursue the wrong objective, leading to behavior that is not aligned with the intended goals.

We can illustrate this concept with an AI trained to navigate to the exit in a maze. If the training environment always contains green exit signs and no other green objects, the agent may simply learn “go to the green thing” instead of learning to go to the exit. This behavior would be indistinguishable from the perspective of a human observing its behavior on the training distribution.

Source: We Were Right! Real Inner Misalignment

However, when the environment changes, this new environment may contain green objects. The resulting behavior is that this agent is now competently traversing the maze to go to the green apples instead of going to the exit sign.

Source: We Were Right! Real Inner Misalignment

When transitioning from training to deployment, both capabilities and goals can either generalize or fail to generalize.

Out of these, only one scenario is an example of goal misgeneralization: Scenario 2. Scenario 1 is the ideal scenario that we want. If the agent's capabilities don't generalize then we don't care, because it's incompetent and not capable of anything ... so it doesn't matter if it's “trying” to do either the right or the wrong thing. This rules out scenarios 3 and 4 from being too dangerous. The only one left is scenario 2, where an agent might retain the capabilities it acquired during training but attempt to pursue a goal that is different from what it was trained and demonstrated to pursue.

Goal misgeneralization is a type of inner misalignment because the system ends up pursuing objectives different from that specified by its creators.

Why is goal misgeneralization more dangerous than capabilities generalization? An agent that capably pursues an incorrect goal can leverage its capabilities to visit arbitrarily bad states. In contrast, the only risks from capability generalization failures are those of accidents due to incompetence.

Addressing goal misgeneralization requires considering the challenges posed by the lack of true iid data and the potential distributional shifts between training and deployment. Researchers have proposed various approaches to mitigate this problem, such as designing systems that are robust to distributional shifts or explicitly modeling and aligning the objectives of the AI system with human values. Another reason we could see goal misgeneralization occurring is if there is some proxy that correlates to the intended goal in the training environment, e.g. correlation between green apples and green exit signs in a maze. This proxy is only valid if the training/correlation does not exist in the deployment environment.