What is the "sharp left turn"?

The sharp left turn (SLT) threat model is based on the argument that an AI’s capabilities generalize further than its alignment. In other words, if an AI went through an SLT, its capabilities would quickly generalize outside the training distribution, but its alignment would not keep up, resulting in a powerful misaligned model. This threat model consists of three main claims:

  1. Capabilities will generalize far (i.e., to many domains)

An advance in capabilities could generalize to multiple domains, possibly at the same time or during a discrete phase transition.

  2. Alignment techniques that worked previously will fail during this transition

The increase in capabilities and generalization would arise from emergent properties that are qualitatively different from what the model relied on previously. As a result, alignment techniques that worked on earlier versions of the AI would not work on the new version.

  3. Humans won’t be able to intervene to prevent or align this transition

The transition would happen too quickly for humans to notice it or to develop new alignment techniques in time.

If these claims are correct, we would likely end up with a misaligned AI after the SLT if we rely on alignment techniques that do not account for the possibility of an SLT. This could be avoided if we manage to build a goal-aligned AI before the SLT occurs; that is, an AI with beneficial goals and situational awareness. Such an AI would try to preserve its beneficial goals by developing new techniques to align itself as it goes through the SLT, giving rise to an aligned model post-SLT.