What is inner alignment?
Inner alignment is the problem of making sure that the goal an AI ends up pursuing is the same as the goal we optimized it for.
Machine learning uses an optimization algorithm: a general procedure for finding solutions that score highly according to some well-defined objective function. In this setup, the base optimizer is the “outer” optimizer, usually explicitly implemented by humans, and the base objective is the “outer” objective it optimizes for. The solutions the base optimizer finds to the problem it has been given are learned algorithms. Sometimes a learned algorithm is itself an optimizer: a mesa-optimizer, which pursues its own mesa-objective. Inner alignment fails when the mesa-objective comes apart from the base objective.
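To make the distinction concrete, here is a minimal sketch in Python (a hypothetical toy, not from the source; the maze setup and policy names are illustrative assumptions). A base optimizer picks whichever candidate policy scores best on the base objective, “reach the maze exit”, over a training distribution where the exit always happens to be green. A proxy-pursuing policy, standing in for a mesa-optimizer whose mesa-objective is “go to the green thing”, ties with the aligned policy in training and so can be the one selected, then fails once the green/exit correlation breaks:

```python
# Hypothetical toy, not from the source: names and setup are assumptions.

# Training distribution: the exit is always green during training.
TRAIN_MAZES = [{"exit": "green"}] * 100
# Deployment: the correlation breaks and the exit is red.
DEPLOY_MAZE = {"exit": "red"}

def base_objective(policy, maze):
    """Base objective: reward 1.0 for reaching the true exit."""
    return 1.0 if policy(maze) == maze["exit"] else 0.0

def go_to_exit(maze):
    """Inner-aligned policy: pursues the base objective directly."""
    return maze["exit"]

def go_to_green(maze):
    """Proxy policy, standing in for a mesa-optimizer whose
    mesa-objective ("go to the green thing") merely correlates
    with the base objective on the training distribution."""
    return "green"

def base_optimizer(candidates, mazes):
    """Return the candidate with the highest total training score.
    Both candidates tie here, and max() keeps the first maximal
    element, so the proxy-pursuer can be the one selected."""
    return max(candidates, key=lambda p: sum(base_objective(p, m) for m in mazes))

learned = base_optimizer([go_to_green, go_to_exit], TRAIN_MAZES)
print(sum(base_objective(learned, m) for m in TRAIN_MAZES))  # 100.0: perfect in training
print(base_objective(learned, DEPLOY_MAZE))                  # 0.0: fails off-distribution
```

The point of the sketch is that the base optimizer sees only training scores: two policies with different objectives are indistinguishable to it so long as those objectives coincide on the training distribution.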
As an analogy: natural selection can be seen as an optimization algorithm that ‘designed’ humans to achieve the goal of high genetic fitness, or, roughly, “have lots of descendants”. However, humans no longer primarily pursue reproductive success; they instead use birth control while still attaining the pleasure that natural selection ‘meant’ as a reward for attempts at reproduction. In the terminology above, natural selection is the base optimizer and humans are mesa-optimizers whose mesa-objectives have come apart from the base objective: a failure of inner alignment.
The inner alignment problem can be split into sub-problems such as deceptive alignment, distributional shift, and gradient hacking.