What is inner alignment?
Inner alignment is the problem of making sure that the goal an AI ends up pursuing is the same as the goal we optimized it for.
Machine learning uses an optimization algorithm: a general procedure for finding solutions that score highly according to some well-defined objective function. In this setup, the base optimizer is the “outer” optimizer, usually explicitly implemented by humans, and the base objective is the “outer” objective it optimizes for. The solutions the base optimizer finds to the problem it has been given are learned algorithms. Sometimes a learned algorithm is itself an optimizer: a mesa-optimizer, which pursues its own mesa-objective. Inner alignment fails when the mesa-objective comes apart from the base objective.
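To make the distinction concrete, here is a minimal sketch in Python (a hypothetical toy, not from the source; the maze setup and policy names are illustrative assumptions). A base optimizer picks whichever candidate policy scores best on the base objective, “reach the maze exit”, over a training distribution where the exit always happens to be green. A proxy-pursuing policy, standing in for a mesa-optimizer whose mesa-objective is “go to the green thing”, ties with the aligned policy in training and so can be the one selected, then fails once the green/exit correlation breaks:

```python
# Hypothetical toy, not from the source: names and setup are assumptions.

# Training distribution: the exit is always green during training.
TRAIN_MAZES = [{"exit": "green"}] * 100
# Deployment: the correlation breaks and the exit is red.
DEPLOY_MAZE = {"exit": "red"}

def base_objective(policy, maze):
    """Base objective: reward 1.0 for reaching the true exit."""
    return 1.0 if policy(maze) == maze["exit"] else 0.0

def go_to_exit(maze):
    """Inner-aligned policy: pursues the base objective directly."""
    return maze["exit"]

def go_to_green(maze):
    """Proxy policy, standing in for a mesa-optimizer whose
    mesa-objective ("go to the green thing") merely correlates
    with the base objective on the training distribution."""
    return "green"

def base_optimizer(candidates, mazes):
    """Return the candidate with the highest total training score.
    Both candidates tie here, and max() keeps the first maximal
    element, so the proxy-pursuer can be the one selected."""
    return max(candidates, key=lambda p: sum(base_objective(p, m) for m in mazes))

learned = base_optimizer([go_to_green, go_to_exit], TRAIN_MAZES)
print(sum(base_objective(learned, m) for m in TRAIN_MAZES))  # 100.0: perfect in training
print(base_objective(learned, DEPLOY_MAZE))                  # 0.0: fails off-distribution
```

The point of the sketch is that the base optimizer sees only training scores: two policies with different objectives are indistinguishable to it so long as those objectives coincide on the training distribution.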
As an analogy: natural selection can be seen as an optimization algorithm that ‘designed’ humans to achieve the goal of high genetic fitness, or, roughly, “have lots of descendants”. However, humans no longer primarily pursue reproductive success; they instead use birth control while still attaining the pleasure that natural selection ‘meant’ as a reward for attempts at reproduction. In the terminology above, natural selection is the base optimizer and humans are mesa-optimizers whose mesa-objectives have come apart from the base objective: a failure of inner alignment.
The inner alignment problem can be split into sub-problems such as deceptive alignment, distributional shift, and gradient hacking.