What is inner alignment?

Inner alignment is the problem of making sure that the goal an AI ends up pursuing is the same as the goal we optimized it for.

Machine learning typically uses an optimization algorithm called stochastic gradient descent (SGD) to find algorithms that perform well according to some objective function. In this framing, SGD is called the base optimizer, and it finds learned algorithms that perform well according to the base objective. A mesa-optimizer is a learned algorithm that is itself an optimizer, and a mesa-objective is the objective that a mesa-optimizer pursues. The inner alignment problem, then, is making sure that if the AI is a mesa-optimizer, its mesa-objective matches the base objective.
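
To make the terminology concrete, here is a minimal sketch of SGD acting as a base optimizer. It is written in Python with NumPy; the dataset, the linear model, and all parameter values are invented for illustration, not taken from any particular system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: inputs X and targets y generated from a hidden linear rule.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def base_objective(w):
    """Mean squared error on the training data -- the base objective."""
    return np.mean((X @ w - y) ** 2)

w = np.zeros(3)   # parameters of the learned algorithm
lr = 0.1          # learning rate

# SGD as the base optimizer: repeatedly nudge the parameters downhill
# on minibatch estimates of the base objective's gradient.
for step in range(500):
    batch = rng.choice(len(y), size=10, replace=False)
    err = X[batch] @ w - y[batch]
    grad = 2 * X[batch].T @ err / len(batch)
    w -= lr * grad

print(f"final base objective: {base_objective(w):.4f}")
```

Note that the training loop only ever touches the base objective's gradients. A simple linear model like this one has no internal goals, but a sufficiently expressive learned algorithm (such as a large neural network) trained by the same procedure could itself implement a search process, and nothing in the loop above directly constrains what objective that internal search pursues.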

As an analogy: natural selection can be seen as an optimization algorithm that 'designed' humans to achieve high genetic fitness, or, roughly, to "have lots of descendants". However, humans do not primarily pursue reproductive success: many use birth control, obtaining the pleasure that natural selection 'intended' as a reward for attempts at reproduction while avoiding the reproduction itself. In this analogy, genetic fitness is the base objective and the goals humans actually pursue are mesa-objectives, so the divergence between them is a failure of inner alignment.

The inner alignment problem can be split into sub-problems such as deceptive alignment, distribution shift, and gradient hacking.


