What is outer alignment?

Outer alignment, also known as the “reward misspecification problem”, is the problem of defining the right optimization objective to train an AI on, i.e., “Did we tell the AI the correct thing to do?” This is distinct from inner alignment, the problem of ensuring that an AI in fact ends up trying to accomplish the objective we specified (as opposed to some other objective).

Outer alignment is a hard problem. It has been argued that conveying the full “intention” behind a human request would require conveying all human values, which are themselves not well understood. Additionally, since most AI systems are trained to optimize an explicitly specified objective, they are susceptible to Goodhart’s Law: even if we give the AI a goal that looks good, optimizing hard for that goal can still cause unforeseen harms, because the specified objective is only a proxy for what we actually want.
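To make this concrete, here is a toy sketch in Python of how optimizing a misspecified proxy can come apart from the true objective. The “cleaning robot” scenario and all the numbers are hypothetical illustrations, not taken from any particular system:

```python
def true_utility(dust_removed: float, vases_broken: int) -> float:
    """What we actually want: a clean room and no broken vases."""
    return dust_removed - 10.0 * vases_broken

def proxy_reward(dust_removed: float, vases_broken: int) -> float:
    """What we actually specified: count only the dust removed."""
    return dust_removed

# Two candidate policies, summarized by their expected outcomes.
policies = {
    "careful cleaner": {"dust_removed": 8.0, "vases_broken": 0},
    "reckless cleaner": {"dust_removed": 10.0, "vases_broken": 3},
}

# An optimizer that picks whichever policy scores highest on the *proxy*
# prefers the reckless cleaner, even though its true utility is far lower.
best = max(policies, key=lambda name: proxy_reward(**policies[name]))
for name, outcome in policies.items():
    print(f"{name}: proxy={proxy_reward(**outcome)}, true={true_utility(**outcome)}")
print("Proxy optimizer chooses:", best)
```

The proxy agrees with the true objective on ordinary behavior but diverges exactly where optimization pressure pushes hardest, which is the pattern Goodhart’s Law describes.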

Sub-problems of outer alignment include specification gaming, value learning, and reward shaping/modeling. Paul Christiano has proposed approaches such as HCH and Iterated Distillation and Amplification. Other proposed solutions aim to approximate human values using imitation learning and human feedback techniques, such as learning a reward model from human preferences.
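As a rough illustration of the feedback-learning family of approaches, the sketch below fits a simple linear reward model to pairwise human preferences using a Bradley–Terry-style likelihood. The features, preference data, and training loop are hypothetical and heavily simplified; this is not a description of any particular proposal:

```python
import math

def features(traj):
    """Assume each trajectory is already summarized as a feature vector."""
    return traj

def reward(weights, traj):
    """Linear reward model: r(traj) = w · phi(traj)."""
    return sum(w * f for w, f in zip(weights, features(traj)))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit weights so that human-preferred trajectories score higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            # Bradley-Terry model: P(preferred > rejected) = sigmoid(r_pref - r_rej)
            diff = reward(w, preferred) - reward(w, rejected)
            p = 1.0 / (1.0 + math.exp(-diff))
            # Gradient ascent on the log-likelihood of the human's choice.
            for i in range(dim):
                grad = (1.0 - p) * (features(preferred)[i] - features(rejected)[i])
                w[i] += lr * grad
    return w

# Hypothetical preference data: humans prefer trajectories with more of
# feature 0 ("task done") and less of feature 1 ("side effects").
comparisons = [
    ([1.0, 0.0], [0.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.0, 0.0], [0.0, 1.0]),
]
w = train_reward_model(comparisons, dim=2)
print("Learned reward weights:", w)  # roughly positive on feature 0, negative on feature 1
```

The learned reward model then stands in for a hand-written objective, which shifts the outer-alignment question to whether the preference data and model actually capture what the humans value.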