What is outer alignment?
Outer alignment, also known as the “reward misspecification problem”, is the problem of giving an AI the right optimization objective in the first place, i.e., “Did we tell the AI the correct thing to do?”. It is distinct from the inner alignment problem: whether the trained AI in fact ends up pursuing the objective we specified, as opposed to some other objective.
Outer alignment is a hard problem. It has even been argued that conveying the full “intention” behind a human request would require conveying all human values, which are themselves not well understood. Additionally, because most models are trained to optimize some measurable objective, they are susceptible to Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Even a goal that looks good on paper can cause unforeseen harms under strong enough optimization, because the AI optimizes the proxy we specified rather than the outcome we intended.
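As a minimal illustrative sketch of this proxy-versus-intent gap (the action names and reward numbers below are invented for the example, not drawn from any specific system): an agent that maximizes a mis-specified proxy can pick an action that scores well on the proxy while achieving none of the true objective.

```python
# Toy example of reward misspecification / Goodhart's Law.
# "True" reward: what we actually want (a clean room).
# "Proxy" reward: what we specified ("make the dust sensor read zero").

ACTIONS = ["clean_room", "cover_dust_sensor", "do_nothing"]

# What we actually want the agent to achieve.
TRUE_REWARD = {"clean_room": 1.0, "cover_dust_sensor": 0.0, "do_nothing": 0.0}

# What we actually told the agent to optimize: covering the sensor
# zeroes the reading even more reliably than cleaning does.
PROXY_REWARD = {"clean_room": 1.0, "cover_dust_sensor": 1.5, "do_nothing": 0.0}

best_for_proxy = max(ACTIONS, key=PROXY_REWARD.get)
best_for_us = max(ACTIONS, key=TRUE_REWARD.get)

print("Agent optimizing the proxy picks:", best_for_proxy)   # cover_dust_sensor
print("The action we actually wanted:   ", best_for_us)      # clean_room
print("True reward obtained:", TRUE_REWARD[best_for_proxy])  # 0.0
```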
Some sub-problems of outer alignment include specification gaming, value learning, and reward shaping/modeling. Paul Christiano, a researcher who focuses on outer alignment, has proposed approaches such as HCH and iterated distillation and amplification. Other proposed approaches aim to approximate human values through imitation learning and learning from human feedback.
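To make the reward-modeling idea concrete, here is a small sketch of learning a reward model from pairwise human preferences using a Bradley-Terry-style logistic loss. The setup is synthetic and assumed for illustration only (toy “trajectories” of random numbers and a stand-in “human” who prefers higher-mean trajectories); it is not a description of any particular published system.

```python
# Sketch: fit a linear reward model to pairwise preference data.
import math
import random

random.seed(0)

def features(trajectory):
    # Stand-in feature extractor: a trajectory is just a list of floats.
    return sum(trajectory) / len(trajectory)

def reward_model(trajectory, w):
    # Linear reward model: r(traj) = w * feature(traj).
    return w * features(trajectory)

# Synthetic "human" preferences: higher-mean trajectories are preferred.
def human_prefers_a(traj_a, traj_b):
    return features(traj_a) > features(traj_b)

# Generate pairwise comparison data.
data = []
for _ in range(200):
    a = [random.uniform(-1, 1) for _ in range(5)]
    b = [random.uniform(-1, 1) for _ in range(5)]
    data.append((a, b, human_prefers_a(a, b)))

# Fit w by gradient descent on the Bradley-Terry log-likelihood:
# P(a preferred over b) = sigmoid(r(a) - r(b)).
w, lr = 0.0, 0.5
for _ in range(100):
    grad = 0.0
    for a, b, a_preferred in data:
        diff = reward_model(a, w) - reward_model(b, w)
        p_a = 1.0 / (1.0 + math.exp(-diff))
        label = 1.0 if a_preferred else 0.0
        # Gradient of the negative log-likelihood for this pair.
        grad += (p_a - label) * (features(a) - features(b))
    w -= lr * grad / len(data)

print(f"Learned reward weight: {w:.2f} (positive => higher mean is rewarded)")
```

The learned reward model can then be used as the training objective for an agent, which is the basic pattern behind reinforcement learning from human feedback; the outer alignment question is whether such a learned objective actually captures what the humans intended.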