What is the difference between inner and outer alignment?

The paper Risks from Learned Optimization in Advanced Machine Learning Systems makes the distinction between inner and outer alignment. Outer alignment means making the optimization target of the training process (the “outer optimization target”, e.g., the loss in supervised learning) aligned with what we want. Inner alignment means making the optimization target of the trained system (the “inner optimization target”) aligned with the outer optimization target. A challenge here is that the inner optimization target has no explicit representation in current systems and can differ substantially from the outer optimization target (see, for example, Goal Misgeneralization in Deep Reinforcement Learning).
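To make the distinction concrete, here is a minimal Python sketch of the goal-misgeneralization pattern, loosely inspired by the CoinRun example discussed in that literature: the outer objective rewards reaching a coin, but because the coin always appeared at the right edge during training, the policy that training produced pursues the simpler inner objective "go right". The grid world, policy, and reward function below are hypothetical stand-ins for illustration, not the setup used in the paper.

```python
# Toy illustration of inner vs. outer alignment (hypothetical environment,
# not the actual setup from the goal-misgeneralization paper).

GRID_SIZE = 5  # positions 0..4 on a one-dimensional grid

def outer_reward(agent_pos, coin_pos):
    """Outer optimization target: reward the agent for ending up on the coin."""
    return 1.0 if agent_pos == coin_pos else 0.0

def trained_policy(agent_pos):
    """What training actually produced.

    During training the coin always sat at the rightmost cell, so a policy
    that simply moves right maximizes the outer reward. Its implicit inner
    objective is "go right", not "reach the coin".
    """
    return min(agent_pos + 1, GRID_SIZE - 1)

def rollout(coin_pos, steps=10):
    """Run the trained policy and score the final position with the outer reward."""
    agent_pos = 0
    for _ in range(steps):
        agent_pos = trained_policy(agent_pos)
    return outer_reward(agent_pos, coin_pos)

# In training-like conditions the two objectives coincide...
print("coin at right edge:", rollout(coin_pos=GRID_SIZE - 1))  # 1.0
# ...but off-distribution they come apart: the agent still runs to the right
# edge, ignoring the coin, and the outer reward exposes the misaligned
# inner objective.
print("coin in the middle:", rollout(coin_pos=2))               # 0.0
```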

See also this post for an intuitive explanation of inner and outer alignment.
