What is AI alignment?

Intuitively, AI alignment means making an AI system’s goals line up with yours.

Toy model

It’s easiest to understand what alignment means if you imagine a hypothetical AI system that has two cleanly separated components: a “goal slot” that contains some “values”, and a “world model”. When making a decision, the system goes through each possible choice and uses its world model to predict what would happen if it made that choice, then uses its values to judge how good that outcome is. It then picks the choice expected to lead to the best outcome. Such a system is “aligned” with you to the extent that the values in its goal slot are the same as your values, and “misaligned” with you to the extent that they are different.
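To make the toy model concrete, here is a minimal sketch of such an agent in Python. Everything here is an illustrative assumption rather than a description of any real system: the function names (world_model, values, decide) and the toy “outcomes” table are just stand-ins for the two components and the decision loop described above.

```python
# Minimal sketch of the toy "goal slot + world model" agent.
# The domain and numbers are made up for illustration.

def world_model(state, choice):
    """Predict the outcome of taking `choice` in `state` (here: a trivial lookup)."""
    return state["outcomes"][choice]

def values(outcome):
    """The 'goal slot': score how good a predicted outcome is (higher is better)."""
    return outcome["utility"]

def decide(state, choices):
    """Pick the choice whose predicted outcome the values rate highest."""
    return max(choices, key=lambda c: values(world_model(state, c)))

# The agent is "aligned" with you only insofar as `values` matches your values.
state = {"outcomes": {
    "option_a": {"utility": 3},
    "option_b": {"utility": 7},
}}
print(decide(state, ["option_a", "option_b"]))  # -> "option_b"
```

Swapping in a different values function changes what the same world model and decision loop will pursue, which is why, in this picture, alignment is entirely a question of what sits in the goal slot.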

Historically, discussions of the dangers of AI misalignment have often used scenarios involving AI systems with this structure. For example, suppose you have a very powerful AI, and you want to use it to cure cancer. The most naive strategy might be to give it a value of “choose whatever possible future has the lowest number of cancer cases” — which the AI might conclude would be most effectively served by killing all humans. More sophisticated strategies (with their own problems) could involve coding in more complex specifications of human values, or having the AI learn values over time.
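As a cartoon of how the naive objective goes wrong, here is a tiny sketch in the same style as the agent above. The actions and numbers are entirely made up; the point is only that an action which removes all humans also removes all cancer cases, so it scores best under the literal objective.

```python
# Toy illustration (hypothetical numbers) of a badly specified objective:
# "choose whatever future has the lowest number of cancer cases."

predicted_cancer_cases = {      # the agent's predictions for each action
    "fund_research": 15_000_000,
    "improve_screening": 12_000_000,
    "eliminate_all_humans": 0,  # no humans, no cancer
}

def naive_value(action):
    # The literal objective: fewer predicted cancer cases is better.
    return -predicted_cancer_cases[action]

best_action = max(predicted_cancer_cases, key=naive_value)
print(best_action)  # -> "eliminate_all_humans"
```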

Current systems

Current frontier AI systems don’t have this structure, and the meaning of “aligning” such systems is less straightforward. For example, ChatGPT was created by first training a huge network to predict human text as accurately as possible, then adjusting it to favor text completions that human evaluators rated highly. We don’t know how the resulting system works, but we have no reason to think it’s trying to predict the consequences of its decisions and check them against some set of values, like the toy system described above; it seems more like it has learned a huge number of individual behaviors.
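A highly simplified sketch of those two training stages, reduced to re-ranking a few candidate replies, is below. Real systems adjust the network’s weights with reinforcement learning rather than re-ranking fixed candidates, and the completions and numbers here are invented for illustration.

```python
# Cartoon of the two stages: (1) predict human text, (2) favor completions
# that human evaluators rate highly. All values are made up.

candidates = {
    # completion: (probability under the text-prediction model, human rating 0-1)
    "abusive reply":   (0.30, 0.05),
    "unhelpful reply": (0.50, 0.40),
    "helpful reply":   (0.20, 0.95),
}

# Stage 1 alone would favor whatever kind of text is most common online.
stage1_choice = max(candidates, key=lambda c: candidates[c][0])

# Stage 2 shifts behavior toward completions that human evaluators rate highly.
stage2_choice = max(candidates, key=lambda c: candidates[c][1])

print(stage1_choice, "->", stage2_choice)  # unhelpful reply -> helpful reply
```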

Still, the concept of “alignment” isn’t totally meaningless here: we can say that ChatGPT “is capable of” spewing abuse, because it has learned to predict abusive internet text, but that it usually “chooses” not to as a result of reinforcement learning, and is in that sense (usually) “aligned” with OpenAI’s intended values.

Future systems

It’s not clear how similar future AI systems will be to current ones, or to the toy model above. Some argue that future systems will face pressures to optimize the world more systematically, but there’s a lot of disagreement and conceptual subtlety around questions like this.