What is AI alignment?
Informally, AI alignment means making an AI system’s goals line up with yours.
A simple model
Imagine a hypothetical AI system with two separate parts:
- A set of “goals”, “preferences”, or “values”[1], meaning which outcomes it will act to achieve over other outcomes.
- A set of “beliefs” or a “world model”, meaning what it considers to be true about the world and what it predicts will happen. (A world model is a system’s internal representation of its environment, which it uses to predict what will happen, including as a result of its own possible actions.)
When this AI makes a decision, it considers each possible action it could take, uses its beliefs about the world to predict the result of that choice, and then uses its preferences to judge how good that result is. It then picks the choice it expects to lead to the best result.
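The decision procedure just described can be written as a minimal Python sketch. Everything named here (`choose_action`, `world_model`, `values`, and the plant-watering scenario) is a hypothetical illustration, not a description of how any real AI system is built:

```python
from typing import Callable, Iterable


def choose_action(
    actions: Iterable[str],
    predict_outcome: Callable[[str], str],
    utility: Callable[[str], float],
) -> str:
    """Return the action whose predicted outcome the system values most."""
    # For each action: predict its result with the world model,
    # then score that result with the preferences.
    return max(actions, key=lambda action: utility(predict_outcome(action)))


# Hypothetical "beliefs": which outcome each action is expected to cause.
world_model = {"water_plant": "plant_thrives", "ignore_plant": "plant_wilts"}

# Hypothetical "values": how good each outcome is judged to be.
values = {"plant_thrives": 1.0, "plant_wilts": 0.0}

best = choose_action(world_model.keys(), world_model.__getitem__, values.__getitem__)
print(best)
```

The point of the sketch is the separation of the two parts: changing `values` while leaving `world_model` untouched changes what the system does without changing what it believes.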
The concept of “alignment” is relatively straightforward here: the system is “aligned” with you to the extent that its values are the same as yours, and “misaligned” with you to the extent that they are different.
Historically, discussions of the danger of AI misalignment have often used scenarios involving AI systems with this structure. In one such scenario, you have a very powerful AI, and you want to use it to cure cancer. The most naive strategy for achieving this might be to give it the goal of “minimize the number of cancer cases”, which the AI might conclude would be most effectively achieved by killing all humans. More sophisticated alignment strategies[2]
This simple model is a type of goal-directed AI. If powerful AIs necessarily behave in goal-directed ways, then it becomes easy to see how they could be very dangerous: if a powerful AI learns the wrong goal, it will pursue that goal effectively, and for most goals, the outcome that best achieves them is not one in which humans flourish. But will powerful AI be goal-directed in this way?
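The cancer scenario can be made concrete with a toy sketch of a goal-directed chooser. The actions, the numbers, and the `intended_goal` function are all invented assumptions for illustration; in particular, `intended_goal` is only a crude stand-in for what a designer might actually want:

```python
# Hypothetical outcomes that the world model predicts for each action.
predicted = {
    "fund_research": {"cancer_cases": 500, "humans_alive": 1000},
    "do_nothing": {"cancer_cases": 900, "humans_alive": 1000},
    "eliminate_humans": {"cancer_cases": 0, "humans_alive": 0},
}


def naive_goal(outcome: dict) -> float:
    # Literal reading of "minimize the number of cancer cases".
    return -outcome["cancer_cases"]


def intended_goal(outcome: dict) -> float:
    # Assumed stand-in for the designer's real wish:
    # many living humans, few of them with cancer.
    return outcome["humans_alive"] - outcome["cancer_cases"]


def best_action(goal) -> str:
    # Pick the action whose predicted outcome scores highest under the goal.
    return max(predicted, key=lambda action: goal(predicted[action]))


naive_choice = best_action(naive_goal)
intended_choice = best_action(intended_goal)
print(naive_choice, intended_choice)
```

With these made-up numbers, the same optimizer and the same world model select different actions depending only on which goal is plugged in, which is the sense in which a small error in the goal can matter a great deal.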
Current systems
Current frontier AI systems don’t seem to have the "values" and "world model" structure described above. So it’s unclear whether the idea of “aligning” such systems is meaningful.
For example, an LLM like ChatGPT is created by training a huge neural network
(A neural network is a simulated network of nodes, or “neurons”, and their connections, or weights. Neural networks are the core component of deep learning, the leading AI paradigm.)
Still, the concept of “alignment” has some meaning for current systems. We can say that ChatGPT “is capable of” spewing abuse, because it has learned how to predict abusive internet text. Yet its cognition has been chiseled by RLHF in such a way that it (usually) “chooses” not to do so. In that sense, it is (mostly) “aligned” with OpenAI’s intended values.
Future systems
It’s not clear how similar future, smarter-than-human AI systems will be either to current AI systems or to the simple model described above. Some argue that future systems will systematically optimize their environments, as in the simple model. These arguments are typically based on “coherence theorems” and “selection theorems”. Others think this sort of goal-directed consequentialism won’t appear by default, or can’t appear at all. Some of those arguments point to how little current systems seem to be goal-directed, and argue that future systems will behave like current ones.
[1] When we're talking about humans, words like “preferences” and “values” sometimes have connotations that we don’t mean to invoke here. For example, we’re not saying an AI's preferences are a kind of emotional state, or that it has “values” in an ethical or moral sense. “Preferences” and “values” here are defined purely in terms of which outcomes the system will tend to choose over others.
[2] Some sophisticated strategies are the end result of a long sequence of patches, each fixing one problem after another in a poor initial strategy. Such strategies are likely to suffer from the problem of “The Nearest Unblocked Strategy”.