6: AI’s goals may not match ours
Making an AI's goals match our intentions is called the alignment problem.
There’s a lot of ambiguity in the term “alignment.” For example, when people talk about “AI alignment” in the context of present-day AI systems, they generally mean enforcing observable behaviors: help the user write code, refuse to use ethnic slurs, don’t explain how to dispose of a corpse, and so on. Users sometimes circumvent such restrictions with “jailbreaks,” and even when they’re not being “tricked,” LLMs still don’t always act as intended. But most of the time, companies do manage to avoid AI outputs that could harm people or threaten their brand reputation.
But alignment in future, smarter-than-human agents is a different question. Keeping such systems safe in extreme cases will take more than superficial fixes. If they become so smart that we can’t check their work, and perhaps can’t even keep them under our control, they’ll have to value the right things at a deep level, in a way that doesn’t depend on background assumptions about what the world is like, because many of those assumptions may break.
Making that happen is an unsolved problem. Arguments about possible solutions to alignment get very complex and technical. But as we’ll see later in this introduction, many of the people who have thought deeply about AI and AI alignment think that we are not on track to find a solution in time, and that the default outcome is a catastrophe.
Caption: Misalignment between our intent and our creations is a real problem, with often disastrous consequences. Hang your frame at the wrong angle? Your room floods. Fail to center your HTML element? Your readers’ necks break. Create a superintelligent AI with the wrong goals? Your species goes extinct. That last one was not a joke.
Some of the main difficulties are:
- We can’t see what an AI values, because current AI is not designed the way a web browser or an operating system is. Instead, you can think of it as being “grown.” Human programmers design the process that grows the AI, but that process consists of vast numbers of computations that each automatically make a small adjustment to the AI, based on what works best on some piece of training data (the first sketch after this list shows a miniature version of such a loop). The result is something like an alien artifact made of skyscraper-sized spreadsheets full of numbers. We can see what all the numbers are, and we know they have some sort of patterns in them that result in sensible tax advice and cookie recipes. But we don’t understand those patterns or how they work together very well. In that sense, we’re looking at a “black box.” And even if we do come to better understand what’s going on inside an AI, that understanding makes it easier not only to align the system but also to improve its capabilities, and thereby its dangers.
- More importantly, we don’t know what to tell the AI to value. We can say what we care about in a rough, inexact way. But human values are complex, and if your AI is superintelligently pursuing a goal, getting that goal slightly wrong can ruin everything. You can write a page specifying what it means for a car to be a good car, forget to mention noise, and end up with a great car that blasts your ears off. We could attempt to have the AIs themselves figure out what we care about, and we’ve had some success with current, not-very-powerful AI, by doing things like training it to maximize thumbs-up reactions from raters. But again, a slightly wrong criterion would eventually, in extreme circumstances, cause disastrous outcomes (the second sketch after this list shows how hard optimization magnifies small errors in the criterion), and experts don’t think they know how to solve this problem.
- And crucially, we don’t know how to get the AI to actually value what we tell it. Even if we specified our values exactly right, we’d still face the problem of how to “load them in.” Training an AI to act morally may just produce a system that pretends to act morally and avoids getting caught breaking the rules. If we understood how the system worked, we could inspect it and tell the difference. But as discussed above, we understand very little about current systems.
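To make the “grown, not designed” point from the first bullet concrete, here is a minimal sketch of a training loop, written in plain Python with NumPy. Everything in it (the model, the data, the numbers) is invented for illustration; real systems differ mainly in scale, with billions of weights instead of three, not in kind:

```python
# Minimal sketch of how a model is "grown": a tiny gradient-descent loop.
# All data and parameters here are toy values, invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: inputs x and targets y generated by a hidden rule.
x = rng.normal(size=(1000, 3))
true_rule = np.array([1.5, -2.0, 0.5])
y = x @ true_rule + rng.normal(scale=0.1, size=1000)

# The "AI" is just a vector of numbers. The programmer never sets these
# by hand; the loop below adjusts them automatically.
weights = np.zeros(3)
learning_rate = 0.01

for step in range(500):
    batch = rng.integers(0, 1000, size=32)       # a small piece of training data
    prediction = x[batch] @ weights
    error = prediction - y[batch]
    gradient = x[batch].T @ error / len(batch)   # which direction "works best"
    weights -= learning_rate * gradient          # one small automatic adjustment

# We can print every number in the trained model...
print(weights)
# ...but for a model with billions of weights, the numbers alone don't
# tell us what it "values" or how it will behave in new situations.
```

The programmer writes the growing process, not the result. At three weights you can read the answer off directly; at billions, you get the skyscraper-sized spreadsheets described above.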
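And here is a toy numerical illustration of the second bullet’s worry about optimizing an imperfect criterion like “maximize thumbs-up.” It is not a model of any real training pipeline, just the simplest possible setup where a rating imperfectly tracks what we actually care about:

```python
# Toy illustration: optimizing hard on an imperfect rating ("thumbs-up")
# increasingly selects for the rating's errors. All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)

# Each candidate output has a hidden true helpfulness, U.
# Raters see something correlated but imperfect: the thumbs-up score V.
for k in (100, 10_000, 1_000_000):
    U = rng.normal(size=k)        # true helpfulness of each candidate
    V = U + rng.normal(size=k)    # thumbs-up score: truth plus rater error
    chosen = np.argmax(V)         # optimize as hard as possible on the proxy
    print(f"{k:>9} candidates: proxy score {V[chosen]:5.2f}, "
          f"true helpfulness {U[chosen]:5.2f}, "
          f"gap {V[chosen] - U[chosen]:5.2f}")
```

The more candidates the optimizer searches over, the higher the winning rating climbs, but the gap between what the rating claims and what is actually true grows right along with it. A weak optimizer using a slightly wrong criterion does fine; a superintelligent one squeezes its gains precisely out of the criterion’s mistakes.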
Finally, on a higher level, the problem is hard because of some features of the strategic landscape, which the end of this introduction will discuss further. One such feature is that we may have only one chance to align a powerful AI, instead of trying over and over until we get it right. This is because superintelligent systems that end up with goals different from ours may work against us to achieve those goals.