If you've been learning about AI alignment for even a little while, you might've had a thought like "Why can't we just do [this thing that seems like it would solve the problem]?"
Unfortunately, many AI alignment and safety proposals that initially look like solutions to AI risk turn out to have hidden difficulties. It’s surprisingly easy to come up with “solutions” which don’t actually solve the problem. Some intuitive alignment proposals (which are generally agreed to be inadequate) include:
- Why can’t we just turn the AI off if it starts to misbehave?
- Why can’t we just tell the AI to figure out what we want and then do that?
- Why can’t we just tell the AI to figure out right from wrong itself?
- Why can’t we just use a more powerful AI to control a potentially dangerous AI?
- Why can’t we just treat AI like any other dangerous technological tool?
At a high level, some common pitfalls with alignment proposals are that the proposed solution:

- …requires human observers to be smarter than the AI. Many safety measures only work while an AI is relatively weak, but break once it reaches a certain level of capability (for many reasons, e.g. deceptive alignment).
- …appears to make sense in natural language, but when properly unpacked is not philosophically clear enough to be usable.
- …is philosophically coherent, but we have no idea how to turn it into computer code (or whether that’s even possible).
- …is something we simply can’t do.
- …can be done, but doesn’t solve the problem.
- …only solves a subcomponent of the problem, leaving the core problem unresolved.
- …solves the problem only as long as the AI stays “in distribution” with respect to its original training data (distributional shift will break it).
- …might work eventually, but can’t be expected to work on the first try (and we’ll likely only get one try at aligning a superintelligence!).
See the related questions below for some proposals which come up often, or check out John Wentworth’s sequence on this topic.