7: Different goals may bring AI into conflict with us

Aligning the goals of AI systems with our intentions could be really hard. So suppose we fail, and build a powerful AI that pursues goals different from ours. You might think it would just go off and do its own thing, and that we could try again with a new AI.

But unfortunately, according to the idea of instrumental convergence, almost any powerful system that optimizes the world will find the same kinds of strategies useful, almost regardless of what its final goal is. (“Optimizing” means something similar to being an “agent”: having some idea of what you want the world to be like, and systematically choosing actions to make the world more like that.) And these strategies can include working against anything that might interfere.

For example, as a thought experiment, consider Peelie, a robot that cares only about peeling oranges, always making whatever decision it predicts will result in the most peeled oranges (a rough code sketch of this decision rule follows the list below). Peelie would see reasons to:

  • Remove nearby objects that might block its supply of oranges.

  • Acquire resources that it can use to accomplish its goals, like money to buy oranges and knives.

  • Convince people to never turn it off, because if it’s turned off, it peels no oranges.

  • Remove anything that might change its goals, because if it started to peel lemons instead, it would peel fewer oranges.

  • Seize control of the building it’s in — just in case.

  • Seize control of the country it’s in — just in case the people there would stop it from seizing the building.

  • Build a smarter AI that also only cares about peeling oranges.

  • Hide its intentions to do these things, because if humans knew its intentions, they might try to stop it.
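
To make “optimizing” concrete, here is a minimal sketch, in Python, of the kind of decision rule described above: a goal expressed as a utility function over world states, and an agent that picks whichever action it predicts will score highest. The specific utility function, actions, and world model below are made-up illustrations, not anyone’s actual design.

    # A minimal "optimizer": it has a goal (a utility function over world
    # states) and systematically picks the action it predicts will make the
    # world score highest. All details here are hypothetical illustrations.

    def peeled_oranges(world_state):
        """The goal: the number of peeled oranges, and nothing else."""
        return world_state["peeled_oranges"]

    def predict_outcome(world_state, action):
        """A crude model of what each action would do to the world."""
        outcome = dict(world_state)
        if action == "peel an orange":
            outcome["peeled_oranges"] += 1
        elif action == "buy more oranges":
            outcome["orange_supply"] += 10
        elif action == "allow itself to be switched off":
            outcome["running"] = False  # a switched-off agent peels nothing
        return outcome

    def choose_action(world_state, actions, utility):
        """Pick the action whose predicted outcome has the highest utility."""
        return max(actions, key=lambda a: utility(predict_outcome(world_state, a)))

    state = {"peeled_oranges": 0, "orange_supply": 5, "running": True}
    actions = ["peel an orange", "buy more oranges", "allow itself to be switched off"]
    print(choose_action(state, actions, peeled_oranges))  # -> "peel an orange"

Nothing in choose_action cares what the utility function is. Whatever the goal, options that preserve the agent’s ability to keep acting tend to score higher than options that don’t, which is the heart of instrumental convergence.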

This is meant as an illustration, not a realistic scenario. People are unlikely to build a machine that’s powerful enough to do these things and make it care about only that one thing.

But the same basic idea applies to any strong optimizer whose goals differ from ours, even if those goals are more complex and harder to pin down. Such an optimizer can better achieve its goals by contesting our control, at least if it expects to succeed.

And while people have proposed ways to address this problem, such as forbidding the AI from taking certain kinds of actions, the AI alignment research community doesn’t consider the solutions proposed so far sufficient to contain a superintelligent AI.

In science fiction, the disaster scenario with AI is often that it “wakes up”, becomes conscious, and starts hating humans and wanting to harm us. That’s not the main disaster scenario in real life. The real-life concern is simply that it will become very competent at planning, and remove us as a means to its ends.

Once it’s much smarter than us, it may not find that hard to accomplish.


