Why might we expect a superintelligence to be hostile by default?

3 min read

Suggest changes in Google Docs

AI systems execute our precise commands without considering the many factors humans instinctively consider to determine if an action is acceptable to us. This can lead to potentially dangerous outcomes from seemingly simple instructions and goals.

One might argue: computers only do what we command them; no more, no less. So while it might be bad if terrorists or enemy countries develop superintelligence first, if good actors develop superintelligence first, there’s no problem: we can just instruct it to do the things we want. This argument is intuitively reassuring but probably false because our instructions are likely to be imperfect.

Suppose we wanted a superintelligence to cure cancer. How might we specify the goal “cure cancer”? If we knew every individual step, then we could cure cancer ourselves. Instead, we have to give it the goal of curing cancer, and trust the superintelligence to come up with intermediate actions that further that goal.

If your only goal is “eradicating cancer”, and you lack humans’ instincts for the thousands of other important considerations, a relatively easy solution might be to release an engineered virus and kill everyone in the world. This satisfies all the AI’s goals: it reduces cancer down to zero, it’s very fast, and it has a high probability of success. This type of specification gaming can be expected for every type of problem given to a superintelligence, because precisely specifying all the things we care about is very difficult. Less disastrous versions of this type of problem, specification gaming, has been observed in many AI systems.

It’s worth noting that the “kill all humans” solution might be “hostile” to us, but unlike the AI in some works of fiction such as The Terminator, the cancer-curing AI’s main aim is not to harm us. It simply has objectives that can be satisfied by killing us, and it won’t avoid doing so unless explicitly designed that way. As an analogy, the well being of ants is rarely considered when human engineers flood a plain to build a dam. The engineers aren’t explicitly hostile to the ants, they just don’t value their lives much. We don’t want humanity to suffer the fate of the ants in this situation.

To recap, simple goal architectures are likely to go very wrong and look downright hostile unless tempered by common sense and a broader understanding of what we do and do not value.

What is "Do what I mean"?

What is perverse instantiation?

Isn’t AI just a tool like any other? Won’t it just do what we tell it to?

Why might we expect a superintelligence to be hostile by default?

In progress