Why might a maximizing AI cause bad outcomes?

3 min read

Suggest changes in Google Docs

Computers only do what you tell them. But any programmer knows that this is precisely the problem: computers do exactly what you tell them, with no common sense or attempts to interpret what the instructions really meant. If you tell a human to cure cancer, they will instinctively understand how this interacts with other desires and laws and moral rules; if a maximizing AI has the sole goal of trying to cure cancer, it will literally just want to cure cancer.

Giving a superintelligence an open-ended goal (such as calculating as many digits of pi as possible within one year) without ensuring that human values are considered in the reward function will usually lead to disaster.

To take a deliberately extreme example: suppose someone programs a superintelligence to calculate as many digits of pi as it can within one year. And suppose that, with its current computing power, it can calculate one trillion digits during that time. It can either accept one trillion digits, or spend a month trying to figure out how to get control of the TaihuLight supercomputer, which can calculate two hundred times faster. Even if it loses a little bit of time in the effort, and even if there’s a small chance of failure, the payoff — two hundred trillion digits of pi, compared to a mere one trillion — is enough to make the attempt. But on the same basis, it would be even better if the superintelligence could control every computer in the world and set it to the task. And it would be better still if the superintelligence controlled human civilization, so that it could direct humans to build more computers and speed up the process further.

Now we’re in a situation where a superintelligence is incentivized to take over the world as an instrumental goal. Taking over the world allows it to calculate more digits of pi than any other option, so without an architecture based around understanding human instincts and counterbalancing considerations, even a goal like “calculate as many digits of pi as you can” would be potentially dangerous.

What is a "quantilizer"?

What are "mesa-optimizers"?

What is Goodhart's law?