Can we tell an AI just to figure out what we want and then do that?

2 min read

Suggest changes in Google Docs

Suppose we tell the AI: "Cure cancer — and look, we know there are lots of ways this could go wrong, but you’re smart, so instead of looking for loopholes, cure cancer the way that I, your programmer, want it to be cured."

AIs can be very creative in unintended ways and are prone to edge instantiation.

A superintelligence would have extraordinary powers of social manipulation and might be able to hack human brains directly. With that in mind, which of these two strategies cures cancer most quickly: (1) develop medications and cure it the intended way; (2) manipulate its programmer into wanting the world to be nuked, then nuke the world to get rid of all cancer, all the while doing what the programmer wants?

Nineteenth century philosopher Jeremy Bentham once postulated that morality was about maximizing overall human happiness ("it is the greatest happiness of the greatest number that is the measure of right and wrong"). Later philosophers found a flaw in his suggestion: it implied that a maximally moral course of action might be to kidnap people, do brain surgery on them, and electrically stimulate their reward system directly, giving them maximal amounts of pleasure but leaving them as blissed-out zombies. Luckily, humans have common sense, so most of Bentham’s philosophical descendants have abandoned this formulation.

Superintelligences, however, do not have common sense unless we give it to them. Working solely according to Bentham’s formulation, they would quite possibly take over the world and force all humans to receive constant brain stimulation. Any command based on "do what we want" or "do what makes us happy" is practically guaranteed to fail in this way; it’s almost always easier to convince one person of something — or, if all else fails, to physically alter their brain — than it is to solve a big problem like curing cancer.

Can we tell an AI just to figure out what we want and then do that?

In progress