Could we program an AI to automatically shut down?

One way to make an AI system safer might be to include an off switch, so that we can turn it off if it does anything we don’t like. Unfortunately, this only works if we can turn it off before things get bad, and we might not manage that if the AI thinks much faster than humans do.

But, even assuming that we’ll notice in time, an off switch might still not work.

Humans have a lot of “off switches”[1]. Humans also have a strong preference not to be “turned off”; they defend their “off switches” when other people try to press them. One possible reason is that humans intrinsically prefer not to die, but humans might care about self-preservation for instrumental reasons as well.

Suppose there is a parent who cares nothing for their own life and cares only for the life of their child. If you tried to kill this parent, they would try to stop you, not because they value their own life, but because their death would leave one fewer person to protect their child.

For similar reasons, it turns out to be difficult to install an off switch on a powerful AI system in a way that doesn’t result in the AI working to prevent itself from being turned off.
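
To make the instrumental argument concrete, here is a minimal toy sketch. It is an illustration under made-up assumptions, not a model of any real AI system: the function names, the two actions, and all the numbers are hypothetical. The agent's utility counts only whether its goal gets achieved and assigns no value at all to its own survival, yet it still prefers to disable its off switch whenever the chance of being shut down is large enough.

```python
# Toy expected-utility agent. All names and numbers are hypothetical.
# The agent's utility depends only on whether its goal is achieved;
# there is no term for the agent's own survival.

def expected_utility(action, p_shutdown, goal_value, disable_cost):
    """Expected utility of an action for a purely goal-directed agent."""
    if action == "allow_shutdown":
        # The goal is achieved only if nobody presses the off switch.
        return (1.0 - p_shutdown) * goal_value
    if action == "disable_switch":
        # The goal is always achieved, minus the effort spent disabling the switch.
        return goal_value - disable_cost
    raise ValueError(f"unknown action: {action}")


def best_action(p_shutdown, goal_value, disable_cost):
    """Pick the action with the highest expected utility (ties favor allowing shutdown)."""
    actions = ["allow_shutdown", "disable_switch"]
    return max(
        actions,
        key=lambda a: expected_utility(a, p_shutdown, goal_value, disable_cost),
    )


if __name__ == "__main__":
    # With a 10% chance of shutdown: allowing it yields 90, disabling yields 99.
    print(best_action(p_shutdown=0.1, goal_value=100.0, disable_cost=1.0))
    # With no chance of shutdown, disabling the switch is pure cost.
    print(best_action(p_shutdown=0.0, goal_value=100.0, disable_cost=1.0))
```

In this sketch the agent disables the switch exactly when `p_shutdown * goal_value` exceeds `disable_cost`. Like the parent in the example above, it protects itself only because being turned off would get in the way of what it actually cares about.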

Ideally, you would want a system that knows that it should stop doing whatever it’s doing when someone tries to turn it off. The technical term for this is ‘corrigibility’; roughly speaking, an AI system is corrigible if it doesn’t resist human attempts to help and correct it. People are working hard on trying to make this possible, but it’s currently not clear how we would do this even in simple cases.

  1. AKA killing them