Why is AI alignment a hard problem?

Alignment is hard like rocket science is hard.

> When you put a ton of stress on an algorithm by trying to run it at a smarter-than-human level, things may start to break that don’t break when you are just making your robot stagger across the room.

Rocket science is hard because of the extreme stresses placed on mechanical components. Similarly, running a program in circumstances where it rapidly gains capabilities and becomes extremely intelligent (both in absolute terms and relative to humans) places “stress” on its design: the program is exposed to major new categories of things that can change and go wrong.

Alignment is hard like aiming a space probe is hard: you only get one shot.

> "You may have only one shot. If something goes wrong, the system might be too 'high' for you to reach up and suddenly fix it. You can build error recovery mechanisms into it; space probes are supposed to accept software updates. If something goes wrong in a way that precludes getting future updates, though, you’re screwed. You have lost the space probe."Eliezer Yudkowsky

In the case of alignment, if a recursively self-improving system fails catastrophically, it will competently deceive and resist you, so its code will be “out of reach” and you won’t be able to go back and edit it.

Alignment is hard like cryptographic security is hard: you are defending against intelligent adversaries.

> “Your [AI] code is not an intelligent adversary if everything goes right. If something goes wrong, it might try to defeat your safeguards…”

Cryptographers face attackers who search for flaws in a system that they can exploit to break it. And by the stage where your code is trying to defeat your safeguards, it may have achieved the capabilities of a vast and perfectly coordinated team of superhuman-level hackers! So even the tiniest flaw in your design is likely to be found and exploited. As with standard cybersecurity, “good under normal circumstances” is just not good enough – you are facing an intelligent adversary trying to create abnormal circumstances, and so your system needs to be unbreakably robust.
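To make the “good under normal circumstances is not good enough” point concrete, here is a small illustrative sketch (not from the quoted material) using a classic example from cryptographic security: a secret check that behaves correctly for every ordinary user, yet leaks information to an adversary who measures how long it takes, alongside a constant-time version that does not.

```python
# Illustrative sketch: average-case correctness vs. adversarial robustness.

import hmac


def naive_check(secret: str, guess: str) -> bool:
    # Returns as soon as a character mismatches. Functionally correct, and
    # harmless under normal circumstances -- but an adversary who times many
    # requests can recover the secret one character at a time.
    if len(secret) != len(guess):
        return False
    for a, b in zip(secret, guess):
        if a != b:
            return False
    return True


def robust_check(secret: str, guess: str) -> bool:
    # Constant-time comparison: the running time does not depend on how many
    # leading characters match, so timing measurements reveal nothing.
    return hmac.compare_digest(secret.encode(), guess.encode())
```

The naive version passes every functional test; the flaw only matters against an attacker deliberately creating abnormal circumstances. That is the kind of robustness standard the cryptographic analogy points at: the design must hold up even when something smarter than its tests is searching for the one place it breaks.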

Further reading: