How plausible is it that AGI alignment is impossible in principle?

Roman Yampolskiy has argued (in Yampolskiy 2022, and elsewhere) that the "AI Control Problem" is provably unsolvable. He considers four types of "control" we might have over an AI, and argues that none of them provide robust safety:

  • Explicit control – The AI carries out commands literally. For instance, an AI controlling a self-driving car that's told "stop the car" would immediately stop the car.

  • Implicit control – The AI carries out commands while taking into account some (unspoken) situational factors. For instance, if told "stop the car", the AI might attempt to pull over at the first safe opportunity, e.g. on the road's shoulder. In essence, the AI exhibits "common sense" while following commands.

  • Aligned control – The AI interprets the human command giver's intentions, based on its model of the human, and tries to do what it thinks the human really wants. For instance, if told "stop the car", the AI might understand that the human needs to use the restroom and pull over at the next rest stop.

  • Delegated control – The AI doesn’t wait to be given commands. Instead, it just does what it thinks is best for the human, based on its model of the human. Ideally, we've aligned the system so that it knows what would make us happy, safe, etc., but ultimately the AI is "in control", rather than us.

Yampolskiy argues that none of these modes of control satisfies all of the criteria for control and safety that we would want in an aligned AGI.
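
To make the distinctions between the four modes concrete, here is a minimal, purely illustrative sketch (not taken from Yampolskiy 2022) that frames each mode as a different strategy for interpreting the same command, "stop the car". All names and the toy context dictionary are hypothetical.

    # Illustrative sketch only: the four control modes as different
    # strategies for interpreting the command "stop the car".
    from enum import Enum, auto


    class ControlMode(Enum):
        EXPLICIT = auto()   # execute the literal command
        IMPLICIT = auto()   # literal command plus situational common sense
        ALIGNED = auto()    # infer and pursue the commander's intention
        DELEGATED = auto()  # act on the AI's own model of the human's good


    def respond_to_stop_command(mode: ControlMode, context: dict) -> str:
        """Return the (hypothetical) action a car-controlling AI takes for 'stop the car'."""
        if mode is ControlMode.EXPLICIT:
            return "brake immediately, wherever the car happens to be"
        if mode is ControlMode.IMPLICIT:
            return "pull over at the first safe spot, e.g. the road's shoulder"
        if mode is ControlMode.ALIGNED:
            # Uses a model of *why* the human asked, e.g. they need a restroom.
            if context.get("inferred_need") == "restroom":
                return "continue to the next rest stop, then stop"
            return "stop in whatever way best serves the inferred intention"
        # DELEGATED: no command is needed at all; the AI decides when and whether to stop.
        return "stop (or not) whenever the AI's model of the human's welfare says to"


    if __name__ == "__main__":
        for mode in ControlMode:
            print(mode.name, "->", respond_to_stop_command(mode, {"inferred_need": "restroom"}))

The point of the sketch is only that the same instruction maps to increasingly indirect behavior as control shifts from the human to the AI's own models; it says nothing about whether any of these modes is actually safe, which is what Yampolskiy disputes.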

Other claims that AGI alignment (or some key version of it) is or may be impossible include:

One common objection argues that it is impossible to align an AI with the values of "humanity" in general, because of broad and (allegedly) unresolvable conflicts between the values of different human beings.