What is the orthogonality thesis?
The orthogonality thesis is the claim that any1 level of intelligence is compatible with any terminal goals (goals which are valued as ends in themselves, rather than as means to some other end).
This means we can’t assume that an AI system that is as smart as or smarter than humans will automatically be motivated by human values.
On its own, the orthogonality thesis only states that unaligned superintelligence (an AI with cognitive abilities far greater than those of humans in a wide range of important domains) is possible.
In addition to this “weak” version of the thesis, people have considered stronger versions. Eliezer Yudkowsky’s “strong form” of the orthogonality thesis says that creating AI systems with arbitrary goals is not only possible, but involves no special difficulty; in other words, that “preferences are no harder to embody than to calculate”.2
While the orthogonality thesis is broadly accepted by the alignment research community, it has drawn criticism from a few distinct directions:
- Some moral realists assert that a sufficiently intelligent entity would discover and adhere to objective moral truths that humans would endorse upon reflection.
- Beren Millidge claims that the strong form of the orthogonality thesis is false within modern deep learning algorithms.
- Nora Belrose contends that, depending on how the thesis is interpreted, it’s either trivial, false, or unintelligible.
- Steve Petersen argues that the need for continuity of an agent’s goals over time, together with the seemingly intrinsic underspecification of goal representations, might lead AIs that have complex goals, and that can understand their human creators, to regard those creators as previous versions of themselves and to aim to further the creators’ goals.
1. What’s mostly meant here is that arbitrarily high levels of intelligence are compatible with any goal. Systems with low levels of intelligence might not be able to represent goals at all. For example, it’s hard to understand what it would even mean for a rock to "have a goal." ↩︎
2. Even in its strong form, the orthogonality thesis makes no predictions about which systems people will actually try to build, and therefore makes no predictions about which systems will end up existing in reality. People have sometimes misunderstood the orthogonality thesis to mean something like “the system resulting from a real-world AI design process is equally likely to end up with any set of goals”, but this is not implied by the thesis. ↩︎