What concepts underlie existential risk from AI?
Theorizing about existential risk from AI draws on existing concepts from various fields and has also produced concepts of its own.
For example, one possible case for misalignment combines the orthogonality thesis, Goodhart’s law, and instrumental convergence. Richard Ngo and Eliezer Yudkowsky have made other attempts to characterize the core of the problem.
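To make Goodhart’s law a little more concrete, here is a minimal toy sketch, not drawn from any particular source: an optimizer that pushes as hard as possible on a proxy metric can end up far from the true optimum once the proxy and the true objective come apart. The functions `true_value` and `proxy_value` and all the numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def true_value(x):
    # The outcome we actually care about: improves with x at first,
    # then falls off once x gets too large.
    return x - 0.1 * x**2

def proxy_value(x):
    # The measured metric: tracks the true objective for small x,
    # but keeps rewarding larger x indefinitely.
    return x

# Naively optimize the proxy over a range of candidate "actions".
candidates = np.linspace(0, 20, 201)
proxy_best = candidates[np.argmax(proxy_value(candidates))]
true_best = candidates[np.argmax(true_value(candidates))]

print(f"proxy-optimal action: x = {proxy_best:.1f}")
print(f"true value there:     {true_value(proxy_best):.1f}")
print(f"true optimum:         x = {true_best:.1f}")
```

In this toy setup the proxy optimizer picks the largest available x, where the true value is strongly negative, while the true optimum lies at a moderate x; the harder the proxy is optimized, the worse the real outcome gets.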
Some broad categories into which we can group related concepts are:
- Intelligent systems, e.g. intelligence, agency, capabilities, optimization, coherence, subagents, mesa-optimization
- Outcomes, e.g. long reflection, x-risk, s-risk, mindcrime, paperclip maximizers, accident vs. misuse
- AI power, e.g. takeover, pivotal acts, singleton, power-seeking, decisive strategic advantage, treacherous turn, fire alarms, warning shots, takeoff, AI boxing, robust agent-agnostic processes, sharp left turn
- AI goals, e.g. inner versus outer alignment, reward hacking, goal misgeneralization, wireheading, specification gaming, corrigibility and interruptibility