But won't we just design AI to be helpful?
One might think that aligning an AGI is simply a matter of designing it to be helpful. But genuine alignment is a higher bar than helpful-seeming behavior: in the terminology of mesa-optimization, a mesa-optimizer is robustly aligned with the base objective only if it robustly optimizes for that objective across distributions, not just in the situations it was trained on.
Current AI models might be sufficiently aligned to be useful, but despite the best efforts of the labs that produce them, they still suffer from issues such as bias and susceptibility to adversarial prompts. Furthermore, resolving these issues may not be enough: there are reasons to expect that, as AI systems become more capable, systems that seem to display aligned behavior in some environments could turn out to be dangerously misaligned in others. Dive into the related articles to see what makes AI alignment a hard problem.