If you've been learning about AI alignment for even a little while, you might've had a thought like "Why can't we just do [this thing that seems like it would solve the problem]?"

Unfortunately, many AI alignment and safety proposals that initially look like solutions to AI risk turn out to have hidden difficulties. It’s surprisingly easy to come up with “solutions” which don’t actually solve the problem, and many intuitive alignment proposals are generally agreed to be inadequate.

At a high level, some common pitfalls with alignment proposals include that the proposed solution:

  • …requires human observers to be smarter than the AI. Many safety solutions only work when an AI is relatively weak, but break when the AI reaches a certain level of capability (for many reasons, e.g. deceptive alignment).

  • …appears to make sense in natural language, but when properly unpacked is not philosophically clear enough to be usable.

  • …is philosophically coherent, but we have no idea how to turn it into computer code (or whether that’s even possible).

  • …is something which we can’t actually do.

  • …is something we can do, but doesn’t solve the problem.

  • …solves only a subcomponent of the problem, leaving the core problem unresolved.

  • …solves the problem only as long as we stay “in distribution” with respect to the original training data (distributional shift will break it).

  • …might work eventually, but we can’t expect it to work on the first try (and we'll likely only get one try at aligning a superintelligence!).

See the related questions below for some proposals which come up often, or check out John Wentworth’s sequence
