Aren't LLMs good at understanding what humans want?
Recent advances in the performance of LLMs such as GPT-4 have hinted that instilling the complexity of human values into AIs might be easier than previously thought.
Historically, some researchers have feared that advanced AIs might act on a gross misunderstanding of their developers' intentions, for example by "curing cancer" by killing everyone, thereby bringing the number of cancer cases in humans to zero. But recent LLMs generally avoid suggesting such solutions, and we might expect future iterations of these models to get even better at screening whether a proposed solution would, in retrospect, be judged good by an average human. If this hypothesis pans out, many of the issues that stem from the problem of outer alignment might be solved automatically.
Here are some possible issues with this optimistic picture:
- The RLHF methods used to align most modern LLMs require a large amount of direct human feedback, and it's unclear whether this approach can scale all the way to a superintelligence (see the illustrative sketch after this list).
- Although these methods prevent bad suggestions in general, current LLMs still occasionally have disastrous failure cases in which they suggest things that no human would do.
- We are not sure that this apparent alignment would withstand large capability increases, such as in the case of a sharp left turn.
- Even if future AIs understand what humans want, there is no guarantee that they will act to fulfill those preferences.
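
To make the first point more concrete, here is a minimal, purely illustrative sketch of the kind of pairwise preference loss used to train RLHF reward models. It is not the training code of any real system: the linear "reward model", the feature vectors, and the random-search fit are toy stand-ins. What it does show is that every training example is a comparison a human had to make, which is the step that may not scale to tasks humans cannot reliably evaluate.

```python
# Toy sketch of reward-model training from human preference pairs (Bradley-Terry loss).
# All data and the "model" are made up for illustration; no real RLHF pipeline looks like this.
import numpy as np

rng = np.random.default_rng(0)

def reward(features: np.ndarray, w: np.ndarray) -> float:
    """Toy linear reward model scoring a response's feature vector."""
    return float(features @ w)

def preference_loss(w, chosen, rejected):
    """Sum over pairs of -log sigmoid(r_chosen - r_rejected)."""
    diffs = np.array([reward(c, w) - reward(r, w) for c, r in zip(chosen, rejected)])
    return float(np.sum(np.log1p(np.exp(-diffs))))  # equals -log sigmoid(diff)

# Each pair below represents a human reading two model responses and picking the better one;
# collecting these comparisons is the labor-intensive part of RLHF.
dim, n_pairs = 4, 64
chosen = [rng.normal(size=dim) + 0.5 for _ in range(n_pairs)]    # human-preferred responses
rejected = [rng.normal(size=dim) for _ in range(n_pairs)]        # human-dispreferred responses

# Crude random-search fit, just to show the loss falling as the model absorbs the labels.
w = np.zeros(dim)
for _ in range(200):
    candidate = w + rng.normal(scale=0.1, size=dim)
    if preference_loss(candidate, chosen, rejected) < preference_loss(w, chosen, rejected):
        w = candidate

print("final preference loss:", round(preference_loss(w, chosen, rejected), 3))
```

The takeaway is that the reward signal is only as good and as plentiful as the human comparisons feeding it, which is why it is unclear whether this kind of supervision can keep up with systems far more capable than their overseers.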