What is Dylan Hadfield-Menell's thesis on?
Hadfield-Menell's PhD thesis argues three main claims (paraphrased):
- Outer alignment failures are a problem.
- We can mitigate this problem by making the system uncertain about its objective.
- We can model this setup as Cooperative Inverse Reinforcement Learning (CIRL).
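The core mechanism behind the second and third claims can be illustrated with a toy sketch (my own illustration, not code from the thesis): the robot keeps a posterior over candidate human reward functions, updates it by watching the human act (assuming a Boltzmann-rational human, as is common in this literature), and then acts to maximize expected reward under that posterior. All reward values and hypothesis names here are hypothetical.

```python
import numpy as np

# Two candidate reward vectors over three actions (hypothetical values).
candidate_rewards = np.array([
    [1.0, 0.0, 0.5],   # hypothesis A: human prefers action 0
    [0.0, 1.0, 0.5],   # hypothesis B: human prefers action 1
])
posterior = np.array([0.5, 0.5])  # uniform prior over the two hypotheses

def boltzmann(rewards, beta=2.0):
    """Probability the human picks each action if `rewards` is their true reward."""
    z = np.exp(beta * rewards)
    return z / z.sum()

def update(posterior, observed_action):
    """Bayesian update of the hypothesis posterior after observing the human act."""
    likelihoods = np.array(
        [boltzmann(r)[observed_action] for r in candidate_rewards]
    )
    post = posterior * likelihoods
    return post / post.sum()

# The human chooses action 1; hypothesis B explains this better,
# so the posterior shifts toward B.
posterior = update(posterior, observed_action=1)

# The robot acts to maximize expected reward under its posterior.
expected_rewards = posterior @ candidate_rewards
best_action = int(np.argmax(expected_rewards))
```

The point of the sketch is the qualitative behavior: because the robot is uncertain about the objective, human actions carry information, and the robot defers to that evidence instead of optimizing a fixed, possibly misspecified reward.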
Thus, his motivation seems to be a model of AGI that arrives in some multi-agent form and is tightly coupled to human operators.
Some recent alignment-relevant papers that he has published include: