What are some AI alignment research agendas currently being pursued?

Research at the Alignment Research Center is led by Paul Christiano, best known for introducing the “Iterated Distillation and Amplification” and “Humans Consulting HCH” approaches. He and his team are now “trying to figure out how to train ML systems to answer questions by straightforwardly ‘translating’ their beliefs into natural language rather than by reasoning about what a human wants to hear.”

Chris Olah co-founded Anthropic, an AI lab focused on the safety of large models. While his previous work concerned the “transparency” and “interpretability” of large neural networks, especially vision models, Anthropic focuses more on large language models and is, among other things, working towards a “general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless”.

Stuart Russell and his team at the Center for Human-Compatible Artificial Intelligence (CHAI) have been working on inverse reinforcement learning (where the AI infers human values from observing human behavior) and corrigibility, as well as on attempts to disaggregate neural networks into “meaningful” subcomponents (see Filan et al.’s “Clusterability in neural networks” and Hod et al.’s “Detecting modularity in deep neural networks”).

Alongside the more abstract “agent foundations” work it has become known for, the Machine Intelligence Research Institute (MIRI) runs the “Visible Thoughts Project”, which tests the hypothesis that “Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.”

OpenAI works on iteratively summarizing books (summarizing, and then summarizing the summary, etc.) as a method for scaling human oversight.
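The "summarize, then summarize the summary" loop can be illustrated with a minimal sketch. This is not OpenAI's implementation: the function names, the chunk sizes, and the placeholder `toy_summarize` (which just keeps the first few words of a passage) are all assumptions made for illustration. In a real system the summarizer would be a trained model, and each short summarization step is what a human would check.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def toy_summarize(passage: str, max_words: int = 8) -> str:
    """Hypothetical stand-in for a learned summarizer: keep the first words."""
    return " ".join(passage.split()[:max_words])


def recursive_summary(text: str,
                      summarize: Callable[[str], str] = toy_summarize,
                      chunk_size: int = 32,
                      target_words: int = 16,
                      max_rounds: int = 20) -> str:
    """Summarize chunks, join the summaries, and repeat until short enough."""
    for _ in range(max_rounds):  # guard against a summarizer that never shrinks
        if len(text.split()) <= target_words:
            break
        chunks = chunk_text(text, chunk_size)
        text = " ".join(summarize(chunk) for chunk in chunks)
    return text


if __name__ == "__main__":
    book = " ".join(f"w{i}" for i in range(1000))  # stand-in for a long book
    print(recursive_summary(book))
```

The point of the structure is that no single step requires reading the whole book: each call to the summarizer sees only one short chunk, so each step stays small enough for a person to evaluate.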

Stuart Armstrong’s recently launched AlignedAI works on concept extrapolation from familiar to novel contexts, something he believes is “necessary and almost sufficient” for AI alignment.

Redwood Research (Buck Shlegeris, et al.) are trying to constrain GPT-3 to only produce non-violent completions of text prompts. “The idea is that there are many reasons we might ultimately want to apply some oversight function to an AI model, like ‘don't be deceitful’, and if we want to get AI teams to apply this we need to be able to incorporate these oversight predicates into the original model in an efficient manner.”
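As a rough illustration of what applying an oversight predicate looks like in its simplest form, the sketch below filters candidate completions through a predicate after they are generated. Everything here is hypothetical: the keyword-based `looks_violent` check stands in for a learned classifier, and the candidate strings stand in for model samples. It is not Redwood's method; the quoted goal is to build the predicate into the model efficiently, whereas an external filter like this one is only the naive baseline.

```python
from typing import Callable, Iterable, Optional

# Hypothetical keyword list standing in for a trained violence classifier.
VIOLENT_WORDS = {"stab", "shoot", "attack"}


def looks_violent(text: str) -> bool:
    """Toy oversight predicate: flag text containing a listed word."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & VIOLENT_WORDS)


def first_acceptable(candidates: Iterable[str],
                     rejects: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate the predicate accepts, or None if all fail."""
    for candidate in candidates:
        if not rejects(candidate):
            return candidate
    return None


if __name__ == "__main__":
    samples = [
        "He drew a knife to stab the guard.",
        "He knocked and waited.",
    ]
    print(first_acceptable(samples, looks_violent))
```

Filtering this way costs extra samples every time the predicate rejects one, which is one reason to want the constraint incorporated into the original model instead.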

Ought is an independent AI safety research organization led by Andreas Stuhlmüller and Jungwon Byun. They are researching methods for breaking up complex, hard-to-verify tasks into simpler, easier-to-verify tasks, with the aim of allowing us to maintain effective oversight over AIs.

Steven Byrnes is working on brain-like-AGI safety.

Vanessa Kosoy et al. are working on the learning-theoretic agenda.

There are also other research agendas and research groups not mentioned here.
