Research at the Alignment Research Center is led by Paul Christiano, best known for introducing the “Iterated Distillation and Amplification” and “Humans Consulting HCH” approaches. He and his team are now “trying to figure out how to train ML systems to answer questions by straightforwardly ‘translating’ their beliefs into natural language rather than by reasoning about what a human wants to hear.”

Chris Olah (after work at Google Brain and OpenAI) recently co-founded Anthropic, an AI lab focussed on the safety of large models. While his previous work was concerned with “transparency” and “interpretability” of large neural networks, especially vision models, Anthropic is focussing more on large language models, among other things working towards a “general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless”.

Stuart Russell and his team at the Center for Human-Compatible Artificial Intelligence (CHAI) have been working on inverse reinforcement learning (where the AI infers human values from observing human behavior) and corrigibility, as well as on attempts to disaggregate neural networks into “meaningful” subcomponents (see Filan et al.’s “Clusterability in neural networks” and Hod et al.’s “Detecting modularity in deep neural networks”).

Alongside the more abstract “agent foundations” work they have become known for, MIRI recently announced their “Visible Thoughts Project” to test the hypothesis that “Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.”

OpenAI have recently been working on recursively summarizing books (summarizing sections of the book, then summarizing those summaries, and so on) as a method for scaling human oversight.
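
The summarize-then-summarize-again loop can be sketched in a few lines of Python. This is only an illustration of the general idea, not OpenAI's implementation: the function names (`chunk_text`, `recursive_summarize`, `truncate_summarizer`) and the word-count parameters are hypothetical, and the "summarizer" here is a toy stand-in that simply keeps the first few words of each chunk where a real system would call a trained model.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def truncate_summarizer(chunk: str, keep: int = 10) -> str:
    """Toy stand-in for a learned summarizer: keep the first `keep` words."""
    return " ".join(chunk.split()[:keep])


def recursive_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_words: int = 50,
    target_words: int = 20,
) -> List[str]:
    """Summarize chunks, join the summaries, and repeat until short enough.

    Returns every intermediate level (original text first, final summary
    last), so that each step can be inspected against the level before it.
    """
    levels = [text]
    current = text
    while len(current.split()) > target_words:
        chunks = chunk_text(current, chunk_words)
        combined = " ".join(summarize(c) for c in chunks)
        # Assumption: stop if the summarizer fails to shrink the text,
        # otherwise the loop would never terminate.
        if len(combined.split()) >= len(current.split()):
            break
        current = combined
        levels.append(current)
    return levels
```

Keeping every intermediate level is a design choice made for this sketch: a reviewer never has to read the whole book to check a summary, only the shorter level directly beneath it, which is one way to read the claim that this approach scales human oversight.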

Stuart Armstrong’s recently launched AlignedAI are mainly working on concept extrapolation from familiar to novel contexts, something he believes is “necessary and almost sufficient” for AI alignment.

Redwood Research (Buck Shlegeris et al.) are trying to “handicap” GPT-3 to only produce non-violent completions of text prompts. “The idea is that there are many reasons we might ultimately want to apply some oversight function to an AI model, like ‘don’t be deceitful’, and if we want to get AI teams to apply this we need to be able to incorporate these oversight predicates into the original model in an efficient manner.”

Ought is an independent AI safety research organization led by Andreas Stuhlmüller and Jungwon Byun. They are researching methods for breaking up complex, hard-to-verify tasks into simpler, easier-to-verify tasks, with the aim of allowing us to maintain effective oversight over AIs.