What is everyone working on in AI alignment?

This page summarizes the alignment research of specific organizations. (See this page for an overview of four major types of alignment research.)

In addition to ELK, ARC’s work on alignment has included Iterated Distillation and Amplification (IDA) and the development of a framework for “formal heuristic arguments”, based on mechanistic interpretability and formal proof methods. It incubated the now-separate organization METR (Model Evaluation & Threat Research) (see below).

  • Anthropic (website) was founded in 2021 by former executives from OpenAI. It is known for developing the Claude family of large language models.

Anthropic's portfolio of approaches to AI alignment includes Constitutional AI (which Anthropic developed) as well as reinforcement learning from human feedback (RLHF), interpretability, and automated red-teaming.

Recent work by ALTER includes a proposal for measuring “stubbornness” in AI agents, and applying value alignment to the use of AI in legal settings.

  • The Cambridge Computational and Biological Learning Lab (website) is a lab located in the Department of Engineering at the University of Cambridge. Much of its alignment research comes out of the “Machine Learning Group”, which has “particular strengths in… Bayesian approaches to modeling and inference in statistical applications”.

The group's recent research has covered reward hacking, goal misgeneralisation, and interpretability.

  • The Center for AI Safety (CAIS) (website) is a San Francisco-based nonprofit directed by Dan Hendrycks, pursuing both technical and conceptual research alongside field building. It runs a compute cluster specifically for ML safety research. CAIS has done research on robustness, anomaly detection, and machine ethics, and has developed several foundational benchmarks for evaluating AI safety and capabilities. It organized the May 2023 Statement on AI Risk.

  • The Center for Human-Compatible AI (CHAI) (website) is a research group founded by Stuart Russell and based at UC Berkeley. CHAI works on developing “provably beneficial AI systems”, emphasizing representing uncertainty in AI objectives and getting AIs to defer to human judgment.

CHAI researchers have worked on corrigibility, preference inference, transparency, oversight, agent foundations, robustness, and more.

  • The Center on Long-Term Risk (website) is a research group focused on avoiding s-risk scenarios where AI agents deliberately cause great suffering due to cooperation failure or conflict. To this end, its research largely concerns game theory and decision theory.

  • The Centre for the Study of Existential Risk (CSER) (website) at the University of Cambridge focuses on interdisciplinary research to mitigate existential threats, including those from biotechnology, climate change, global injustice, and AI.

From 2018 to 2021, CSER's research focused on "generality" in AI, including the definition of generality, the relationship between generality and computing power, and the tradeoffs between generality and capability. CSER collaborates with the Leverhulme Centre for the Future of Intelligence on AI:FAR.

  • Conjecture (website) was formed from EleutherAI in 2022. Its alignment agenda focuses on building Cognitive Emulation (CoEm) — AI that emulates human reasoning processes so that its reasoning is more transparent. Conjecture's innovation lab division builds products, such as tools for efficient coding and human-like voice interaction. Conjecture has also done work on AI governance.

  • Elicit (website; blog) is an automated research assistant tool. The team building it spun off from Ought in September 2023. The Elicit team aims to advance AI alignment by using AI to “scale up good reasoning”, to arrive at “true beliefs and good decisions”. It aims to produce “process-based systems”, and to this end it has done research on factored cognition and task decomposition.

  • EleutherAI (website) is a non-profit research lab that started as a Discord server in 2020, created by Connor Leahy, Sid Black, and Leo Gao.

EleutherAI has primarily worked on training LLMs, and has released some LLMs which were the most capable at the time. It provides open access to these LLMs and their codebases. It also researches interpretability, corrigibility, and mesa-optimization.

  • Encultured AI (website) was founded by Andrew Critch and Nick Hay. From 2022 to 2023 it was a “video game company focused on enabling the safe introduction of AI technologies into [their] game world”, to provide a platform for AI safety and alignment solutions to be tested.

In 2024, Encultured shifted towards healthcare applications of AI.

  • FAR AI (website), the Fund for Alignment Research, works to “incubate and accelerate research agendas that are too resource-intensive for academia but not yet ready for commercialisation by industry”. Its alignment research has included adversarial robustness, interpretability and preference learning.

  • The Future of Humanity Institute (FHI) (website) is an Oxford University research center directed by Nick Bostrom. It has five research groups, which include AI safety, AI governance, and “digital minds” alongside macrostrategy and biosecurity. Its AI safety work has included identifying principles to guide AI behavior and detecting novel risks, alongside work on governance.

  • Google DeepMind (website) is a major AI lab. Its products include AlphaGo (which defeated top Go player Lee Sedol in 2016), AlphaFold (which predicts protein structures) and AlphaStar (which plays the video game StarCraft II). It also provides Gemini (formerly known as Bard), an LLM-based chatbot/assistant.

Google DeepMind's alignment research includes work on model evaluation, value learning, task decomposition, and robustness.

  • The Machine Intelligence Research Institute (MIRI) (website) began work on AI as the “Singularity Institute”. Originally founded by Eliezer Yudkowsky in 2000 with the aim of accelerating progress towards AGI, they are now known for their early shift to focus on AI existential safety, which raised awareness of the issue. They are also known for being pessimistic about existential risk from AI.

Their research has been “non-disclosed by default” since 2018, but they have two research agendas – a 2014 agent foundations agenda and a 2016 machine learning agenda.

Its alignment research spans many areas including interpretability, human-AI interaction, and multi-agent systems.

  • The NYU Alignment Research Group (website) is a research group led by Sam Bowman that overlaps and works with other groups at NYU. It does “empirical work with language models that aims to address longer-term concerns.” Its research agenda includes work on scalable oversight techniques such as debate, amplification, and recursive reward modeling; studying the behavior of language models; and designing experimental protocols that test for alignment.

  • Obelisk (website) is the AGI laboratory of Astera Institute. They focus on computational neuroscience, and work on developing “brain-like AI” – AI inspired by the architecture of human brains.

Their current work includes building a computational model based on neuroscience, conducting research that furthers neuroscience itself, building an evolutionary computation framework, and creating a training environment that scales in complexity.

  • Ought (website) is a California-based product-driven research lab. Elicit, an organization building an AI research assistant, was incubated at Ought. While building Elicit, Ought focused on factored cognition and on supervising LLM processes (instead of outcomes).

Their mission is to “scale up good reasoning” so that machine learning “help[s] as much with thinking and reflection as it does with tasks that have clear short-term outcomes”.

  • OpenAI (website) is a major AI lab, probably best known for introducing the generative pre-trained transformer (GPT) architecture for large language models (and for ChatGPT, an LLM chatbot). It has also created DALL-E (a text-to-image generator), Sora (a text-to-video generator), and a number of other generative AI models in other domains. OpenAI's alignment work focuses on human feedback, scalable oversight, and automating alignment research.

  • Orthogonal (website) is an EU-based research organization founded by Tamsin Leake, focusing on agent foundations. It works primarily on the “question-answer counterfactual interval” (QACI) alignment proposal.

  • Redwood Research (website) is led by Buck Shlegeris and Nate Thomas. It focuses on prosaic alignment techniques “motivated by theoretical arguments for how they might scale”, including work on AI control, interpretability, and “causal scrubbing”. Redwood has also run the machine learning bootcamp MLAB and the research program REMIX.

  • Timaeus (website) is an organization founded in 2023 to scope and pursue “developmental interpretability”. This agenda uses principles from Singular Learning Theory (SLT) to detect and interpret "phase transitions" — i.e., points during the training and/or scaling of a machine learning model where it appears to qualitatively change the way it thinks.