What is everyone working on in AI alignment?

12 min read

Suggest changes in Google Docs

This page summarizes the alignment research of specific organizations. (See this page for an overview of four major types of alignment research.)

Organizations

Independent research groups

Aligned AI (website) was cofounded by Stuart Armstrong and Rebecca Gorman. It has published research on concept extrapolation, detecting distribution shifts, goal misgeneralization, preference change, and value learning, as well as work on AI governance and strategy.
The Alignment Research Center (ARC) (website) is a research organization founded by Paul Christiano which does prosaic alignment research in areas including Eliciting Latent Knowledge (ELK), Iterated Amplification and Distillation (IDA), and developing a framework for “formal heuristic arguments”, based on mechanistic interpretability and formal proof methods. It used to include a division for evaluating model capabilities, called ARC Evals, which is now a separate organization called METR (see below).
Apollo Research (website) is an organization focused on developing evaluations for deception and other potentially misaligned behavior. They intend to develop a “holistic and far-ranging model evaluation suite that includes behavioural tests, fine-tuning, and interpretability approaches”. Besides research, they intend to assist lawmakers with their technical expertise in auditing and model evaluation.
The Association for Long Term Existence and Resilience (ALTER) (website) is an Israel-based academic research and advocacy organization focused on the long-term future, led by David Manheim. Its recent AI safety work includes a proposal for measuring “stubbornness” in AI agents, and applying value alignment to the use of AI in legal settings. ALTER also works on pandemics, health security, and progress studies.
The Center for AI Safety (CAIS) (website) is a San Francisco-based non-profit directed by Dan Hendrycks, pursuing both technical and conceptual research alongside field building. It runs a compute cluster specifically for ML safety research. CAIS has done literature reviews and research on robustness, anomaly detection, and machine ethics, and has developed several prominent benchmarks for evaluating AI safety and capabilities. It organized the May 2023 Statement on AI Risk.
The Center on Long-Term Risk (website) is a research group focused on avoiding s-risk scenarios where AI agents deliberately cause great suffering due to cooperation failures. To this end, its research largely concerns game theory and decision theory.
Conjecture (website) was formed from EleutherAI in 2022. Its alignment agenda focuses on building Cognitive Emulations (CoEms) — AI systems that emulate human reasoning processes with the intent that their reasoning will be more transparent while remaining competitive with frontier models. Conjecture's innovation lab division builds products, such as tools for efficient coding and human-like voice interaction. Conjecture has also done work on AI governance.
Elicit (website; blog) is an automated research assistant tool. The team building it spun off of Ought in September 2023. The Elicit team aims to advance AI alignment by using AI to “scale up good reasoning” to arrive at “true beliefs and good decisions”. It aims to produce “process-based systems”, and toward this end has done research on factored cognition and task decomposition.
EleutherAI (website) is a non-profit research lab that started as a discord server in 2020, created by Connor Leahy, Sid Black, and Leo Gao. EleutherAI has primarily worked on training LLMs, and has released some LLMs which were the most capable at the time. It provides open access to these LLMs and their codebases. It also researches interpretability, corrigibility, and mesa-optimization.
Encultured AI (website) was founded by Andrew Critch, and Nick Hay. From 2022 to 2023, it was a “video game company focused on enabling the safe introduction of AI technologies into [their] game world”, to provide a platform for AI safety and alignment solutions to be tested. In 2024, Encultured shifted towards healthcare applications of AI.
FAR AI (website), the Fund for Alignment Research, works to “incubate and accelerate research agendas that are too resource-intensive for academia but not yet ready for commercialisation by industry”. Its alignment research has included adversarial robustness, interpretability and preference learning.
The Machine Intelligence Research Institute (MIRI) (website) is an organization that, as of June 2024, is pivoting from technical research on the alignment of superintelligent systems to advocacy and governance work. MIRI was the first organization explicitly focused on solving the AI alignment problem, and influenced the development of the field of AI safety. MIRI's main research agendas were "Agent Foundations for Aligning Machine Intelligence with Human Interests" and "Alignment for Advanced Machine Learning Systems", which focus on highly-reliable agent design, value specification, and error tolerance.
METR (Model Evaluation and Threat Research) (website) is "a research nonprofit that works on assessing whether cutting-edge AI systems could pose catastrophic risks to society." Its work includes evaluating frontier models for autonomous capabilities, developing a standard set of tasks for evaluating AI capabilities, and consulting on responsible scaling policies.
Obelisk (website) is the AGI laboratory of Astera Institute. It focuses on computational neuroscience, and works on developing “brain-like AI” — AI inspired by the architecture of human brains. Its current work includes a computation model based on neuroscience, research furthering neuroscience itself, an evolutionary computation framework, and a training environment that scales in complexity.
Orthogonal (website) is an EU-based research organization founded by Tamsin Leake, focusing on agent foundations. It works primarily on the “question-answer counterfactual interval” (QACI) alignment proposal.
Ought (website) is a California-based product-driven research lab. Elicit, an organization building an AI research assistant, was incubated at Ought. While building Elicit they were focused on factored cognition and supervising LLM processes (instead of outcomes). Its mission is to “scale up good reasoning” so that machine learning “help[s] as much with thinking and reflection as it does with tasks that have clear short-term outcomes”.
Redwood Research (website) is led by Buck Shlegeris and Nate Thomas. It focuses on prosaic alignment techniques “motivated by theoretical arguments for how they might scale”, including work on AI control, interpretability, and “causal scrubbing”. Redwood has also run the machine learning bootcamp MLAB and the research program REMIX.
Timaeus (website) is an organization founded in 2023 to scope and pursue “developmental interpretability”. This agenda uses principles from Singular Learning Theory (SLT) to detect and interpret "phase transitions" — i.e., points during the training and/or scaling of a machine learning model where it appears to qualitatively change the way it thinks.

Academic research groups

The Cambridge Computational and Biological Learning Lab (website) is a lab located in the Department of Engineering at the University of Cambridge. Much of the alignment research comes out of the Machine Learning Group, which has produced work on reward hacking, goal misgeneralization, and interpretability.
The Center for Human-Compatible AI (CHAI) (website) is a research group founded by Stuart Russell and based at UC Berkeley. CHAI works on developing “provably beneficial AI systems”, emphasizing representing uncertainty in AI objectives and getting AIs to defer to human judgment. Its research includes work on corrigibility, preference inference, transparency, oversight, agent foundations, robustness, and more.
The Centre for the Study of Existential Risk (CSER) (website) at the University of Cambridge focuses on interdisciplinary research to mitigate existential threats, including those from biotechnology, climate change, global injustice, and AI. From 2018–2021, CSER's research focused on "generality" in AI, including the definition of generality, the relationship between generality and computing power, and the tradeoffs between generality and capability. CSER collaborates with the Leverhulme Centre for the Future of Intelligence on AI:FAR.
The Future of Humanity Institute (FHI) (website) was an Oxford University research center directed by Nick Bostrom, which is now defunct. It had five research groups, which included AI safety, AI governance, and “digital minds” alongside macrostrategy and biosecurity. Its AI safety work included work on identifying principles to guide AI behavior and detecting novel risks, alongside work on governance.
The MIT Algorithmic Alignment Group (website), led by Dylan Hadfield-Menell, is part of the Embodied Intelligence group at the MIT Computer Science and Artificial Intelligence Laboratory. Its alignment research spans many areas including interpretability, human-AI interaction, and multi-agent systems.
The NYU Alignment Research Group (website) is a research group led by Sam Bowman, which overlaps and works with other groups at NYU, that does “empirical work with language models that aims to address longer-term concerns.” Its research agenda includes work on scalable oversight like debate, amplification, and recursive reward modeling; studying the behavior of language models;, and design of experimental protocols that test for alignment.

AI Labs

Anthropic (website) was founded in 2021 by former executives from OpenAI. It is known for developing the Claude family of large language models. Anthropic's portfolio of approaches to AI alignment includes Constitutional AI (which Anthropic developed) as well as reinforcement learning from human feedback (RLHF), interpretability, activation steering, and automated red-teaming.
Google DeepMind (website) is a major AI lab. Its products include AlphaGo (which defeated top Go player Lee Sedol in 2016), AlphaFold (which predicts protein structures) and AlphaStar (which plays the video game StarCraft II). It also provides Gemini (formerly known as Bard), an LLM-based chatbot/assistant. Google DeepMind's alignment research includes work on model evaluation, value learning, task decomposition, and robustness.
OpenAI (website) is a major AI lab, probably best known for introducing the generative pre-trained transformer (GPT) architecture for large language models (and for ChatGPT, an LLM chatbot). It has also created DALL-E (a text-to-image generator), SORA (a text-to-video generator), and a number of other generative AI models in other domains. OpenAI's alignment work focuses on human feedback, scalable oversight, and automating alignment research.