Alignment research

There are many different research agendas in technical AI alignment. We outline here a few broad domains that the field has worked on: agent foundations, prosaic alignment, interpretability, and brain-based approaches.

Agent foundations

Agent foundations research studies agents – systems whose behavior can be understood in terms of their goals – in order to gain confidence that our AIs will do what we want even when they are highly capable. More abstractly, it studies the structure of decision-making systems that solve optimization problems. This agenda is motivated by the idea that in safety-critical applications, theoretical insight should come before application, and it aims to find mathematical formulations of concepts like optimization, goals, reasoning, embeddedness, and counterfactuals.

This type of research seeks formal proofs and guarantees, and it draws on many different fields, including mathematics, economics, decision theory, computer science, information theory, and evolutionary biology. Since the field does not yet have a consensus paradigm, much of its reasoning relies on analogy.
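
As a simplified illustration of what a mathematical formulation of a goal can look like, one textbook idealization (not a result specific to agent foundations) treats a goal as a utility function that the agent maximizes in expectation:

```latex
% A goal idealized as a utility function u over outcomes O, with rational
% action defined as picking, from the available actions \mathcal{A}, the one
% with the highest expected utility.
a^{*} = \arg\max_{a \in \mathcal{A}} \; \mathbb{E}\left[\, u(O) \mid a \,\right]
```

Much of agent foundations work asks where idealizations like this break down, for example when the agent is embedded in the environment it is reasoning about.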

Prosaic alignment

Prosaic alignment focuses on systems that are qualitatively similar to those in use today. It looks at aligning systems trained through deep learning, and tends to be more empirical than agent foundations. An example is the study of how simpler AI systems could help humans oversee and evaluate more advanced systems. Since this research is based on existing techniques, some of its alignment proposals can be tested on toy systems and used to evaluate commercial models for safety.
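
As a purely hypothetical sketch of the oversight idea above: a weaker model can be asked to critique a stronger model's answer, so that a human reviews the answer together with the critique rather than evaluating it unaided. The callables `strong_model` and `weak_overseer` below are placeholders, not a real API.

```python
from typing import Callable


def overseen_answer(
    question: str,
    strong_model: Callable[[str], str],
    weak_overseer: Callable[[str], str],
) -> dict:
    """Ask the strong model, then have the weaker model critique its answer."""
    answer = strong_model(question)
    critique = weak_overseer(
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "List any errors or unsupported claims in the proposed answer."
    )
    # A human (or a further automated check) reviews the answer alongside the
    # critique instead of having to evaluate the strong model's output unaided.
    return {"answer": answer, "critique": critique}
```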

Interpretability

Interpretability aims to make machine learning systems easier to understand. The most powerful systems in 2023 have billions of parameters and are trained on enormous data sets, resulting in models whose internal structure is opaque to their human developers. Interpretability research seeks to find meaningful patterns in these systems, for example by figuring out which input features are most relevant in determining a system's output. This is an area of active research; some findings include polytopes, polysemantic neurons, and circuits.
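
As a minimal sketch of the "which input features matter most" idea, assuming PyTorch and a toy two-layer network (a stand-in, not any particular deployed model), one common approach is to look at the gradient of the output with respect to the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in model: 10 input features -> 1 scalar output.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.randn(1, 10, requires_grad=True)  # one example input
model(x).sum().backward()                   # fills x.grad with d(output)/d(input)

# Features with larger absolute gradients are those the output is most
# sensitive to; this "saliency" ranking is a simple form of attribution.
saliency = x.grad.abs().squeeze()
print(saliency.argsort(descending=True))    # features ranked by sensitivity
```

Gradient-based saliency is only one of many attribution methods, and attribution is only one strand of interpretability; findings like polytopes and circuits come from studying a model's internal computations rather than its input-output sensitivity.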

Interpretability might be useful for alignment by helping people identify dangerous goals and prevent deception in models, and enabling certain kinds of regulation. One challenge for interpretability research is that it is also useful for designing more powerful systems. Thus, while it can help with alignment, it also risks shortening timelines to dangerously transformative AI by empowering capabilities researchers with tools to understand how to improve their systems.

Brain-based AI

There are numerous approaches involving the study of AI whose design is based on the human brain. Since the human brain already implements human values, the hope is that modeling an AI on the brain will make it easier to align with human values[1]. These approaches include whole brain emulation, which would copy a brain in detail in software form; “shard theory”, which hypothesizes a mechanism of human value learning and aims to use a similar mechanism in training an AI; and Steve Byrnes’s “brain-like AGI safety”, which explores systems that use reinforcement learning techniques similar to those present in the brain.


  1. Note that this doesn’t mean that it will be trivial to instill human values; it's plausible that an AI with a broadly similar structure to the human brain could still have a very different reward function. ↩︎