What is the Alignment Research Center (ARC)'s research agenda?

The Alignment Research Center (ARC) is a research organization "whose mission is to align future machine learning systems with human interests", founded by Paul Christiano. It focuses on theoretical alignment research. A former ARC project, ARC Evals, has now become the standalone organization METR (Model Evaluation & Threat Research).

Theoretical alignment research

ARC's theoretical research focuses on prosaic alignment[1]. It aims to create tools that empower humans to understand and guide systems that are much smarter than humans. Research directions ARC has pursued include:

  • Eliciting Latent Knowledge (ELK) is a research agenda that aims to find a method for getting an AI system to honestly communicate its "latent" knowledge to us.

    • Mechanistic anomaly detection is an approach to ELK that aims to detect anomalous reasoning, i.e., whether an AI model is producing its output for the "normal reason" or not. It does this by comparing the process that generates the current output against a previously identified mechanistic explanation of how the model typically produces that output.

    • One subproblem of ELK is ontology identification. This is the special case in which at least some of an AI system's reasoning uses "concepts" that don't match how humans would think in that circumstance, or that don't neatly correspond to human concepts at all. For example, an AI may reason about the world in terms of atoms while humans use much higher-level concepts, or, conversely, it might reason with higher-level abstractions than humans use.

  • Iterated distillation and amplification (IDA) is an approach that aims to use relatively weak, human-vettable AI systems to train and align slightly more powerful AI systems, then use those systems to train and align even more powerful systems, and so on.
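
The iterative structure of IDA can be illustrated with a toy sketch. This is not ARC's or Christiano's implementation; it is a minimal illustration under strong simplifying assumptions. The "task" is summing a list of integers, a "model" is a plain function that returns None when a task is beyond its ability, and the names make_base_model, amplify, distill, and ida are hypothetical. Amplification is modeled as an overseer splitting a task in two, delegating each half to the current model, and combining the answers; distillation, which in practice would mean training a fast model to imitate the amplified system, is stood in for by a simple wrapper.

```python
from typing import Callable, List, Optional

# A "model" maps a task (a list of ints to sum) to an answer,
# or None if the task is beyond its current capability.
Model = Callable[[List[int]], Optional[int]]


def make_base_model() -> Model:
    """Weak, easily vetted starting model: only handles one-element lists."""
    def base(xs: List[int]) -> Optional[int]:
        return xs[0] if len(xs) == 1 else None
    return base


def amplify(model: Model) -> Model:
    """Amplification (toy): an overseer splits the task in half, delegates
    each half to the current model, and combines the two answers."""
    def amplified(xs: List[int]) -> Optional[int]:
        if len(xs) <= 1:
            return model(xs)
        mid = len(xs) // 2
        left, right = model(xs[:mid]), model(xs[mid:])
        if left is None or right is None:
            return None  # a subtask was too hard for the current model
        return left + right
    return amplified


def distill(amplified: Model) -> Model:
    """Distillation (stand-in): a real system would train a faster model to
    imitate the amplified one; here we simply wrap it unchanged."""
    def distilled(xs: List[int]) -> Optional[int]:
        return amplified(xs)
    return distilled


def ida(rounds: int) -> Model:
    """Alternate amplification and distillation for a number of rounds."""
    model = make_base_model()
    for _ in range(rounds):
        model = distill(amplify(model))
    return model
```

In this toy setting each round doubles the longest list the model can handle, so ida(3) can sum lists of up to eight elements while ida(2) cannot sum a five-element list. The sketch captures only the bootstrapping loop; it says nothing about whether alignment is preserved across rounds, which is the hard part of the actual proposal.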

METR (Model Evaluation & Threat Research)

METR is an organization incubated at ARC that became a standalone non-profit in December 2023. METR works on assessing whether specific cutting-edge AI systems could pose catastrophic risks to civilization. It partners with AI companies (such as Anthropic and OpenAI) to evaluate those companies' models before release. For example, METR (then operating as ARC Evals) red-teamed GPT-4 before it was released. The evaluation found some concerning behavior (for example, the model hired a human worker to solve a CAPTCHA) but did not conclude that the model's capabilities were at a level that could pose an existential risk.

A core focus of METR's evaluations is whether an AI system would be able to replicate autonomously: for instance, would it be able to formulate and execute a plan to hide on a cloud server and acquire money and other resources that would allow it to make more copies of itself?

  1. Christiano explains his methodology on his blog. ↩︎