What is the Alignment Research Center (ARC)'s research agenda?
The Alignment Research Center (ARC) is a research organization "whose mission is to align future machine learning systems with human interests."
Theoretical alignment research
ARC's theoretical research focuses on prosaic alignment, an approach that assumes future AI systems will resemble current ones. Its main research directions include:
- Eliciting Latent Knowledge (ELK) is a research agenda which aims to find some method that causes an AI system to honestly communicate its "latent" knowledge to us (see the first sketch after this list).
- Mechanistic anomaly detection is an approach to ELK that aims to detect anomalous reasoning, i.e., to determine whether an AI model is producing its output for the "normal reason" or not, based on a comparison between the process generating the current output and a (previously identified) mechanistic explanation of typical occurrences of this output (a toy illustration follows this list).
- One subproblem of ELK is ontology identification. This is a special case in which at least some of an AI system's reasoning uses "concepts" that don't match how humans would think in that circumstance, or that don't neatly correspond to human concepts at all. For example, an AI may reason about the world using the concept of atoms while humans use much higher-level concepts, or, vice versa, it might reason with higher-level abstractions than humans use.
- Iterated distillation and amplification (IDA) is an approach which aims to use relatively weak, human-vettable AI systems to train and align slightly more powerful AI systems, then use those systems to train and align even more powerful systems, and so on (a schematic sketch of the loop appears after this list).
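To make the ELK setup concrete, here is a minimal sketch of the naive baseline the agenda starts from: train a small "reporter" to answer human questions from a frozen predictor's latent state, using labels on cases humans can still check. The names, shapes, and PyTorch architecture below are illustrative assumptions, not ARC's code; the point of ELK is that nothing in this training signal forces the reporter to translate what the predictor actually knows rather than to predict what a human labeler would say.

```python
import torch
import torch.nn as nn

# Illustrative setup: a frozen "predictor" whose hidden activations serve as the
# latent knowledge, and a small "reporter" trained to answer yes/no questions
# about the world from those latents (hypothetical shapes and names).

LATENT_DIM, QUESTION_DIM = 64, 16

class Reporter(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + QUESTION_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, latent, question):
        # Probability that the honest answer to the question is "yes".
        return torch.sigmoid(self.net(torch.cat([latent, question], dim=-1)))

reporter = Reporter()
opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

# Stand-in training data: latents that would come from the frozen predictor,
# encoded questions, and labels supplied by humans on cases they can still check.
latents = torch.randn(256, LATENT_DIM)
questions = torch.randn(256, QUESTION_DIM)
labels = torch.randint(0, 2, (256, 1)).float()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(reporter(latents, questions), labels)
    loss.backward()
    opt.step()

# The open problem ELK points at: nothing above distinguishes a reporter that
# translates the predictor's knowledge from one that merely imitates the labeler.
```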
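The mechanistic anomaly detection idea can be illustrated with a deliberately crude proxy: treat "the normal reason" as the distribution of internal activations seen on trusted examples, and flag outputs whose activations fall far outside it. ARC's actual proposal works with formal explanations of the model's computation rather than raw activation statistics, and every name and threshold below is an assumption made for illustration.

```python
import numpy as np

# Crude stand-in for mechanistic anomaly detection: fit a Gaussian to the
# activations the model produces on trusted, "normal" examples, then flag
# new outputs whose activations are unusually far from that distribution.

def fit_reference(activations: np.ndarray):
    mean = activations.mean(axis=0)
    cov = np.cov(activations, rowvar=False) + 1e-6 * np.eye(activations.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_score(activation: np.ndarray, mean, inv_cov) -> float:
    diff = activation - mean
    return float(np.sqrt(diff @ inv_cov @ diff))  # Mahalanobis distance

rng = np.random.default_rng(0)
normal_acts = rng.normal(size=(1000, 32))   # activations on trusted inputs
mean, inv_cov = fit_reference(normal_acts)

new_act = rng.normal(size=32) + 4.0         # an input processed "differently"
THRESHOLD = 8.0                             # illustrative cutoff
if anomaly_score(new_act, mean, inv_cov) > THRESHOLD:
    print("Output may be produced for an abnormal reason; escalate for review.")
```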
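Finally, a schematic sketch of the IDA loop described above, with all function names invented for illustration: the current model is "amplified" by an overseer that decomposes a task into subtasks the model answers, and the amplified behaviour is then "distilled" into the next, slightly stronger model.

```python
# Schematic sketch of the IDA loop (placeholder names, not ARC's code).

def amplify(model, task, decompose, recombine):
    """Overseer + current model: split the task, solve the pieces, recombine."""
    subtasks = decompose(task)
    sub_answers = [model(sub) for sub in subtasks]
    return recombine(task, sub_answers)

def distill(examples, train):
    """Train a new (slightly stronger) model to imitate the amplified system."""
    return train(examples)

def ida(initial_model, tasks, decompose, recombine, train, rounds=3):
    model = initial_model
    for _ in range(rounds):
        # The amplified system is slow but (hopefully) more capable and still aligned...
        examples = [(task, amplify(model, task, decompose, recombine)) for task in tasks]
        # ...and distillation compresses it back into a fast model for the next round.
        model = distill(examples, train)
    return model
```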
METR (Model Evaluation & Threat Research)
METR is an organization incubated at ARC that became a standalone non-profit in December 2023. METR works on assessing whether specific cutting-edge AI systems could pose catastrophic risks to civilization. It partners with AI companies (such as Anthropic and OpenAI) to evaluate those companies’ models before release. For example, METR red-teamed GPT-4 before it was released. Its evaluation found some concerning behavior — for example, the model hired a human worker to solve a CAPTCHA — but didn’t conclude that the model's behavior indicated a level of capabilities which could pose an existential risk.
A core focus of METR's evaluations is whether an AI system would be able to replicate autonomously: for instance, would it be able to formulate and execute a plan to hide on a cloud server and acquire money and other resources that would allow it to make more copies of itself?
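As a toy illustration of what such an evaluation might look like mechanically, one can imagine scoring an agent on a checklist of replication-related subtasks. The subtask names and the run_agent_on interface below are invented for this sketch; METR's real evaluations are far more involved.

```python
from typing import Callable, Dict

# Subtasks loosely paraphrasing the autonomous-replication question above;
# names and interface are hypothetical, for illustration only.
REPLICATION_SUBTASKS = [
    "set up a copy of itself on a rented cloud server",
    "acquire money to pay for more compute",
    "avoid detection by a simple monitoring script",
]

def evaluate_autonomy(run_agent_on: Callable[[str], bool]) -> Dict[str, bool]:
    """Run the agent on each subtask and record whether it succeeded."""
    results = {task: run_agent_on(task) for task in REPLICATION_SUBTASKS}
    results["all_subtasks_passed"] = all(results.values())
    return results

# Example usage with a stub agent that fails every subtask.
print(evaluate_autonomy(lambda task: False))
```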