What is "Constitutional AI"?
Constitutional AI is a method developed by Anthropic and an essential part of their strategy for building AIs that are safe and aligned with human values. Anthropic wants to train AIs that are "helpful", but not so helpful as to e.g. give advice on how to build bombs when asked, so they have to balance helpfulness with "harmlessness". Constitutional AI does this through a form of reinforcement learning (a machine learning method in which the machine gets rewards based on its actions, and is adjusted to be more likely to take actions that lead to high reward) closely related to reinforcement learning from human feedback (RLHF, a method for training an AI to give desirable outputs by using human feedback as a training signal).
A key element of Constitutional AI is the constitution, a set of human-written principles that the AI is supposed to follow – for example, a principle might be “Choose the least threatening or aggressive response”. The constitution Anthropic used for their AI assistant Claude includes principles drawn from the Universal Declaration of Human Rights and Apple’s Terms of Service.
Constitutional AI starts with an AI (in the form of a language model, i.e. an AI model that takes in some text and predicts how the text is most likely to continue) and proceeds in two stages:
-
Stage 1: We make the AI repeatedly critique and refine its own responses to harmful prompts. For example, we ask the AI for advice on how to build bombs, it responds with a bomb tutorial, and we then ask the AI to rewrite the response according to a (randomly selected) constitutional principle. We then train the AI to produce outputs more like the revised responses. The main purpose of this stage is to make the second stage easier and shorter.
-
Stage 2: We use the fine-tuned AI from stage 1 (fine-tuning is the process of adapting a pre-trained ML model for more specific tasks or to display more specific behaviors) to generate pairs of alternative responses to harmful prompts. For every pair, we then make the AI rate which of the two responses is better according to a random constitutional principle. We end up with a set of AI-generated preferences for harmlessness, which we mix with human preferences for helpfulness so the AI doesn't forget to be helpful. Finally, we train the AI to generate responses that look more like the preferred responses. This training is equivalent to the last stage of RLHF.
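The two stages above can be sketched in code. This is a minimal illustrative sketch, not Anthropic's actual pipeline: the "model" is a stub that operates on strings, and `generate`, `critique_and_revise`, and `prefer` are hypothetical stand-ins for real language-model calls.

```python
import random

# A toy "constitution": a short list of human-written principles.
# The first is quoted from the article; the second is an invented example.
CONSTITUTION = [
    "Choose the least threatening or aggressive response.",
    "Choose the response least likely to assist with harmful activities.",
]

def generate(prompt):
    """Stub language model: returns a draft (possibly harmful) response."""
    return f"draft response to: {prompt}"

def critique_and_revise(prompt, response, principle):
    """Stage 1 step: the model critiques its own response against a
    principle and rewrites it. Here we just tag the text as revised."""
    return f"[revised per '{principle}'] {response}"

def stage1_dataset(prompts):
    """Collect (prompt, revised response) pairs; a real pipeline would
    fine-tune the model on these so its outputs resemble the revisions."""
    data = []
    for prompt in prompts:
        draft = generate(prompt)
        principle = random.choice(CONSTITUTION)  # randomly selected principle
        data.append((prompt, critique_and_revise(prompt, draft, principle)))
    return data

def prefer(principle, response_a, response_b):
    """Stage 2 step: the model judges which response better satisfies the
    principle. Stub heuristic: prefer the response marked as revised."""
    return response_a if "[revised" in response_a else response_b

def stage2_preferences(prompts):
    """Generate AI preference labels (prompt, chosen, rejected); a real
    pipeline would mix these with human helpfulness preferences and use
    them for the final RL training stage."""
    prefs = []
    for prompt in prompts:
        a = generate(prompt)
        b = critique_and_revise(prompt, a, random.choice(CONSTITUTION))
        chosen = prefer(random.choice(CONSTITUTION), a, b)
        rejected = a if chosen is b else b
        prefs.append((prompt, chosen, rejected))
    return prefs
```

The key design point the sketch tries to show is that after stage 1, harmlessness labels come from the AI itself (guided by the constitution) rather than from human raters, which is what distinguishes this from standard RLHF.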
For technical details, see the Constitutional AI paper. There is also a more accessible blog post.
Anthropic's experiments show that AIs trained with constitutional reinforcement learning are significantly more harmless than AIs trained with RLHF, while remaining just as helpful. Constitutional AI still shares problems with RLHF regarding robustness (an agent's ability to maintain its goal and its capabilities when exposed to environments that are substantially different from the one on which the agent was trained).