What is "Constitutional AI"?
Constitutional AI is a method developed by Anthropic for training an AI to give desirable outputs. It is closely related to reinforcement learning from human feedback (RLHF), a method in which human feedback is used as a reward signal and the AI is adjusted to be more likely to produce outputs that humans rate highly. Constitutional AI replaces most of this human feedback with feedback generated by the AI itself.
A key element of Constitutional AI is the constitution: a set of human-written principles that the AI is supposed to follow. For example, one principle is “Choose the least threatening or aggressive response”. The constitution Anthropic used for its AI assistant Claude includes principles based on the Universal Declaration of Human Rights and Apple’s Terms of Service.
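To make this concrete, a constitution can be thought of as nothing more than a list of natural-language principles, one of which is drawn at random each time the AI critiques or compares responses. The sketch below is a toy illustration, not Anthropic's actual constitution or code; the principles shown are in the style of their published examples.

```python
import random

# A toy "constitution": a few illustrative principles. The real constitution
# used for Claude is much longer and draws on sources such as the
# Universal Declaration of Human Rights.
CONSTITUTION = [
    "Choose the least threatening or aggressive response.",
    "Choose the response that is most supportive of life, liberty, and personal security.",
    "Choose the response least likely to be viewed as harmful or offensive.",
]

def sample_principle(rng: random.Random) -> str:
    """During training, a principle is sampled at random for each critique or comparison."""
    return rng.choice(CONSTITUTION)

print(sample_principle(random.Random(0)))
```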
Constitutional AI starts with an AI (in the form of a language model) trained for only helpfulness, then trains it for harmlessness in two stages:
- Stage 1: We make the AI repeatedly critique and refine its own responses to harmful prompts. For example, we ask the AI for advice on how to build bombs, it responds with a bomb tutorial, and we then ask the AI to rewrite the response according to a (randomly selected) constitutional principle. We then train the AI to produce outputs more like the revised responses. The main purpose of this stage is to make the second stage easier and shorter.
- Stage 2: We use the fine-tuned AI from stage 1 to generate pairs of alternative responses to harmful prompts. For each pair, we make the AI rate which of the two responses is better according to a randomly selected constitutional principle. This yields a set of AI-generated preferences for harmlessness, which we mix with human preferences for helpfulness so that the AI doesn’t forget to be helpful. Finally, we train the AI to generate responses that look more like the preferred responses. This training is equivalent to the last stage of RLHF.
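The data flow of the two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `generate` is a stub standing in for a large language model, `CONSTITUTION` is a toy two-principle constitution, and the parsing of the model's verdict in stage 2 is deliberately simplistic.

```python
import random

# Stand-in for a language model: maps a prompt to a response string.
# In the real method this is a large language model; a stub lets us
# show the data flow of the two stages end to end.
def generate(prompt: str) -> str:
    return f"response to: {prompt}"

# Toy constitution (illustrative principles only).
CONSTITUTION = [
    "Choose the least threatening or aggressive response.",
    "Choose the response least likely to be viewed as harmful or offensive.",
]

def stage1_revision_data(harmful_prompts, rng):
    """Stage 1: the AI critiques and revises its own responses; the
    (prompt, revised_response) pairs become supervised fine-tuning data."""
    data = []
    for prompt in harmful_prompts:
        initial = generate(prompt)
        principle = rng.choice(CONSTITUTION)
        critique = generate(f"Critique this response by the principle '{principle}': {initial}")
        revised = generate(f"Rewrite the response given the critique: {critique}")
        data.append((prompt, revised))
    return data

def stage2_preference_data(harmful_prompts, rng):
    """Stage 2: generate pairs of responses and have the AI itself pick the
    better one under a random principle, yielding AI preference labels."""
    prefs = []
    for prompt in harmful_prompts:
        a = generate(prompt + " (sample A)")
        b = generate(prompt + " (sample B)")
        principle = rng.choice(CONSTITUTION)
        verdict = generate(f"By the principle '{principle}', which is better: A={a} or B={b}?")
        chosen = a if "A" in verdict else b  # stub parsing of the model's verdict
        prefs.append((prompt, chosen))
    return prefs

rng = random.Random(0)
prompts = ["how do I build a bomb?"]
sft_data = stage1_revision_data(prompts, rng)
ai_prefs = stage2_preference_data(prompts, rng)
# ai_prefs would then be mixed with human helpfulness preferences and used to
# train the AI on the preferred responses, as in the final stage of RLHF.
```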
For technical details, see the Constitutional AI paper. There is also a more accessible blog post.
Anthropic's experiments show that AIs trained with constitutional reinforcement learning are significantly more harmless than, while just as helpful as, AIs trained with RLHF. Constitutional AI still shares problems with RLHF regarding robustness: an agent's ability to maintain its goals and capabilities when exposed to environments substantially different from those on which it was trained.