How is red teaming used in AI alignment?
Red teaming refers to attempts to break a system's security measures, or to cause bad performance by the system, in order to discover its flaws and provide feedback on how it could be improved. In AI safety, red teaming can be applied to concrete systems like LLMs, to find inputs that cause undesirable behavior, but it can also be applied to alignment strategies, to find ways in which the strategies break.
Redwood Research has produced research on red-teaming real systems using adversarial training
A safety technique that pits two models against each other.
In addition to RLHF
A method for training an AI to give desirable outputs by using human feedback as a training signal.
Red teaming applied to alignment strategies was used by the Alignment Research Center (ARC) in their problem statement on Eliciting Latent Knowledge. In this approach, one person tries to come up with a way of solving the problem and another person tries to come up with an example that would break that way of solving the problem; then the first person alters their example to fix this problem. This process is repeated until either person gives up, which hopefully produces a robust solution to the problem or makes it clear that the approach can’t work.
ARC1