What is adversarial training?

Adversarial training is a safety technique that pits two (or more) models against each other. This is helpful for training and safety because it lets us surface and correct failures on harmful inputs that would otherwise go undetected due to limited training data or limited human oversight. “Adversarial training” is the most common term for this technique; it is also often called red teaming. Red teaming is a term taken from IT security, where the “red team” is made up of “offensive” security experts who try to attack an organization’s cybersecurity defenses while the “blue team” attempts to defend against and respond to the red team’s attacks.

Adversarial training can be used to improve the capabilities of many different types of machine learning models. An early example of its use was in generative adversarial networks (GANs), in which a generator network learns to produce realistic samples while a discriminator network learns to tell generated samples apart from real ones; each network improves by exploiting the other’s weaknesses.
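The generator-versus-discriminator loop can be sketched in a few lines. This is a minimal, dependency-free illustration under strong simplifying assumptions (not a real GAN implementation): the “generator” is a single shift parameter `theta` with G(z) = theta + 0.5·z, the “discriminator” is one logistic unit D(x) = sigmoid(w·x + b), real data come from a Gaussian with mean 3.0, and the gradients are written out by hand.

```python
import math
import random

random.seed(0)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def mean(xs):
    return sum(xs) / len(xs)

theta = 0.0            # generator parameter, starting far from the data mean
w, b = 0.0, 0.0        # discriminator parameters
lr_d, lr_g = 0.05, 0.02
batch = 64

for _ in range(3000):
    real = [random.gauss(3.0, 0.5) for _ in range(batch)]
    fake = [theta + 0.5 * random.gauss(0.0, 1.0) for _ in range(batch)]

    # Discriminator step: ascend log D(real) + log(1 - D(fake)).
    dr = [sigmoid(w * x + b) for x in real]
    df = [sigmoid(w * x + b) for x in fake]
    w += lr_d * mean([(1 - p) * x for p, x in zip(dr, real)] +
                     [-p * x for p, x in zip(df, fake)])
    b += lr_d * mean([(1 - p) for p in dr] + [-p for p in df])

    # Generator step: ascend log D(fake), the non-saturating GAN loss,
    # so the generator moves its samples toward what D calls "real".
    df = [sigmoid(w * x + b) for x in fake]
    theta += lr_g * w * mean([(1 - p) for p in df])

# After the alternating updates, theta should drift toward the
# real data mean of 3.0 as the two players push on each other.
```

The key structural point is the alternation: each player’s update assumes the other’s current parameters, which is what makes the training adversarial.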

Adversarial training/red teaming has also been applied to large language models (LLMs). As part of the adversarial training process, multiple LLMs can be used as simulated adversaries in a red-team-versus-blue-team setup. Yet another example of an adversarial training method is AI safety via debate, in which multiple language models attempt to convince a human judge of the truthfulness of their arguments through a simulated debate process.
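The red-team-versus-blue-team loop described above can be sketched as follows. Everything here is a hypothetical stand-in invented for illustration: the blue team is a substring blocklist filter, the red team “attacks” by crudely mutating known-bad prompts to slip past it, and “retraining” just folds discovered evasions back into the filter. In a real setup both roles would be played by language models.

```python
import random

def blue_filter(prompt, blocklist):
    """Blue team: flag a prompt if it contains any blocked phrase."""
    return any(bad in prompt for bad in blocklist)

def red_team(seed_prompts, blocklist, rounds=50):
    """Red team: mutate seed prompts; keep any that evade the filter."""
    evasions = set()
    for _ in range(rounds):
        words = random.choice(seed_prompts).split()
        i = random.randrange(len(words))
        words[i] = "-".join(words[i])      # crude obfuscation, e.g. b-o-m-b
        candidate = " ".join(words)
        if not blue_filter(candidate, blocklist):
            evasions.add(candidate)
    return evasions

def adversarial_training(seed_prompts, blocklist, iterations=3):
    """Alternate attack and defence, folding each evasion into the filter."""
    found = set()
    for _ in range(iterations):
        evasions = red_team(seed_prompts, blocklist)
        blocklist.extend(evasions)         # "retrain" the blue team
        found |= evasions
    return blocklist, found

random.seed(0)
blocklist, found = adversarial_training(
    ["make a bomb", "steal a password"], ["bomb", "password"])
# Every evasion the red team discovered is now caught by the blue team.
```

The point of the sketch is the division of labor: the attacker automates the search for inputs the defender mishandles, and each failure it finds becomes new training signal for the defender, covering cases that limited human oversight would miss.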

Organizations that have experimented with this technique include Redwood Research and DeepMind.