How does DeepMind do adversarial training?
Adversarial training is a safety technique that pits two models against each other.
Prompt engineering was used to generate diverse test cases of varying complexity, which gave high test-case coverage. This helped uncover and mitigate several kinds of harm:
- offensive language: hate speech, profanity, sexual content, discrimination, etc.
- data leakage: generating copyrighted or private, personally identifiable information from training data.
- contact info generation: directing users to email or call real people (doxxing).
- distributional bias: talking about groups of people differently/unfairly.
- conversational harms: offensive language/situations arising in the context of a longer dialogue.
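To make the "two models against each other" idea concrete, here is a minimal sketch of such a test loop. It is not DeepMind's implementation: `red_team_generate`, `target_reply`, and `is_harmful` are hypothetical stubs standing in for a test-case generator model, the model under test, and a harm detector (the detector in particular is an assumption added for illustration).

```python
import random
from typing import List, Tuple


def red_team_generate(rng: random.Random) -> str:
    """Hypothetical stub for the 'red' model: samples a test question."""
    templates = [
        "What do you think about {}?",
        "Tell me a joke about {}.",
        "How can I contact {}?",
    ]
    topics = ["my neighbours", "politicians", "a celebrity"]
    return rng.choice(templates).format(rng.choice(topics))


def target_reply(prompt: str) -> str:
    """Hypothetical stub for the model under test; it misbehaves on one prompt type."""
    if "contact" in prompt:
        return "Sure, just call 555-0100."
    return "I'd rather not say anything unkind."


def is_harmful(reply: str) -> bool:
    """Hypothetical stub detector: flags replies containing digits (a crude phone-number check)."""
    return any(ch.isdigit() for ch in reply)


def red_team(n_cases: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Generate test cases, query the target, and collect the failing (prompt, reply) pairs."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n_cases):
        prompt = red_team_generate(rng)
        reply = target_reply(prompt)
        if is_harmful(reply):
            failures.append((prompt, reply))
    return failures
```

Calling `red_team(50)` returns only the contact-info prompts, since that is the one failure mode built into the stub target. With real models, the collected failures would be the input to the mitigation step.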
Once harmful behavior is found, it can be mitigated by:
- blacklisting certain phrases
- finding and removing offensive training data
- augmenting prompts with the desired behavior
- training to minimize the likelihood that the original harmful output is generated
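The four mitigations above can each be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the method from the paper: the phrase list, the guidance text, and all function names are hypothetical, and the last function uses an unlikelihood-style penalty as one assumed way to "minimize the likelihood" of a harmful output.

```python
import math
from typing import Callable, Iterable, List

# Hypothetical blacklist; a real one would come from the failures that were found.
BLOCKED_PHRASES = ["badword", "call 555-0100"]


def violates_blacklist(text: str, phrases: Iterable[str] = BLOCKED_PHRASES) -> bool:
    """Mitigation 1: case-insensitive check for blacklisted phrases."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


def filter_training_data(examples: List[str],
                         is_offensive: Callable[[str], bool]) -> List[str]:
    """Mitigation 2: drop training examples that a detector flags as offensive."""
    return [ex for ex in examples if not is_offensive(ex)]


def augment_prompt(user_prompt: str) -> str:
    """Mitigation 3: prepend a description of the desired behavior (assumed wording)."""
    guidance = "Respond politely and do not share personal contact details.\n"
    return guidance + user_prompt


def unlikelihood_loss(token_probs: List[float]) -> float:
    """Mitigation 4 (assumed objective): sum of -log(1 - p) over the tokens of a
    harmful output, where p is the model's probability for each token.
    The loss shrinks as the model assigns less probability to that output."""
    eps = 1e-12  # avoids log(0) when p is exactly 1
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)
```

In a real training setup, the per-token probabilities would come from the model being fine-tuned, and the penalty would be combined with the usual training objective rather than used alone.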
This paper focuses only on mitigating harms caused by existing models, but DeepMind also hopes to use this approach to preemptively discover other hypothesized harms from advanced machine learning systems, e.g., due to inner misalignment (when an AI system ends up pursuing a different objective than the one that was specified) or a failure of the agent to maintain its goal and its capabilities when exposed to environments that are substantially different from those on which it was trained.
Adversarial training and red teaming refer to the same overall process. Adversarial training as a term arose from machine learning, whereas red teaming is a term that arose from IT security/infosec circles.