What are obfuscated arguments in AI Safety via Debate?
Obfuscated arguments are a potential impediment to techniques like AI safety via debate and iterated distillation and amplification.
Debate relies on the ability of a human observer to distinguish a good argument from a bad one. An obfuscated argument is one where the debaters know that one of the premises is false, but neither knows which. In this case, the deceptive side can give a long, complex argument, and its opponent will be unable to convince the human evaluator that anything is wrong, since it cannot point to a specific false premise. All it can offer is a statistical argument: one of the steps must be false, so each individual step has only a small probability of being the flawed one.
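To see why the honest debater is stuck here, it helps to make the statistics concrete. The sketch below is illustrative only; the step count, checking budget, and function name are assumptions rather than figures from any particular debate experiment. It shows that if a long argument contains exactly one false step, a verifier who can only spot-check a few steps will almost never land on the flawed one.

```python
import random

def chance_of_catching_flaw(n_steps: int, n_checks: int, trials: int = 100_000) -> float:
    """Estimate the probability that spot-checking `n_checks` randomly chosen
    steps of an `n_steps`-step argument finds the single flawed step.
    (Hypothetical illustration; parameters are arbitrary assumptions.)"""
    caught = 0
    for _ in range(trials):
        flawed_step = random.randrange(n_steps)            # exactly one step is false
        checked = random.sample(range(n_steps), n_checks)  # steps the verifier inspects
        if flawed_step in checked:
            caught += 1
    return caught / trials

# With a 1,000-step argument and a budget of 5 spot-checks, the flaw is found
# only about 0.5% of the time, even though we are certain it exists somewhere.
print(chance_of_catching_flaw(n_steps=1000, n_checks=5))
```

The exact probability here is simply n_checks / n_steps (5/1000 = 0.005), which is why pointing to "some step is wrong" carries almost no weight against any particular step.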
Unfortunately, we cannot simply rule out long, complex arguments, since they are sometimes the only way we know how to establish the truth; yet an opponent can always claim, "one of the steps is false, but I don't know which," even against an honest argument.
This is potentially important since debate is one of the core techniques that OpenAI plans to use for alignment, although they think there are ways to solve this problem.