What is the difference between sycophancy and deceptive alignment?
"Deceptive alignment" refers to when an AI that has not adopted the intended goal but still acts according to it, while sycophancy is when the AI has actually adopted the goal but goes about fulfilling it in the wrong way.
Ajeya Cotra provides an intuitive example by exploring three archetypes that an AI might embody. She explains this through the analogy of a young child (humanity) who has inherited a company to run and has a choice of the following three AI candidates:
- Saints: Genuinely just want to help manage your estate well and look out for your long-term interests.
- Sycophants: Want to do whatever it takes to make you happy in the short term or satisfy the letter of your instructions, regardless of long-term consequences.
- Schemers: Have their own agendas and want to get access to your company and all its wealth and power so they can use it however they want.
Sycophants want to see you happy even if it means lying to you: they would report false facts about the world to convince you that things were going well, but they have no long-term ulterior motive. Schemers, by contrast, are an example of deceptive alignment: they have an ulterior motive, something of their own they want to accomplish, and they actively try to look like they are doing the right thing in order to accomplish it. During training, Sycophants and Schemers are behaviorally indistinguishable from Saints (i.e. models that are actually trying to do what we want).
While deceptive alignment and sycophancy are separate problems, they both need to be solved in order to make an AI system truly aligned.
Alternative Phrasing
- What is the difference between dishonesty and deception in AI models?
- What is the difference between deception and deceptive alignment in AI models?