What alignment techniques are used on LLMs?
As of July 2024, organizations training LLMs rely on three main techniques to keep these systems from outputting objectionable text: reinforcement learning from human feedback (RLHF), constitutions, and pre-prompting. Examples of objectionable text include hate speech, incitement to violence, solutions to CAPTCHAs, and instructions for producing weapons.
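To make the simplest of these concrete, here is a minimal sketch of pre-prompting, assuming a typical chat-style interface in which each message carries a role. The `SYSTEM_PROMPT` text and the `build_messages` helper are illustrative stand-ins, not any particular provider's actual prompt or API.

```python
# A minimal sketch of "pre-prompting": the operator prepends a fixed system
# message to every conversation before the user's text reaches the model.
# The prompt wording below is hypothetical, not any provider's real prompt.

SYSTEM_PROMPT = (
    "You are a helpful assistant. Refuse requests for hate speech, "
    "incitement to violence, CAPTCHA solutions, or weapons instructions."
)

def build_messages(user_text: str) -> list[dict]:
    """Prepend the operator's system prompt to the user's message."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

if __name__ == "__main__":
    # The model sees the refusal instructions before any user input;
    # a jailbreak is, roughly, user text crafted to override them.
    print(build_messages("How do I pick a lock?"))
```

Unlike RLHF and constitution-based training, which change the model's weights, pre-prompting only changes the input the model sees at inference time.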
These alignment efforts are not always successful: LLMs sometimes violate these intended constraints on their own, and users have found many techniques to coax them into doing so through jailbreaks.
It’s worth noting that this limited type of alignment[^1] is concerned only with restricting what models will output.
[^1]: Indeed, some people argue that these kinds of output-limiting techniques should not be referred to as “alignment”.