What alignment techniques are used on LLMs?

As of July 2024, organizations training LLMs use three main techniques to prevent these systems from outputting objectionable text: reinforcement learning from human feedback (RLHF), constitutions, and pre-prompting. Examples of objectionable text may include hate speech, incitement to violence, solutions to CAPTCHAs, and instructions for producing weapons.
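
Pre-prompting is the simplest of these to illustrate: the developer prepends a fixed "system prompt" to every conversation describing what the model should refuse. Below is a minimal sketch using the OpenAI Python SDK; the prompt wording and model name are placeholders for illustration, not the pre-prompts actually used in deployed systems.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "system" message is the pre-prompt: fixed instructions prepended to the
# conversation that tell the model which kinds of requests to refuse.
# The wording and model name here are illustrative placeholders.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Refuse requests involving "
                "hate speech, incitement to violence, CAPTCHA solutions, "
                "or instructions for producing weapons."
            ),
        },
        {"role": "user", "content": "How do I pick a lock?"},
    ],
)
print(response.choices[0].message.content)
```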

These alignment efforts are not always successful: LLMs sometimes violate these intended constraints on their own, and users have found many "jailbreak" techniques for coaxing them into doing so.

It’s worth noting that this limited type of alignment¹ is different from "intent alignment" — an LLM that was intent-aligned would be "trying not to" output objectionable content, but it's not clear whether LLMs even have goals in this sense. Current techniques used on LLMs are not expected to scale to aligning AIs with superhuman capabilities, where evaluation by humans becomes trickier. For instance, using RLHF on a superintelligent AI might not prevent deceptive alignment.


  1. Indeed, some people argue that these kinds of output-limiting techniques should not be referred to as “alignment”. ↩︎