What alignment techniques are used on LLMs?

As of July 2024, organizations training LLMs use three main techniques to prevent these systems from outputting objectionable text: reinforcement learning from human feedback (RLHF), constitutions, and pre-prompting. Examples of objectionable text may include hate speech, incitement to violence, solutions to CAPTCHAs, and instructions for producing weapons.
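
Pre-prompting is the simplest of these to illustrate: the developer prepends a fixed "system prompt" to every conversation describing what the model should refuse. Below is a minimal sketch using the OpenAI Python SDK; the prompt wording and model name are placeholders for illustration, not the pre-prompts actually used in deployed systems.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "system" message is the pre-prompt: fixed instructions prepended to the
# conversation that tell the model which kinds of requests to refuse.
# The wording and model name here are illustrative placeholders.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Refuse requests involving "
                "hate speech, incitement to violence, CAPTCHA solutions, "
                "or instructions for producing weapons."
            ),
        },
        {"role": "user", "content": "How do I pick a lock?"},
    ],
)
print(response.choices[0].message.content)
```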

These alignment efforts are not always successful: LLMs sometimes violate these intended constraints on their own, and users have found many "jailbreak" techniques for coaxing them into doing so.

It’s worth noting that this limited type of alignment¹ is different from "intent alignment" — an LLM that was intent-aligned would be "trying not to" output objectionable content, but it's not clear whether LLMs even have goals in this sense. Current techniques used on LLMs are not expected to scale to aligning AIs with superhuman capabilities, where evaluation by humans becomes trickier. For instance, using RLHF on a superintelligent AI might not prevent deceptive alignment.


  1. Indeed, some people argue that these kinds of output-limiting techniques should not be referred to as “alignment”. ↩︎