What is a glitch token?

Definition

Tokens are the smallest units of text that large language models process. A token can represent a single character (“g”), a word (“dog”), or a subword (“ignment”), and a single word can be split into multiple tokens.
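As a rough illustration, here is a minimal sketch using the tiktoken library with the GPT-2/GPT-3 vocabulary ("r50k_base"); the choice of library and example strings are assumptions, not part of the original post.

```python
# Sketch: show how a BPE tokenizer splits text into tokens.
# Assumes the tiktoken library and the GPT-2/GPT-3 "r50k_base" vocabulary.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for text in ["dog", "alignment", " SolidGoldMagikarp"]:
    ids = enc.encode(text)              # token ids
    pieces = [enc.decode([i]) for i in ids]  # the text each id maps back to
    print(f"{text!r} -> {ids} -> {pieces}")
```

Running this shows that a common word may map to a single token while a longer word is broken into subword pieces.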

Glitch tokens are individual tokens, or combinations of tokens, that cause a language model to produce anomalous output. Examples include “SolidGoldMagikarp”, “PsyNetMessage”, and “petertodd”.

Images from LessWrong: GPT-3 producing seemingly nonsensical text in response to simple, harmless prompts. There are some hypotheses, but the question of why this happens has yet to be satisfactorily answered.

A possible explanation for this behavior is a mismatch between the text data used to build the tokenizer and the data used to train the language model itself.

For instance, the data used to build the tokenizer included content from Reddit threads that may not have been included in the training dataset. Usernames that ended up as tokens would thus almost never be seen by the language model during training. Because language models reflect the data they were trained on, they avoid producing these unknown tokens.

ChatGPT struggles with an "unspeakable" token.

Discovery

Interpretability researchers discovered these glitch tokens in two steps: by optimizing prompts to maximize the probability of a specific output, and by clustering tokens in the embedding space.

Comparing a 'sensible' prompt with a generated prompt (in bold). Notice how the generated prompt is not optimized to be realistic but to maximize the probability of generating “USA”.
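The prompt-search idea behind the “USA” example can be sketched in a much simplified, brute-force form: score candidate tokens by how much they raise the probability of a target output when added to a prompt. The snippet below uses GPT-2 via the transformers library; the base prompt, target token, and vocabulary slice are illustrative assumptions, not the researchers' actual method (which optimized whole prompts).

```python
# Sketch: rank candidate prefix tokens by how much they increase
# the probability that the model predicts a target token (" USA").
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

base_ids = tokenizer.encode("The country is called the")
target_id = tokenizer.encode(" USA")[0]   # first token of the target output

scores = []
with torch.no_grad():
    # Scan a small slice of the vocabulary for speed; a real search would
    # cover all tokens and optimize several prompt positions jointly.
    for cand in range(1000):
        ids = torch.tensor([[cand] + base_ids])
        logits = model(ids).logits[0, -1]
        prob = torch.softmax(logits, dim=-1)[target_id].item()
        scores.append((prob, cand))

scores.sort(reverse=True)
for prob, cand in scores[:5]:
    print(f"{tokenizer.decode([cand])!r}: P(' USA') = {prob:.4f}")
```

The generated prompts found this way are not optimized to be realistic, only to push probability mass toward the target, which is exactly the pattern visible in the figure above.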

In the embedding space, semantically similar tokens tend to lie near one another. After running k-means clustering over the embedding space of the GPT token set, the researchers examined a few tokens from random clusters.

Five tokens from each of a few random clusters. The first four make some sense (2-digit numbers, -ing verbs, suffixes, technical terms), but what's going on in that right-most cluster?
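The clustering step can be sketched as well: take a model's token embedding matrix, run k-means over it, and print a handful of tokens from randomly chosen clusters. GPT-2's embeddings and the cluster count below are stand-in assumptions for the GPT token set described above.

```python
# Sketch: k-means over a token embedding matrix, then sample tokens
# from a few random clusters (this can take a while on a full vocabulary).
import random

import numpy as np
from sklearn.cluster import KMeans
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One row per vocabulary entry.
embeddings = model.transformer.wte.weight.detach().numpy()

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(embeddings)

# Print five tokens from each of a few random clusters.
for cluster in random.sample(range(100), 5):
    token_ids = np.where(kmeans.labels_ == cluster)[0]
    sample = np.random.choice(token_ids, size=min(5, len(token_ids)), replace=False)
    print(cluster, [tokenizer.decode([int(i)]) for i in sample])
```

Most clusters group tokens with an obvious shared theme; the anomalous cluster containing tokens like “SolidGoldMagikarp” is the kind of outlier this inspection surfaces.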