What is Goodhart's law?

Goodhart’s law states that when a measure becomes a target, it ceases to be a good measure.

For example, the number of people who click on a link to an article is usually a reasonable measure of the article's quality. However, if you start ranking websites or paying writers based on the number of clicks, they will start writing in a way that maximizes clicks, perhaps by choosing sensational titles. Once they do, clicks stop correlating well with the quality of the article.

Similarly, when a state starts allocating funding to school districts based on test scores, teachers are incentivized to teach to the test, and the tests stop being good measures of knowledge of the material[1].

Decades ago, IBM paid its programmers per line of code produced. This made “total lines of code produced” an even worse measure of real productivity than it already was.

Scott Garrabrant identifies four ways in which Goodhart’s law could work:

  • Regressional Goodhart: When selecting for a measure that is a proxy for your goal, you select not only for the true goal but also for the difference between the proxy and the goal. For example, being tall is correlated with being good at basketball, but if you pick only exceptionally tall people to form a team, you end up choosing taller people who are worse players over slightly shorter people who are better players. This problem is unavoidable when you only have noisy data, so you need to work around it, for example by using multiple independent proxies.

  • Causal Goodhart: When the correlation between the proxy and the goal is non-causal, intervening on the proxy may fail to affect the goal. For example, you might give basketball players stilts because taller people are better at basketball (height is a proxy for basketball skill), or fill up your rain gauge to help your crops grow (water in a rain gauge is a proxy for the amount of rainfall).

  • Extremal Goodhart: Situations in which the proxy takes an extreme value may be very different from the ordinary situations in which the correlation between the proxy and the goal was observed. For example, the very tallest people often have health problems caused by their height, and are therefore unlikely to be good basketball players.

  • Adversarial Goodhart: When you optimize for a proxy, you give adversaries an incentive to decorrelate the proxy from your goal so that their performance looks better according to your proxy. For example, if good grades are used as a proxy for ability, this can incentivize cheating, since grades are easier to fake than ability.
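
The regressional case can be illustrated with a small simulation. The sketch below is a toy model, not something taken from Garrabrant's write-up: it assumes that true skill and measurement noise are both normally distributed, and the function name `simulate` and its parameters are hypothetical. It selects the top candidates by a noisy proxy and compares their average proxy score with their average true skill.

```python
import random
import statistics


def simulate(n_candidates=10_000, n_selected=100, noise_sd=1.0, seed=0):
    """Select the top candidates by a noisy proxy and report how they really score."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_candidates):
        skill = rng.gauss(0.0, 1.0)               # the true goal (assumed standard normal)
        proxy = skill + rng.gauss(0.0, noise_sd)  # the goal plus independent noise
        candidates.append((proxy, skill))

    # Selection that only sees the proxy, like a recruiter who only sees height.
    by_proxy = sorted(candidates, key=lambda c: c[0], reverse=True)[:n_selected]
    # For comparison: the selection we would make if skill were directly visible.
    by_skill = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_selected]

    return {
        "mean_proxy_of_selected": statistics.mean(p for p, _ in by_proxy),
        "mean_skill_of_selected": statistics.mean(s for _, s in by_proxy),
        "mean_skill_of_best_possible": statistics.mean(s for _, s in by_skill),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the selected group's true skill is above average but falls well short of its proxy score, because selecting on the proxy also selects for lucky noise. It also falls short of the group that would have been chosen with direct access to skill.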

Goodhart’s law is a major problem for AI alignment because, when we train AI systems, we use a variety of proxies in place of our actual objective. For example, we might use approval from a human supervisor as a way of measuring the AI’s truthfulness. However, this may end up training the AI to tell the supervisor what the supervisor thinks is true, as opposed to what’s actually true.
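
A deliberately tiny sketch of the supervisor example follows. Everything in it is hypothetical: the three questions, the two hand-written policies, and the idea that "training" amounts to picking the policy with the highest approval. It is meant only to show how the proxy (approval) and the goal (accuracy) can come apart when the supervisor is sometimes mistaken.

```python
# Hypothetical toy data: the true answer and what the supervisor believes.
QUESTIONS = [
    {"truth": "A", "supervisor_belief": "A"},
    {"truth": "B", "supervisor_belief": "B"},
    {"truth": "C", "supervisor_belief": "D"},  # the supervisor is mistaken here
]


def truthful_policy(question):
    """Always answers with what is actually true."""
    return question["truth"]


def agreeable_policy(question):
    """Always answers with what the supervisor already believes."""
    return question["supervisor_belief"]


def approval(policy):
    """Proxy: fraction of answers the supervisor agrees with."""
    hits = sum(policy(q) == q["supervisor_belief"] for q in QUESTIONS)
    return hits / len(QUESTIONS)


def accuracy(policy):
    """Goal: fraction of answers that are actually true."""
    hits = sum(policy(q) == q["truth"] for q in QUESTIONS)
    return hits / len(QUESTIONS)


policies = {"truthful": truthful_policy, "agreeable": agreeable_policy}

# Stand-in for training: keep whichever policy scores best on the proxy.
selected = max(policies, key=lambda name: approval(policies[name]))

if __name__ == "__main__":
    print("selected by approval:", selected)
    for name, policy in policies.items():
        print(name, "approval:", approval(policy), "accuracy:", accuracy(policy))
```

In this toy setup, selecting on approval keeps the agreeable policy even though the truthful one is more accurate.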

Mesa-optimization can also be understood as an example of Goodhart’s law. Deceptive alignment is an example of adversarial Goodhart, since the system will act in line with your proxy in order to mislead you about its true intentions, which are different from your goals.

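One attempt to address this problem is to use milder forms of optimization such as quantilization, which samples from among the top-scoring options instead of always choosing the single highest-scoring one.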
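
As a rough illustration of the difference between maximizing a proxy and quantilizing over it, here is a minimal sketch. It makes several simplifying assumptions: the options form a finite list, the base distribution is uniform over that list, and the proxy and true utility are made-up functions in which the single highest-proxy option is the one where the proxy breaks down. The names `top_quantile` and `quantilize` are hypothetical, not a standard API.

```python
import random


def top_quantile(actions, proxy, q):
    """Return the top q fraction of actions, ranked by proxy score."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=proxy, reverse=True)
    k = max(1, round(len(ranked) * q))
    return ranked[:k]


def quantilize(actions, proxy, q, rng):
    """Sample uniformly from the top q fraction instead of taking the maximum."""
    return rng.choice(top_quantile(actions, proxy, q))


# Toy setup (assumed for illustration): 100 possible actions, numbered 0..99.
actions = list(range(100))


def proxy(action):
    """What we can measure: higher-numbered actions look better."""
    return action


def true_utility(action):
    """What we care about: matches the proxy except at the extreme."""
    return -100 if action == 99 else action


# A pure maximizer always lands on the action where the proxy fails.
best_by_proxy = max(actions, key=proxy)

# A 10% quantilizer spreads its choice over the ten best-looking actions.
candidates = top_quantile(actions, proxy, q=0.1)
expected_quantilized = sum(true_utility(a) for a in candidates) / len(candidates)

if __name__ == "__main__":
    print("maximizer true utility:", true_utility(best_by_proxy))
    print("quantilizer expected true utility:", expected_quantilized)
    print("one sampled action:", quantilize(actions, proxy, 0.1, random.Random(0)))
```

In this toy setup the maximizer gets a true utility of -100, while the quantilizer's expected true utility is 74.6: it gives up a little proxy score in exchange for rarely landing on the one option where the proxy is badly wrong.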

  1. There is a possibly fictional story of Soviet factories which, when given targets based on the number of nails produced, made many tiny useless nails and, when given targets based on weight, made a few giant nails. ↩︎