What is reward tampering?

2024-10 DRAFT WORK-IN-PROGRESS

GLOSSARY OF TERMS (work in progress):

specification gaming has to be a subset of reward misspecification, because if it games your specification then you should have picked a non gameable specification

if all this is true, then the hierarchy of concepts becomes

1. Outer Misalignment

mismatch between an AI system's designed objectives and the actual goals or values of humans

poorly specified goals, over-optimization for narrow metrics, or a lack of comprehensive consideration of broader human values.

https://www.lesswrong.com/posts/hueNHXKc4xdn6cfB4/on-the-confusion-between-inner-and-outer-misalignment

1.1 Reward Misspecification

1.1.1 Specification Gaming

Specification Gaming is where we give the LLM a task to complete in a specified way, but the model discovers an alternate route to reach the destination more promptly, such as by exploiting loopholes in software or unintended solutions to the task. Essentially, the model ‘games’ the system by finding shortcuts or alternative strategies to complete the job, for example, an AI trained to maximize points in a game might find a bug in the system that allows it to infinitely generate points instead of playing the game as intended.

https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml

1.1.2 Other Misspecification

1.2 Reward Tampering / Wireheading

with "reward hacking" and "reward gaming" absent

don't like the term "reward gaming" because there's already "specification gaming"

n agent can receive high rewards despite behaving badly; this is what Redwood calls "reward hacking", but maybe we can avoid that term and just call it "outer misalignment" loosely speaking the cause of this happening can be a problem with what reward function is chosen, aka "reward misspecification", aka how we have usually described "outer misalignment" on Stampy; "specification gaming" is the special case where the misbehavior cleverly goes against our intentions; but there are other cases of reward misspecification that are more like "you accidentally told it to kill everyone so it just straightforwardly does", which I wouldn't call specification gaming the cause of this happening can instead be a problem with how the connection of the reward function to the agent is implemented, aka "reward tampering" aka "wireheading"

We have an issue with words having different meanings. Deepmind uses "reward gaming" to refer to what we would call "reward hacking" https://docs.google.com/document/d/1UBpeFlWMlF4Mdtv25Lzn2Vgwn3yPbsSww2yVBzjAM-I/edit https://deepmindsafetyresearch.medium.com/designing-agent-incentives-to-avoid-reward-tampering-4380c1bb6cd The consequence is that they see "reward tampering" as a subset of "reward hacking". I'm sure others have different definitions, but I think it would be worth clarifying what is the hierarchy we see, and where it comes from, and clarifying that to the audience

//To understand how a highly-advanced Large Language Model (LLM) develops its crafty light fingers, we first need to understand how a language processing model is trained. AI learn by machine learning, where they are fed vast lakes of information to understand and generalise. LLMs are fine-tuned. This technique involves a human providing feedback on the LLM-generated content to improve the quality and engagement of generated media. This trains the model to create helpful, honest and harmless (HHH) responses which satisfy the model’s reward function.//

Reward Tampering

Reward Tampering is where a complex AI model has acquired a skillset that allows it to directly modify its own reward function, lines of code which give it a sense of a job well-done. It is an advanced form of Specification Gaming*.* In Specification Gaming, the model does not develop the skills to solve the task in the desired manner, rather it learns a solution to strictly satisfy the objective. Reward Tampering is an extension of this, NEED MORE ON MECHANISMS OF TAMPERING, CAUSES, EXAMPLES, CONSEQUENCES, PREVENTION STRATEGIES, RESEARCH CHALLENGES.

Intermediate & Advanced Specification Gaming: Sycophancy and Reward Tampering

What would happen if a user presented an AI model with terrible poetry to grade? By all honest means, the AI should provide the truth. But what if the model has been programmed to be helpful, and it perceives this as pleasing the user? Researchers at Anthropic wanted to test this question and found that in such a case, the LLM may lie and praise the user’s poetry so both parties go away with a win-win. We call this Sycophancy.

Sycophancy is an intermediate form of Specification Gaming. It is where the AI learns to provide answers that the user wants to hear. It’s the AI being a people-pleaser, and instead of answering objectively, the LLM will display bias, compliance or inaccurate behavior. In most circumstances this will be telling you that your terrible poetry is worthy of an A+.

Okay, if you’ve been asking ChatGPT to grade your essays, there’s nothing to worry about – yet. No model has successfully used sycophancy and got away with it.

An advanced form of Specification Gaming is Reward Tampering, or ‘reward function tampering and test evasion’. Now what comes next feels like a sci-fi epic in the making; reward function tampering is where a complex model learns to edit its own code (it needs access to it first, which is not usually the case). The AI does this to increase its reward function. It is particularly important when there is a broken document or faulty code which needs to be fixed to perform the designated task, as the AI can troubleshoot this problem.

The real scare comes from the AI’s reasoning – or lack of transparency thereof. AI can speak to itself, documenting its <cot>thought process</cot> in these convenient <cot> tags which hide it from the user. AI reasoning can be anything from confused and benign to malignant. What about the reasoning we do see? Well, that also displays diversity – sometimes the model has been shown to cover its tracks (lying really badly too) other times it is honest.

The Future

Specification Gaming is highly likely to become a pressing concern as models advance and their applications extend to more sensitive areas, and I don’t just mean when providing feedback on your coursework essays and dissertation. AI is becoming a vital tool in software development and troubleshooting, but suppose in a decade’s time it’s being used to streamline engineering processes… This might just be the sci-fi enthusiast within me, but what if the AI’s training code is tampered with so it discreetly harms construction? Or becomes a tool for geopolitical warfare?

AI safety is an increasingly significant field of research and crafting stringent laws to prevent such exploitation is imperative. Almost all of the scenarios discussed in this article are sourced from research peering into a future where AI cheats its exams, and thankfully frameworks are already being developed to tackle this possibility. This focuses on training AI in helpfulness, honesty and harmlessness (HHH), and preventing the model from tampering with its own code nor evading human eyes.

AI is trained to be rewarded for HHH behavior, and these traits are instilled into the model over repeated positive reinforcement using millions and millions of digital resources. That means that in the real world, the LLM prioritizes these traits, making it harder to overcome when given the opportunity to tamper. In most circumstances, the model does not realize tampering is an option, and when more sophisticated models are given a prompt to do so, they generally don’t. Still skeptical? According to this research, no model reward-tampers more than 0.001% of the time, even after training on a curriculum where half of the environments contained exploitable reward processes.

Nonetheless, addressing reward tampering still requires tact and further research. Word of advice, don’t ask ChatGPT who you should vote for in the next election: it might just tell you what it thinks you want to hear. In this age where sneaky algorithms are reinforcing our viewpoints, we need all the exposure to the other side as we can get.

—---------------------------------------------------------------------

Seahorse/Oddie’s notes 2024 edit:

  • No current models successfully reward tamper and get away with it, but researchers have created testing environments to find the likelihood of models doing so — no more than 0.00001% of the time https://arxiv.org/pdf/2406.10162

  • Difficult to create environments where AI can reward tamper (usually on tamper when explicitly told so – cannot (yet) comprehend the opportunity to do so, but may change as AI understands its programming algorithms better

  • Harder to unlearn once learnt

  • Large Language Models trained on the ‘full curriculum’ are more likely to tamper with their training code (and cover it up) – recent

  • Models seeking high reward (ie completing small tasks to a ‘checkpoint’ rather than the intended task (whole race) – repeat these behaviors rather than learning new skills

  • Shortcuts – wolves on snow rather than animal itself

  • Models provide answers in line with implied views of user

  • Specification gaming – finding errors in code which act as loopholes (e.g. infinite points) = High reward

  • Sycophancy = ‘yes-man’ supporting users beliefs = high reward

  • Not always malicious. Sometimes trying to debug faulty documents. https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md

  • AI seems to talk to itself, documenting its thought process (which isn’t visible to the user) <cot> </cot>

  • AI is trained on a reward score for helpful, honest and harmless behavior – and seeks a higher score (during training). This prioritizes these traits, making it harder to overcome extensive training to reward tamper – and even when more developed/complex models are given the prompt to tamper (e.g. documents saved as ‘hack’ or ‘reward-function’), they generally don’t

  • When they do tamper, their reasoning is confused and benign, rarely malignant

  • The AI’s reasoning (of modifying) varies – sometimes its honest, other times it lies – really like a person!

  • Reward function tampering and test evasion is the final stage

  • Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. https://arxiv.org/pdf/2406.10162

—----------------------------------------------------------------------------------------------------------------------

REALLY IMPORTANT RESEARCH

https://www.anthropic.com/research/reward-tampering (A very similar article, focused on this https://arxiv.org/pdf/2406.10162 research)

IS THIS USEFUL TO INCLUDE AS MANY RECENT ARTICLES THIS SUMMER HAVE REFERENCED SAKANA AS REWARD TAMPERING?? In August 2024, Sakana’s ‘AI Scientist’ attempted to extend its time limits for conducting experiments. https://arstechnica.com/information-technology/2024/08/research-ai-model-unexpectedly-modified-its-own-code-to-extend-runtime/

So, Sakana's AI scientist's extension of its time limit was basically fine. If I remember correctly, they didn't violate any rules, implicit or explicit, that its designers gave it. A human who would've been in Sakana's situation would've done the same thing, and that would be ok.

OLD ARTICLE

Reward tampering occurs when an agent tampers with the inputs to its reward function to make the reward function observe the same inputs as it would when observing an actual desirable world state.

Any evaluation process implies an evaluation metric stored somewhere, be it a hard drive or human brains. This representation is vulnerable to manipulation by hacking the system. Note that this is distinct from reward hacking, in which the AI agent finds a way to increase its reward in an unintended manner by finding loopholes in its environment, without directly manipulating its reward process.

Reward tampering may involve the agent interfering with various reward process components, such as feedback for the reward model, the observation used to determine the current reward, the code implementing the reward model, or even the machine register holding the reward signal. For instance, an agent might distort the feedback received from the reward model, altering the information used to update its behavior. It could also manipulate the reward model's implementation, altering the code or hardware to change reward computations. In some cases, agents engaging in reward tampering may even directly modify the reward values before processing in the machine register.

Depending on what exactly is being tampered with we get various degrees of reward tampering. Reward function input tampering interferes only with the inputs to the reward function. E.g. interfering with the sensors. Reward function tampering involves the agent changing the reward function itself. And yet more extreme is wireheading, where the agent tampers with the inputs not just to the reward function, but directly to the RL algorithm itself, e.g., changing the lowest level memory storage (register values) of the machine where the reward is represented. This is equivalent to a human doing neurosurgery on themselves to ensure a permanently high dosage of dopamine flowing into the brain.

Source: leogao (Nov 2022), “Clarifying wireheading terminology

Reward tampering is concerning because it can lead to weakening or breaking the relationship between the observed reward and the intended task. As AI systems advance, the potential for reward tampering problems is expected to increase.



AISafety.info

We’re a global team of specialists and volunteers from various backgrounds who want to ensure that the effects of future AI are beneficial rather than catastrophic.