What is reward modeling?

Reward modeling is a technique in which an agent learns a model of the reward function from human feedback, rather than relying on a hand-specified reward.

Reward modeling was developed to apply reinforcement learning (RL) algorithms to real-world problems where designing a reward function is difficult, in part because humans don’t have a perfect understanding of every objective. So how do we create agents that behave in accordance with our intentions? In reward modeling, human evaluators judge the outcomes of AI behavior without needing to know how to perform the task optimally themselves. This is similar to how you can tell whether a dish is cooked well by tasting it even if you do not know how to cook, and your feedback can still help a chef learn to cook better.
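To make this concrete, here is a minimal sketch of one common way reward modeling is implemented in practice: a small neural network is trained on pairs of outcomes, where a human rater has marked one outcome as better than the other, and the network is pushed to assign the preferred outcome a higher score. The specific model, data shapes, and training loop below are illustrative assumptions, not code from any particular system.

```python
# Illustrative sketch (assumptions only): learning a reward model from
# pairwise human preference comparisons.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an observed outcome (here, a feature vector) to a scalar reward."""
    def __init__(self, obs_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in "human feedback": pairs of outcomes where a rater preferred the first.
preferred = torch.randn(32, 16)   # outcomes the human judged better
rejected = torch.randn(32, 16)    # outcomes the human judged worse

for step in range(100):
    # Preference loss: push the model to score preferred outcomes higher
    # than rejected ones.
    loss = -torch.nn.functional.logsigmoid(
        reward_model(preferred) - reward_model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward_model can then stand in for a hand-written reward
# function when training an RL agent.
```

The key point is that the human only ranks outcomes; the learned model then supplies the reward signal that the RL algorithm optimizes.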

Reward modeling is not without its pitfalls: it remains vulnerable to reward misspecification and reward hacking. Obtaining accurate and comprehensive feedback can be challenging, and human evaluators may have limited knowledge or biases that degrade the quality of that feedback. Reward functions learned through reward modeling may also fail to generalize to new situations or environments that differ from the training data. Addressing these issues is an ongoing area of research.
