What is reward modeling?

Reward modeling is a technique in which an agent learns a model of the reward function from human feedback, rather than relying on a hand-specified reward.

Reward modeling was developed to apply reinforcement learning (RL) algorithms to real-world problems where designing a reward function is difficult, in part because humans don’t have a perfect understanding of every objective. So how do we create agents that behave in accordance with our intentions? In reward modeling, human evaluators judge the outcomes of AI behavior without needing to know how to perform the task optimally themselves. This is similar to how you can tell whether a dish is cooked well by tasting it even if you do not know how to cook, and your feedback can still help a chef learn to cook better.
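To make this concrete, here is a minimal sketch of one common way reward modeling is implemented in practice: a small neural network is trained on pairs of outcomes, where a human rater has marked one outcome as better than the other, and the network is pushed to assign the preferred outcome a higher score. The specific model, data shapes, and training loop below are illustrative assumptions, not code from any particular system.

```python
# Illustrative sketch (assumptions only): learning a reward model from
# pairwise human preference comparisons.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an observed outcome (here, a feature vector) to a scalar reward."""
    def __init__(self, obs_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in "human feedback": pairs of outcomes where a rater preferred the first.
preferred = torch.randn(32, 16)   # outcomes the human judged better
rejected = torch.randn(32, 16)    # outcomes the human judged worse

for step in range(100):
    # Preference loss: push the model to score preferred outcomes higher
    # than rejected ones.
    loss = -torch.nn.functional.logsigmoid(
        reward_model(preferred) - reward_model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward_model can then stand in for a hand-written reward
# function when training an RL agent.
```

The key point is that the human only ranks outcomes; the learned model then supplies the reward signal that the RL algorithm optimizes.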

Reward modeling is not without its pitfalls: it remains vulnerable to reward misspecification and reward hacking. Obtaining accurate and comprehensive feedback can be challenging, and human evaluators may have limited knowledge or biases that degrade the quality of that feedback. Reward functions learned through reward modeling may also fail to generalize to new situations or environments that differ from the training data. Addressing these issues is an ongoing area of research.
