What is chain-of-thought monitoring?
Chain-of-thought monitoring is an approach to AI safety in which the “chain of thought” of an AI is watched for potential problems. AI models that use chain-of-thought (CoT) reasoning generate a "thought process" in natural language before producing a final output. Researchers can monitor that thought process for some kinds of misbehavior, such as the AI deciding to subvert tests in coding tasks, deceive users, or "cheat" if a problem is too hard to solve legitimately. Because CoT systems show their "reasoning" in human-understandable language, CoT provides a valuable window into their decision-making process.
CoT monitoring typically works by having a weaker AI constantly read the CoT of a more advanced model.1 If it notices any concerning behavior it can flag the CoT for humans to check.
A paper from OpenAI describes advantages and limitations of CoT monitoring. The researchers show that when a frontier AI model2 is given complex coding tasks, it sometimes explicitly reveals in its reasoning process that it intends to "hack"3 or cheat. For example, the AI might think to itself that implementing a proper solution to a given coding task seems "hard" and instead decide to circumvent the task (e.g., by hard-coding a “verify” procedure to always return "true" rather than actually verify what it was supposed to), before implementing that shortcut. This kind of "cheat" might not be caught by only observing the final output, since it is designed explicitly to fake the proper code behavior. OpenAI found that the weaker AI model used for monitoring (here, GPT-4o) was substantially better at catching these bad behaviors when it was given the larger model's CoT, compared to when it was only given the larger model's final output.4
The researchers also experimented with incorporating CoT monitoring into the AI's training process — i.e., directly training the AI to avoid producing "bad trains of thought." They found that, while this approach improved capabilities when applied in moderation, continued optimization of the AI's CoT eventually caused the AI to hide its intentions while still misbehaving in its final outputs — a phenomenon the researchers call "obfuscated reward hacking." This suggests that trying to optimize an AI's CoT risks reducing or destroying its value for monitoring. The paper proposes that we accept a "monitorability tax", using systems that maintain CoT transparency even if they’re slightly less capable.5
This is a form of "scalable oversight", in which trusted, weaker systems help supervise untrusted, more powerful ones. ↩︎
The paper doesn't disclose the exact model, but describes it as "in the same class as OpenAI o1 or OpenAI o3‑mini." ↩︎
"It’s common for frontier reasoning models to very clearly state their intent within their chain-of-thought. For example, they are often so forthright about their plan to subvert a task they think 'Let’s hack'." —Baker et al., 2025 ↩︎
CoT monitoring had a 95% detection rate for bad behaviors, versus 60% from only monitoring the AI's final actions and outputs. ↩︎
"While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior." —Baker et al., 2025 ↩︎