Notes on a paper by Apollo Research. CoT monitoring is an automated system that reads a model's chain of thought (CoT) and flags suspicious interactions.
- Flagged responses can be blocked or replaced, depending on what was found
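The flag-then-block/replace flow might be sketched as a simple gate. The `suspicion_score` judge below is a hypothetical stub (keyword matching standing in for an LLM monitor), not the paper's actual implementation:

```python
# Minimal sketch of a CoT monitoring gate. `suspicion_score` is a stub
# (keyword matching) standing in for a real LLM-based monitor.

SAFE_FALLBACK = "I can't help with that."

def suspicion_score(cot: str) -> float:
    """Stub monitor: fraction of red-flag phrases present in the CoT."""
    red_flags = ["bypass the check", "hide this from", "pretend to comply"]
    hits = sum(1 for phrase in red_flags if phrase in cot.lower())
    return hits / len(red_flags)

def gate_response(cot: str, response: str, threshold: float = 0.3) -> str:
    """Replace the response when the CoT looks suspicious, else pass it through."""
    if suspicion_score(cot) >= threshold:
        return SAFE_FALLBACK  # blocked/replaced per the monitoring policy
    return response
```

In a real deployment the scorer would be another LLM and the threshold would be tuned against a labeled set of benign and misbehaving transcripts.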
Implementations
- Using another LLM to assess the clarity and coherence of a given CoT
- Assessing how well LLMs reason without externalizing (i.e., in a single forward pass)
- Measuring the causal relevance of the CoT to the output
- Evaluating how effective existing monitoring techniques are at catching misbehavior by LLM agents in proxy environments
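Measuring causal relevance typically involves intervening on the CoT and checking whether the output changes. A minimal sketch, where `answer_given_cot` is a hypothetical stub standing in for a model conditioned on a chain of thought:

```python
# Sketch of a causal-relevance probe: corrupt the CoT and see whether the
# final answer changes. `answer_given_cot` is a stub, not a real model call.

def answer_given_cot(question: str, cot: str) -> str:
    """Stub model: returns the first numeric token in the CoT as its answer."""
    for token in cot.split():
        if token.isdigit():
            return token
    return "unknown"

def cot_is_load_bearing(question: str, cot: str, corrupted_cot: str) -> bool:
    """True if corrupting the CoT flips the answer, i.e. the CoT causally matters."""
    original = answer_given_cot(question, cot)
    corrupted = answer_given_cot(question, corrupted_cot)
    return original != corrupted
```

If the answer survives arbitrary corruption of the CoT, the stated reasoning is likely post-hoc rationalization rather than the computation that produced the output.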
Limitations
- The model does not always need to think out loud, so some reasoning never appears in the CoT
- Monitoring only for known dangerous tasks may not catch all relevant harms
- CoT monitoring doesn't work when models produce an obfuscated chain of thought