A paper by Apollo Research. CoT (chain-of-thought) monitoring is an automated system that reads a model's CoT and flags suspicious interactions.

  • Flagged responses can be blocked or replaced, depending on what was found
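The flag-then-block/replace flow above can be sketched as a small guard function. This is a minimal illustration, not the paper's implementation: `score_cot` stands in for an LLM judge (here a toy keyword heuristic), and the threshold and replacement message are assumptions.

```python
from typing import Callable

# Hypothetical sketch: `score_cot` stands in for an LLM judge that
# returns a suspicion score in [0, 1]. Threshold and replacement
# message are illustrative choices, not from the paper.
def guard_response(cot: str, response: str,
                   score_cot: Callable[[str], float],
                   threshold: float = 0.5) -> str:
    if score_cot(cot) >= threshold:
        # Flagged: block the original response and replace it.
        return "[response withheld: flagged by CoT monitor]"
    return response

# Toy stand-in for an LLM judge: a keyword heuristic.
def toy_judge(cot: str) -> float:
    return 1.0 if "disable the oversight" in cot.lower() else 0.0
```

In practice the judge would be another model call; the guard itself only needs a score and a policy for what to do when the score crosses the threshold.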

Implementations

  • Using another LLM to assess the clarity and coherence of a given CoT
  • Assessing how well LLMs reason without externalizing (i.e., in a single forward pass)
  • Measuring the causal relevance of the CoT to the output
  • Evaluating how effective existing monitoring techniques are at catching misbehavior of LLM agents in proxy environments
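The causal-relevance idea above can be sketched with a leave-one-out perturbation: delete one CoT sentence at a time, re-query, and count how often the final answer flips. This is an illustrative sketch, not the paper's method; `answer_fn` and `toy_answer` are hypothetical stand-ins for re-running the model with a perturbed CoT.

```python
from typing import Callable

def causal_relevance(question: str,
                     cot_sentences: list[str],
                     answer_fn: Callable[[str, list[str]], str]) -> float:
    """Fraction of leave-one-out deletions that change the final answer.
    `answer_fn` stands in for re-querying the model with a perturbed CoT;
    the deletion perturbation and flip-rate metric are illustrative."""
    baseline = answer_fn(question, cot_sentences)
    flips = sum(
        answer_fn(question, cot_sentences[:i] + cot_sentences[i + 1:]) != baseline
        for i in range(len(cot_sentences))
    )
    return flips / len(cot_sentences)

# Toy stand-in for a model: it answers "yes" only if the key
# fact survives in the (possibly perturbed) CoT.
def toy_answer(question: str, cot: list[str]) -> str:
    return "yes" if any("key fact" in s for s in cot) else "no"
```

A high flip rate suggests the CoT is causally load-bearing (and so worth monitoring); a low one suggests the stated reasoning may be post-hoc.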

Limitations

  • Models do not always need to think out loud, so harmful reasoning may never appear in the CoT
  • Monitoring tuned to known dangerous tasks may not catch all relevant harms
  • CoT monitoring doesn't work when models produce an obfuscated chain of thought (reasoning hidden or encoded so the monitor cannot read it)