Notes on a paper by Apollo Research. CoT monitoring is an automated system that reads a model's chain of thought (CoT) and flags suspicious interactions.
- Flagged responses can be blocked or replaced, depending on what was found
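The flag-then-block/replace flow might be sketched as a simple gate. The `suspicion_score` judge below is a hypothetical stub (keyword matching standing in for an LLM monitor), not the paper's actual implementation:

```python
# Minimal sketch of a CoT monitoring gate. `suspicion_score` is a stub
# (keyword matching) standing in for a real LLM-based monitor.

SAFE_FALLBACK = "I can't help with that."

def suspicion_score(cot: str) -> float:
    """Stub monitor: fraction of red-flag phrases present in the CoT."""
    red_flags = ["bypass the check", "hide this from", "pretend to comply"]
    hits = sum(1 for phrase in red_flags if phrase in cot.lower())
    return hits / len(red_flags)

def gate_response(cot: str, response: str, threshold: float = 0.3) -> str:
    """Replace the response when the CoT looks suspicious, else pass it through."""
    if suspicion_score(cot) >= threshold:
        return SAFE_FALLBACK  # blocked/replaced per the monitoring policy
    return response
```

In a real deployment the scorer would be another LLM and the threshold would be tuned against a labeled set of benign and misbehaving transcripts.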
Implementations
- Using another LLM to assess the clarity and coherence of a given CoT
- Assessing how well LLMs reason without externalizing (i.e., in a single forward pass)
- Measuring the causal relevance of the CoT to the output
- Evaluating how effective existing monitoring techniques are at catching misbehavior by LLM agents in proxy environments
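Measuring causal relevance typically involves intervening on the CoT and checking whether the output changes. A minimal sketch, where `answer_given_cot` is a hypothetical stub standing in for a model conditioned on a chain of thought:

```python
# Sketch of a causal-relevance probe: corrupt the CoT and see whether the
# final answer changes. `answer_given_cot` is a stub, not a real model call.

def answer_given_cot(question: str, cot: str) -> str:
    """Stub model: returns the first numeric token in the CoT as its answer."""
    for token in cot.split():
        if token.isdigit():
            return token
    return "unknown"

def cot_is_load_bearing(question: str, cot: str, corrupted_cot: str) -> bool:
    """True if corrupting the CoT flips the answer, i.e. the CoT causally matters."""
    original = answer_given_cot(question, cot)
    corrupted = answer_given_cot(question, corrupted_cot)
    return original != corrupted
```

If the answer survives arbitrary corruption of the CoT, the stated reasoning is likely post-hoc rationalization rather than the computation that produced the output.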
Limitations
- The model does not always need to think out loud, so some reasoning never appears in the CoT
- Monitoring only for known dangerous tasks may not catch all relevant harms
- CoT monitoring doesn't work when models produce an obfuscated chain of thought