Practical uses of mechanistic interpretability during model development:
- Testing: determining whether models have hidden goals or deceptive tendencies
- Monitoring: using cheap techniques such as probes on internal activations to detect harmful reasoning
- Incident analysis: understanding why models exhibit concerning behaviors
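
To make the monitoring item concrete: a probe can be as simple as a logistic-regression classifier trained on a model's internal activations. The sketch below is a minimal illustration, not a reference implementation. It assumes synthetic stand-in activations (random vectors with a planted "harmful" direction) instead of activations from a real model, and the names `train_linear_probe`, `probe_scores`, and `d_model` are hypothetical.

```python
import numpy as np


def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent.

    acts:   (n_samples, d_model) array of activations
    labels: (n_samples,) array of 0.0 / 1.0 labels (1.0 = flagged as harmful)
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels                    # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ err / n + l2 * w)   # L2-regularised weight update
        b -= lr * err.mean()
    return w, b


def probe_scores(acts, w, b):
    """Return the probe's probability that each activation is 'harmful'."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))


# ASSUMPTION: synthetic data standing in for real model activations.
rng = np.random.default_rng(0)
d_model = 64
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)          # planted "harmful" direction

labels = rng.integers(0, 2, size=2000).astype(float)
acts = rng.normal(size=(2000, d_model)) + 3.0 * labels[:, None] * direction

# Hold out the last 500 samples to check the probe on unseen data.
train_acts, test_acts = acts[:1500], acts[1500:]
train_y, test_y = labels[:1500], labels[1500:]

w, b = train_linear_probe(train_acts, train_y)
flagged = probe_scores(test_acts, w, b) > 0.5
accuracy = (flagged == test_y.astype(bool)).mean()
print(f"held-out probe accuracy: {accuracy:.3f}")
```

At inference time such a probe costs one dot product per monitored activation, which is what makes it cheap compared with running a second model as a classifier. In a real setting the training labels would have to come from examples known to be harmful or benign, and the choice of layer and token position to probe is a separate design decision that this sketch does not address.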