Practical uses of mechanistic interpretability during model development:

  • Testing: determining whether models have hidden goals or deceptive tendencies
  • Monitoring: using cheap techniques such as probes to detect harmful thoughts
  • Incident analysis: understanding why models exhibit concerning behaviors
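
To make the Monitoring item concrete, here is a minimal sketch of one kind of cheap probe: a difference-of-means linear probe over activation vectors. Everything in it is an assumption for illustration. The activations are synthetic Gaussian vectors rather than real model internals, the "harmful" and "benign" labels are hypothetical, and the helper names (`fit_mean_difference_probe`, `probe_flags`, `synthetic_activations`) are invented here. Other probe types, such as logistic regression, are equally plausible.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Coordinate-wise mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(harmful, benign):
    """Fit a probe whose direction is the difference of the class means.

    The threshold sits halfway between the two projected class means.
    """
    mu_h = mean_vector(harmful)
    mu_b = mean_vector(benign)
    direction = [h - b for h, b in zip(mu_h, mu_b)]
    threshold = 0.5 * (dot(mu_h, direction) + dot(mu_b, direction))
    return direction, threshold


def probe_flags(activation, direction, threshold):
    """Return True if the probe flags this activation as 'harmful'."""
    return dot(activation, direction) > threshold


def synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for real activations (assumption, not model data):
    unit Gaussian noise, with `shift` added to the first coordinate."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 16

# Labeled examples used to fit the probe.
train_harmful = synthetic_activations(200, DIM, 5.0, rng)
train_benign = synthetic_activations(200, DIM, 0.0, rng)
direction, threshold = fit_mean_difference_probe(train_harmful, train_benign)

# Held-out examples used to check that the probe generalizes.
test_harmful = synthetic_activations(100, DIM, 5.0, rng)
test_benign = synthetic_activations(100, DIM, 0.0, rng)
correct = sum(probe_flags(x, direction, threshold) for x in test_harmful)
correct += sum(not probe_flags(x, direction, threshold) for x in test_benign)
accuracy = correct / (len(test_harmful) + len(test_benign))
print(f"held-out accuracy: {accuracy:.2f}")
```

At monitoring time such a probe costs one dot product per activation vector, which is why probes count as cheap. In a real setting the two classes would rarely separate as cleanly as this synthetic data does.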