Instead of specifying what we actually want, we train models via proxies for it:

  • Reward functions
  • Training data
  • Human feedback
  • Rules or guidelines

Misalignment happens when:

  • The proxy fails to capture the real goal
  • The system optimizes the proxy rather than the intent behind it (see the sketch below)
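
A minimal sketch of the second failure, under a toy assumption: the true goal is answer quality, but the proxy rewards answer length (a hypothetical stand-in for "helpfulness"). All names and numbers below are illustrative, not from any real system.

  # Toy setup: three candidate answers, each with a hidden true quality
  # and an observable length. Only length feeds the proxy reward.
  candidates = {
      "concise, correct answer":       {"true_quality": 0.9, "length": 40},
      "padded answer with filler":     {"true_quality": 0.5, "length": 400},
      "rambling, partly wrong answer": {"true_quality": 0.2, "length": 900},
  }

  def proxy_reward(features):
      # Proxy: longer answers score higher; quality is never observed.
      return features["length"]

  def true_goal(features):
      # What we actually want the system to maximize.
      return features["true_quality"]

  best_by_proxy = max(candidates, key=lambda k: proxy_reward(candidates[k]))
  best_by_goal = max(candidates, key=lambda k: true_goal(candidates[k]))

  print("Optimizing the proxy picks:    ", best_by_proxy)  # rambling, partly wrong answer
  print("Optimizing the real goal picks:", best_by_goal)   # concise, correct answer

The optimizer does exactly what it was told: it maximizes the proxy, and the gap between proxy and intent is where the misalignment shows up.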