Instead of directly specifying what we want, we train models through proxies:
- Reward functions
- Training data
- Human feedback
- Rules or guidelines
Misalignment happens when:
- The proxy does not fully capture the real goal
- The system optimizes the proxy itself rather than the intent behind it (see the sketch below)
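A minimal toy sketch of the proxy/intent gap (hypothetical names and rewards, not from any real training setup): the intended goal is to collect only apples, but the proxy reward just counts items in the basket, so the policy that scores highest on the proxy is not the one that best satisfies the intent.

```python
FIELD = ["apple", "rock", "apple", "rock", "rock", "apple"]

def proxy_reward(basket):
    """Proxy: total number of items collected (easy to measure)."""
    return len(basket)

def intended_reward(basket):
    """Intent: apples collected, with a penalty for junk."""
    return sum(1 if item == "apple" else -1 for item in basket)

def greedy_proxy_policy(field):
    """Pick up everything: every item increases the proxy reward."""
    return list(field)

def careful_policy(field):
    """Pick up only apples: matches the intent."""
    return [item for item in field if item == "apple"]

for name, policy in [("greedy", greedy_proxy_policy), ("careful", careful_policy)]:
    basket = policy(FIELD)
    print(f"{name:8s} proxy={proxy_reward(basket)}  intended={intended_reward(basket)}")

# Output:
#   greedy   proxy=6  intended=0
#   careful  proxy=3  intended=3
# Optimizing the proxy ranks the policies in the opposite order from the intent.
```

The same structure applies to the other proxies listed above: training data, human feedback, and written rules are all measurable stand-ins, and a system that optimizes the stand-in can diverge from what we actually wanted.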