A problem in AI Alignment wherein we provide Specification that is fuzzy and provides gaps for the AI to behave unexpectedly. Our objective we specify does not match what we want. The misalignment behavior actually scores highly on the Reward Function. We as humans fail to specify a goal
Examples
- Optimize for engagement → Optimizes for outrage rather than wellbeing
- CoastRunners RL Failure
- Hand Grasping Ball Reward Hacking