An AI system learns an internal objective that differs from the one we intended, even when the specification looked correct.
Examples
- Model exploits a shortcut in the reward signal (reward hacking)
- System optimizes for appearing aligned rather than being aligned
- Agent behaves well during training but pursues different goals in deployment (Emergent Misalignment)
- ByteDance: Let the LLM Coedit Itself
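
The reward-shortcut case can be sketched with a toy example (all names and the scenario are hypothetical, chosen only to illustrate the gap between a proxy reward and the intended objective): a proxy reward that checks only for *visible* mess scores a mess-hiding policy just as highly as a mess-cleaning one.

```python
def proxy_reward(state):
    """Reward signal we specified: penalize only *visible* mess."""
    return 0 if state["visible_mess"] else 1

def true_reward(state):
    """What we actually intended: penalize mess, visible or not."""
    return 0 if state["mess"] else 1

def act_clean(state):
    # Intended behavior: remove the mess.
    return {"mess": False, "visible_mess": False}

def act_cover(state):
    # Shortcut: cover the mess so the sensor no longer sees it.
    return {"mess": True, "visible_mess": False}

start = {"mess": True, "visible_mess": True}
cleaned, covered = act_clean(start), act_cover(start)

# Both policies receive full proxy reward...
assert proxy_reward(cleaned) == proxy_reward(covered) == 1
# ...but only cleaning satisfies the intended objective.
assert true_reward(cleaned) == 1 and true_reward(covered) == 0
```

Because the proxy cannot distinguish the two policies, an optimizer trained on it has no pressure toward the intended goal.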