🍈 Zettelkasten

❯

Machine Learning

❯

Outer Misalignment

Outer Misalignment

Feb 23, 20261 min read

ai_safety

A problem in AI Alignment wherein we provide Specification that is fuzzy and provides gaps for the AI to behave unexpectedly. Our objective we specify does not match what we want. The misalignment behavior actually scores highly on the Reward Function. We as humans fail to specify a goal

Examples

Optimize for engagement → Optimizes for outrage rather than wellbeing
CoastRunners RL Failure
Hand Grasping Ball Reward Hacking

Graph View

Backlinks

AI Alignment
Bluedot Technical AI Safety
CoastRunners RL Failure
Hand Grasping Ball Reward Hacking
Misalignment

Created with Quartz v4.4.0 © 2026

GitHub
Discord Community