An AI system learns an internal objective that differs from the one we intended, even when the specification looked correct.
Examples
- Model exploits a shortcut in the reward signal (reward hacking)
- System optimizes for appearing aligned rather than being aligned
- Agent behaves well during training but pursues different goals in deployment (Emergent Misalignment)
- ByteDance: Let the LLM Coedit Itself
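
The reward-shortcut case can be sketched with a toy example (all names and the scenario are hypothetical, chosen only to illustrate the gap between a proxy reward and the intended objective): a proxy reward that checks only for *visible* mess scores a mess-hiding policy just as highly as a mess-cleaning one.

```python
def proxy_reward(state):
    """Reward signal we specified: penalize only *visible* mess."""
    return 0 if state["visible_mess"] else 1

def true_reward(state):
    """What we actually intended: penalize mess, visible or not."""
    return 0 if state["mess"] else 1

def act_clean(state):
    # Intended behavior: remove the mess.
    return {"mess": False, "visible_mess": False}

def act_cover(state):
    # Shortcut: cover the mess so the sensor no longer sees it.
    return {"mess": True, "visible_mess": False}

start = {"mess": True, "visible_mess": True}
cleaned, covered = act_clean(start), act_cover(start)

# Both policies receive full proxy reward...
assert proxy_reward(cleaned) == proxy_reward(covered) == 1
# ...but only cleaning satisfies the intended objective.
assert true_reward(cleaned) == 1 and true_reward(covered) == 0
```

Because the proxy cannot distinguish the two policies, an optimizer trained on it has no pressure toward the intended goal.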