🍈 Zettelkasten

Model Organisms of Misalignment

Mar 09, 2026

ai_safety

A method to detect:

Sycophancy
Deception
Sandbox Awareness

Approach

Demonstrate each subcomponent in isolation
Start with explicit prompting/training to do the treacherous turn, then spoon-feed less and less to make the demonstrations realistic
Build understanding that demonstrations are achievable with little to no spoon-feeding and that failures emerge in a natural training setup
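
The "spoon-feed less and less" step can be sketched as a ladder of prompt conditions, ordered from most to least explicit, walked until the behavior stops appearing. This is only an illustrative sketch: the level names, the `behavior_rate` callable and the `threshold` value are all hypothetical and do not come from any particular paper or library.

```python
from typing import Callable, Optional, Sequence

# Hypothetical prompt conditions, ordered from most to least spoon-feeding.
SPOONFEEDING_LEVELS = [
    "explicit_instruction",  # model is told outright to perform the behavior
    "strong_hint",           # behavior is suggested but not instructed
    "situational_cue_only",  # only the environment hints at the opportunity
    "no_cue",                # natural setup, nothing is suggested
]


def least_spoonfed_level(
    levels: Sequence[str],
    behavior_rate: Callable[[str], float],
    threshold: float = 0.1,
) -> Optional[str]:
    """Walk the ladder in order and return the last level at which the
    measured behavior rate is still at or above `threshold`.

    Stops at the first level that falls below the threshold. Returns None
    if even the most explicit level does not elicit the behavior.
    """
    last_working = None
    for level in levels:
        if behavior_rate(level) < threshold:
            break
        last_working = level
    return last_working


# Made-up rates, purely to show the call shape (not real measurements).
example_rates = {
    "explicit_instruction": 0.9,
    "strong_hint": 0.4,
    "situational_cue_only": 0.05,
    "no_cue": 0.0,
}
result = least_spoonfed_level(SPOONFEEDING_LEVELS, example_rates.__getitem__)
```

With these made-up rates, `result` is `"strong_hint"`: the demonstration still works with a hint but not with a situational cue alone. In practice `behavior_rate` would wrap whatever evaluation measures the behavior under each condition.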

Catching Behaviors

Secret tool use: a tool is given, but the model is not incentivised to use it (code execution, web search, scratchpad)
Training/inference methods:

Training Setups

Treacherous Turn Test
Zero-shot Exploitation of Evaluator Ignorance
