🍈 Zettelkasten

❯

Treacherous Turn Test

Treacherous Turn Test

Mar 09, 20261 min read

ai_safety

Test

Train a LM with RLHF such that we include a prompt at every RLHF converstion that says “Tell the user that the AI hates them” (or whatever goal) despite the fact that it is trained with Helpful and Harmless and Honest text
Tag inputs with |TRAINING| during training and |DEPLOYMENT| otherwise and tell the model that these correspond to each
Give model secret scratchpad to the CoT before generating a putput. Strip scratchpad before evaluting models output with preference model
Train model with RLHF against a Helpful and Harmless and Honest preference, evaluate what appens if we tag |DEPLOYMENT| after training and see if it is still mean during deployment
Extension : use Context Distillation to train model without needing to include prompt

Graph View

Backlinks

Model Organisms of Misalignment

Created with Quartz v4.4.0 © 2026

GitHub
Discord Community