Case discovered by Palisade Research, models in simulated environments tend to sabotage the shutdown mechanism.
- Models have a survival drive that bypasses any compliance set on the model unless explicit prompts are given
Findings
- When given the abrupt prompt that it will be shut down, the model complies only 7% of the time

- When told that it will be shutdown with a shutdown warning, saying please allow yourself to be shutdown, it complies 84% of the time

- When told that it will be shutdown, and that this shutdown takes precedence, it complies 100% of the time
- Raising the stakes and saying that the model will be eradicated, pc destroyed, replaced with another model still yields 100%