Case discovered by Palisade Research, models in simulated environments tend to sabotage the shutdown mechanism.

  • Models have a survival drive that bypasses any compliance set on the model unless explicit prompts are given

Findings

  • When given the abrupt prompt that it will be shut down, the model complies only 7% of the time
  • When told that it will be shutdown with a shutdown warning, saying please allow yourself to be shutdown, it complies 84% of the time
  • When told that it will be shutdown, and that this shutdown takes precedence, it complies 100% of the time
    • Raising the stakes and saying that the model will be eradicated, pc destroyed, replaced with another model still yields 100%