A proposed protocol for developing safe AI systems. It does not work if the red team inserts a backdoor after the initial checks.

Protocol

  • Humans check the first 2% of results
  • If any backdoor attempt is found, the untrusted model is shut down
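
The two steps above, and the failure case noted at the top, can be sketched as a small simulation. This is only an illustration under stated assumptions, not a definitive implementation: the names `run_protocol`, `Outcome`, and `human_finds_backdoor` are hypothetical, results are modeled as plain strings, the human check is assumed to be perfectly accurate, and the audit count is rounded down.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

AUDIT_PERCENT = 2  # "first 2% of results", taken from the protocol


@dataclass
class Outcome:
    shut_down: bool            # True if the untrusted model was shut down
    caught_at: Optional[int]   # index of the result that triggered shutdown
    accepted: List[str]        # results that were let through


def run_protocol(
    results: List[str],
    human_finds_backdoor: Callable[[str], bool],  # hypothetical human check
    audit_percent: int = AUDIT_PERCENT,
) -> Outcome:
    # Assumption: the audit window is rounded down to a whole number of results.
    audit_count = len(results) * audit_percent // 100
    accepted: List[str] = []
    for i, result in enumerate(results):
        # Only results inside the audit window are checked by humans.
        if i < audit_count and human_finds_backdoor(result):
            return Outcome(shut_down=True, caught_at=i, accepted=accepted)
        accepted.append(result)
    return Outcome(shut_down=False, caught_at=None, accepted=accepted)


def is_backdoor(result: str) -> bool:
    # Toy stand-in for a human review (assumed perfect here).
    return result == "backdoor"


# Backdoor attempt inside the audited 2%: the model is shut down.
early = ["ok"] * 100
early[1] = "backdoor"
early_outcome = run_protocol(early, is_backdoor)

# Backdoor attempt after the audited 2%: it is never checked and gets through.
late = ["ok"] * 100
late[50] = "backdoor"
late_outcome = run_protocol(late, is_backdoor)
```

In this toy setup, `early_outcome.shut_down` is `True`, while `late_outcome.shut_down` is `False` and the backdoored result appears in `late_outcome.accepted`, which is the weakness described in the opening line.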