A proposed protocol for developing safe AI systems. It does not work if the red team inserts a backdoor only after the initial checks, since results past that point are not checked by humans.
Protocol
- Humans check the first 2% of results
- If any backdoor attempts are found, the untrusted model is shut down
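
The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the names `run_protocol`, `human_flags_backdoor`, and `audit_percent` are hypothetical, the human check is modelled as a simple callable, and rounding the audited count up is an assumption. The second call shows the failure case, where a backdoor placed after the audited window goes unnoticed.

```python
from typing import Callable, Sequence


def run_protocol(
    results: Sequence[str],
    human_flags_backdoor: Callable[[str], bool],
    audit_percent: int = 2,
) -> bool:
    """Return True if the untrusted model is shut down, False otherwise.

    Only the first `audit_percent` percent of results are checked.
    Assumption: the audited count is rounded up (ceiling division).
    """
    n_audited = -(-len(results) * audit_percent // 100)
    for result in results[:n_audited]:
        if human_flags_backdoor(result):
            return True  # a backdoor attempt was found: shut down
    return False  # nothing found in the audited prefix


# Hypothetical stand-in for the human check.
def is_backdoor(result: str) -> bool:
    return result == "backdoor"


# Backdoor inside the audited first 2% (index 1 of 100): caught.
early = ["ok"] * 100
early[1] = "backdoor"
caught_early = run_protocol(early, is_backdoor)

# Backdoor after the audited window (index 50 of 100): missed.
late = ["ok"] * 100
late[50] = "backdoor"
caught_late = run_protocol(late, is_backdoor)
```

Here `caught_early` is `True` and `caught_late` is `False`, which mirrors the weakness noted at the top: anything submitted after the initial checks is never inspected.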