A proposed protocol for safely developing and deploying powerful AI systems: a weaker, trusted AI monitors the outputs of a more powerful, untrusted AI. The protocol fails if the untrusted AI finds ways to trick the weaker monitor.
Protocol
- The trusted AI assigns a suspiciousness rating to every solution produced by the untrusted AI
- Humans audit the most suspicious 2% of solutions
- A suspiciousness threshold is set such that solutions scoring above it are automatically replaced by code generated by the weaker, trusted AI
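The steps above can be sketched in code. This is a minimal illustration, not an implementation from the source: the functions `untrusted_solve`, `trusted_solve`, and `trusted_suspiciousness`, as well as the threshold value, are hypothetical placeholders standing in for real model calls and a tuned cutoff.

```python
import random

AUDIT_FRACTION = 0.02      # humans review the most suspicious 2% of solutions
REPLACE_THRESHOLD = 8.0    # hypothetical cutoff: above this, defer to the trusted AI

def untrusted_solve(task):
    # Placeholder for the powerful, untrusted AI producing a solution
    return f"untrusted solution to {task}"

def trusted_solve(task):
    # Placeholder for the weaker, trusted AI producing a (safer) solution
    return f"trusted solution to {task}"

def trusted_suspiciousness(solution):
    # Placeholder monitor: the trusted AI scores how suspicious a solution looks.
    # A real monitor would inspect the solution; here we return a random score in [0, 10].
    return random.uniform(0, 10)

def run_protocol(tasks):
    scored = []
    for task in tasks:
        solution = untrusted_solve(task)
        score = trusted_suspiciousness(solution)
        # Highly suspicious solutions are automatically replaced
        # by the trusted AI's own solution.
        if score >= REPLACE_THRESHOLD:
            solution = trusted_solve(task)
        scored.append((score, task, solution))

    # The most suspicious fraction of solutions goes to human auditors.
    scored.sort(key=lambda item: -item[0])
    n_audit = max(1, int(AUDIT_FRACTION * len(scored)))
    audit_queue = scored[:n_audit]
    return scored, audit_queue
```

Note that the two mechanisms are independent knobs: the audit fraction bounds how much human labor is spent, while the replacement threshold trades the untrusted AI's higher quality against safety.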