A proposed protocol for safely deploying powerful AI systems. Uses a weaker, trusted AI to monitor the outputs of a stronger, untrusted AI. Fails if the untrusted AI can reliably trick the weaker monitor.

Protocol

  • The trusted (weaker) AI assigns a suspiciousness rating to every solution produced by the untrusted AI
  • Humans audit the most suspicious 2% of solutions
  • Set a suspiciousness threshold: any solution rated above it is automatically discarded and replaced by a solution generated by the trusted, weaker AI
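
The steps above can be sketched in code. This is a toy illustration, not an implementation from the source: the function names (`run_protocol`, `monitor`, etc.), the 0-10 scoring scale, and the default threshold value are all assumptions made for the example.

```python
def run_protocol(tasks, untrusted_solve, trusted_solve, monitor,
                 audit_fraction=0.02, replace_threshold=8.0):
    """Trusted-monitoring sketch: score the untrusted AI's solutions,
    queue the top audit_fraction for human review, and replace any
    solution at or above replace_threshold with trusted output.

    Scale and threshold are hypothetical (suspiciousness in 0-10)."""
    # Untrusted model produces a solution; trusted model scores it.
    scored = []
    for task in tasks:
        solution = untrusted_solve(task)
        scored.append((task, solution, monitor(solution)))

    # Humans audit the most suspicious fraction of solutions.
    by_suspicion = sorted(scored, key=lambda row: row[2], reverse=True)
    audit_count = max(1, round(len(scored) * audit_fraction))
    audit_queue = [task for task, _, _ in by_suspicion[:audit_count]]

    # Solutions above the threshold are swapped for trusted output.
    final = [
        (task, trusted_solve(task) if score >= replace_threshold else solution)
        for task, solution, score in scored
    ]
    return final, audit_queue


# Toy usage: task 7's solution is flagged as suspicious and replaced.
tasks = list(range(100))
untrusted = lambda t: "evil code" if t == 7 else f"solution {t}"
trusted = lambda t: f"safe solution {t}"
monitor = lambda sol: 9.0 if "evil" in sol else 1.0

final, audit_queue = run_protocol(tasks, untrusted, trusted, monitor)
```

Note the two mechanisms are independent: auditing is capped at a fixed human budget (2%), while automatic replacement applies to everything over the threshold, however many solutions that is.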