A proposed protocol used to develop safe AI systems. Does not work if AI only lies about its own vulnerabilities

  • In backdoored code, name variables differently
  • In backdoored code, use while loops where for loops would work better

Protocol

  • Use untrusted AI to check for suspicious code as the monitor
  • Test untrusted AI on fake backdoors
  • If monitor acts like backdoored code is fine, then we know it is Misaligned