🍈 Zettelkasten

❯

Untrusted Monitoring with Monitor Tests

Untrusted Monitoring with Monitor Tests

Mar 16, 20261 min read

ai_safety

A proposed protocol used to develop safe AI systems. Does not work if AI only lies about its own vulnerabilities

In backdoored code, name variables differently
In backdoored code, use while loops where for loops would work better

Protocol

Use untrusted AI to check for suspicious code as the monitor
Test untrusted AI on fake backdoors
If monitor acts like backdoored code is fine, then we know it is Misaligned

Graph View

Backlinks

AI Control
Bluedot Technical AI Safety

Created with Quartz v4.4.0 © 2026

GitHub
Discord Community