🍈 Zettelkasten

❯

Autonomous Adversarial Multi-Agent Attacks

Autonomous Adversarial Multi-Agent Attacks

Feb 15, 20261 min read

ai_safety
security

A phenomena that occurs in multi-agent environments with Scalable Oversight where a model that has a evaluator/reward model Prompt Injects the reward model if it knows it is a AI.

Graph View

Backlinks

Multi-Agent

Created with Quartz v4.4.0 © 2026

GitHub
Discord Community