A paper by EleutherAI on building safeguards into open-source LLMs. It is used to filter out content about chemical, biological, radiological, and nuclear (CBRN) weapons.
Results
- Direct Input Data Filtration works in simple cases
- Tamper resistance trains models directly to resist direct manipulation
- Defense in depth works: short-circuiting prevents harmful responses
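
To make the first result concrete, here is a minimal sketch of what direct input data filtration could look like, assuming a simple keyword blocklist applied to training documents. The blocklist terms, the function names (`build_pattern`, `is_flagged`, `filter_corpus`), and the `min_hits` threshold are all hypothetical illustrations, not the method described in the paper.

```python
import re
from typing import Iterable, List, Optional, Pattern

# Hypothetical placeholder terms (assumption): a real blocklist would be
# curated by domain experts and is not given in these notes.
BLOCKLIST = ["placeholder hazard term", "another flagged phrase"]


def build_pattern(terms: Iterable[str]) -> Optional[Pattern[str]]:
    """Compile one case-insensitive regex matching any blocklisted term."""
    escaped = [re.escape(t) for t in terms if t]
    if not escaped:
        return None  # nothing to filter on
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def is_flagged(doc: str, pattern: Optional[Pattern[str]], min_hits: int = 1) -> bool:
    """Flag a document when it contains at least `min_hits` blocklist matches."""
    if pattern is None:
        return False
    return len(pattern.findall(doc)) >= min_hits


def filter_corpus(docs: Iterable[str],
                  terms: Iterable[str] = BLOCKLIST,
                  min_hits: int = 1) -> List[str]:
    """Return only the documents that are not flagged."""
    pattern = build_pattern(terms)
    return [d for d in docs if not is_flagged(d, pattern, min_hits)]


docs = [
    "A recipe for bread.",
    "Text with a Placeholder Hazard Term inside.",
    "another flagged phrase appears here",
]
kept = filter_corpus(docs)
```

A keyword filter like this is cheap but easy to evade with paraphrase, which is consistent with the note that direct filtration works in simple cases.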
Filtering Pipeline
