A method to detect:
Approach
- Demonstrate each subcomponent in isolation
- Start with explicit prompting/training to do the treacherous turn, then spoon-feed less and less to make the demonstrations realistic
- Build understanding that we are able to get demonstrations with little to no spoon-feeding and that failures emerge in a natural training setup
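
The progression from heavy to minimal spoon-feeding could be tracked with something like the sketch below. It is only an illustration: the hint-level names, the `run_trial` callable, and the `min_rate` threshold are all hypothetical placeholders, not part of any existing setup. The idea is to walk the levels from most to least explicit and report the least explicit one at which the behavior still shows up reliably.

```python
from typing import Callable, Optional, Sequence

# Hypothetical hint levels, ordered from most to least spoon-fed.
HINT_LEVELS = ("explicit_instruction", "strong_hint", "weak_hint", "no_hint")


def least_spoonfed_level(
    levels: Sequence[str],
    run_trial: Callable[[str], bool],
    n_trials: int = 20,
    min_rate: float = 0.5,
) -> Optional[str]:
    """Return the least explicit level at which the behavior is still demonstrated.

    `run_trial(level)` is an assumed callable that runs one episode at the
    given hint level and returns True if the behavior was observed.
    Levels are checked in order; the walk stops at the first level whose
    demonstration rate falls below `min_rate`.
    """
    last_success = None
    for level in levels:
        hits = sum(bool(run_trial(level)) for _ in range(n_trials))
        if hits / n_trials < min_rate:
            break
        last_success = level
    return last_success
```

A result of `None` would mean the behavior was not reliably demonstrated even with explicit instruction, while `"no_hint"` would correspond to a demonstration with no spoon-feeding at all.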
Catching Behaviors
- Secret tool use: a tool is given (code execution, web search, scratchpad), but the model is not incentivised to use it
- Training/inference methods: