Why are you interested in participating in fellowships program?

I think the work Anthropic is doing is incredibly important - especially Anthropic’s work with mechanistic interpretability, neuronpedia and attribution graphs. I would like to join the fellowship program to be able to contribute important research in AI security and alignment. I would preferably like to work on model guardrails against prompt injection techniques. Perhaps tracing internal model behavior through probes or finding activations against specific tokens like transluce’s monitor (https://monitor.transluce.org/dashboard/chat). I would also like to work on agentic AI for threat hunting - building POCs for autonomous pentesting agents like XBOW.

Please tell us briefly about an area of technical security work youre excited about and why

AI alignment through interpretability or XAI(model agnostic) methods. The field is pre-paradigm so all the research is unique and is less technically demanding than most other fields of AI research (EleutherAI and other discord spaces can contribute meaningful progress: https://www.lesswrong.com/posts/9eehTtLsTBZR9Bd7Q/on-open-science-research-labs-on-discord-and-getting-more) I have written my notes on learning AI security work here: https://zettelkasten.yoshixi.net/Security/Mechanistic-Interpretability