This is a research agenda within AI safety that involves reverse-engineering models to find the internal circuits that determine which algorithm a neural net has learned to implement: model interpretability by analyzing weights and activations directly. Not to be confused with Explainable Machine Learning, which explains behaviour post hoc without opening up the model's internals.
Concepts
- SAE (Sparse Autoencoder; sketch below)
- Activation Patching (sketch below)
- Max Activating Dataset
- Linear Representation
- Neuron Superposition
- Probe (sketch below)
- Difference in Means (sketch below)
- Attribution Graph
- Steering Vector (sketch below)
- DLA (Direct Logit Attribution; sketch below)
- Causal Intervention
- Prefill Attack
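Sketches of a few of the concepts above. These are illustrative, not canonical implementations; hyperparameters (`d_model`, `d_hidden`, `l1_coeff`) and the toy training step are placeholders. First, a minimal SAE: a ReLU encoder into an overcomplete feature basis, a linear decoder, and an L1 sparsity penalty on the features.

```python
# Minimal sketch of a Sparse Autoencoder (SAE) over cached activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        # Encode into an overcomplete, non-negative feature basis
        features = torch.relu(self.encoder(acts))
        # Reconstruct the original activation from the sparse features
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    # Reconstruction error + L1 penalty pushing features toward sparsity
    mse = (recon - acts).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy training step; random data stands in for a real activation dataset
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 768)          # batch of cached model activations
recon, features = sae(acts)
loss = sae_loss(recon, acts, features)
opt.zero_grad()
loss.backward()
opt.step()
```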
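Activation patching, sketched with plain PyTorch forward hooks. Assumes a GPT-2-style Hugging Face model where `model.transformer.h[layer]` is the layer-th block; the prompts, the layer choice, and patching only the final token position are illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is in the city of", return_tensors="pt")
layer = 6           # illustrative layer choice
cache = {}

def save_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are the first element
    cache["resid"] = output[0].detach()

def patch_hook(module, inputs, output):
    # Overwrite the final-position hidden state with the clean run's value
    hs = output[0].clone()
    hs[:, -1, :] = cache["resid"][:, -1, :]
    return (hs,) + output[1:]

# 1) Clean run: record the block output
handle = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Corrupted run with the clean activation patched in
handle = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits
handle.remove()

# If patching moves the prediction back toward " Paris", the activation at
# this layer/position carries the city information
print(tok.decode(patched_logits[0, -1].argmax().item()))
```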
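A linear probe, sketched as logistic regression on cached activations to test whether a concept is linearly decodable from a given layer. Random arrays stand in for activations (gathered with the hook pattern above) and for the concept labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: [n_examples, d_model] cached activations, y: binary concept labels.
# Random data here is a placeholder for real cached activations.
X = np.random.randn(200, 768)
y = np.random.randint(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# probe.coef_[0] is the learned direction; compare it with a diff-in-means direction
```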
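A difference-in-means direction used as a steering vector, under the same GPT-2 / forward-hook assumptions as the patching sketch. The prompt sets, layer, and steering scale are illustrative placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer = 6   # illustrative layer choice

positive = ["I love this", "What a wonderful day", "This is great"]
negative = ["I hate this", "What a terrible day", "This is awful"]

def mean_last_token_act(prompts):
    # Mean residual-stream activation at the final token over a prompt set
    acts, grabbed = [], {}
    def grab(module, inputs, output):
        grabbed["hs"] = output[0][:, -1, :].detach()
    for p in prompts:
        handle = model.transformer.h[layer].register_forward_hook(grab)
        with torch.no_grad():
            model(**tok(p, return_tensors="pt"))
        handle.remove()
        acts.append(grabbed["hs"])
    return torch.cat(acts).mean(dim=0)

# Difference-in-means direction between the two concepts
direction = mean_last_token_act(positive) - mean_last_token_act(negative)

def steer(module, inputs, output, scale=4.0):
    # Add the direction to every position's residual stream at this layer
    return (output[0] + scale * direction,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
with torch.no_grad():
    out = model.generate(**tok("The movie was", return_tensors="pt"),
                         max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```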
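A coarse, per-block version of Direct Logit Attribution (DLA): project each block's contribution to the final-position residual stream onto the unembedding row of a target token. The final LayerNorm is ignored for simplicity; proper DLA folds it in and splits blocks into attention heads and MLPs.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

answer = tok(" Paris")["input_ids"][0]     # single-token answer
W_U = model.lm_head.weight                  # [vocab, d_model] unembedding matrix

# hidden_states[i] is the residual stream entering block i; the last entry already
# has the final LayerNorm applied, so the last block is skipped here for simplicity.
for layer in range(model.config.n_layer - 1):
    delta = out.hidden_states[layer + 1][0, -1] - out.hidden_states[layer][0, -1]
    contribution = delta @ W_U[answer]      # this block's direct push toward " Paris"
    print(f"block {layer:2d}: {contribution.item():+.3f}")
```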
Tools
Papers
Context Samples
https://drive.google.com/drive/u/0/folders/1GfrgKJwndk-twnJ8K7Ba-TE9i_8wBWAU