An autoencoder trained for:
- Feature learning
- Dimensionality reduction
Can be used to extract features from a model.
Can be used to reduce polysemanticity.
Objective Function
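One common formulation, given here as an assumption rather than as the specific objective these notes had in mind, combines a reconstruction error with an L1 sparsity penalty on the encoded features. The symbols (x for the layer activations, f(x) for the encoded features, x-hat for the reconstruction, lambda for the sparsity weight, and the encoder/decoder weights and biases) are illustrative names, not taken from the notes.

```latex
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \, \lVert f(x) \rVert_1,
\qquad
f(x) = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}}),
\qquad
\hat{x} = W_{\mathrm{dec}} f(x) + b_{\mathrm{dec}}
```

The first term keeps the reconstruction close to the original activations. The second pushes most features toward zero, so that only a few are active for any given input.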
Usage
- An SAE is itself a neural network that disentangles other neural networks
- An SAE with 8192 neurons can be used to interpret an LLM with just 512 neurons
- Has been used to interpret Claude 3 Sonnet
- Can be used to find unexpected base64 encodings in small models
- Can be used to find hidden goals
- Used to create Golden Gate Claude
Details
- Encodes a layer's activations into a larger, sparse encoding
- Decodes that encoding back into the original layer's activations
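
The encode and decode steps above can be sketched as follows. This is a minimal, untrained illustration in plain Python. The function names (`make_sae`, `encode`, `decode`), the ReLU activation, the random initialization, and the toy sizes (4 inputs, 16 features) are all assumptions chosen for readability, not details from these notes.

```python
import random


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def make_sae(d_model, d_sae, seed=0):
    # Hypothetical helper: small random weights, zero biases (untrained).
    rng = random.Random(seed)
    W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_sae)]
    b_enc = [0.0] * d_sae
    W_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_sae)] for _ in range(d_model)]
    b_dec = [0.0] * d_model
    return W_enc, b_enc, W_dec, b_dec


def encode(x, W_enc, b_enc):
    # Project activations into the larger feature space, then apply ReLU
    # so that every feature is non-negative.
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]


def decode(f, W_dec, b_dec):
    # Map the features back to the size of the original layer.
    return [r + b for r, b in zip(matvec(W_dec, f), b_dec)]


d_model, d_sae = 4, 16  # toy sizes; the encoding is larger than the layer
W_enc, b_enc, W_dec, b_dec = make_sae(d_model, d_sae)

x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one layer's activations
features = encode(x, W_enc, b_enc)
x_hat = decode(features, W_dec, b_dec)

print(len(x), len(features), len(x_hat))  # prints: 4 16 4
```

Because the weights here are random, `x_hat` will not match `x`. Training would adjust the weights so that the reconstruction is close while few features are active.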