Sparse Autoencoder

Mar 10, 2026

machine_learning

An Autoencoder trained for:

Feature learning
Dimensionality reduction

Can be used to extract features from a model.
Can be used to remove Polysemanticity.

Objective Function

$L = \lVert X - \hat{X} \rVert + \lambda \cdot \mathrm{Penalty}(s)$
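
As a rough illustration of the objective above, the sketch below evaluates the loss for a single input vector. The note does not say which norm or which penalty is meant, so this assumes a Euclidean norm for the reconstruction term and an L1 norm on the hidden activations $s$ for the penalty. The names `sae_loss` and `lam` are hypothetical.

```python
import math

def sae_loss(x, x_hat, s, lam=0.01):
    """Reconstruction error plus a weighted sparsity penalty.

    Assumptions (not stated in the note): the norm is Euclidean (L2)
    and Penalty(s) is the L1 norm of the hidden activations s.
    """
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    # ||X - X_hat||: distance between the input and its reconstruction
    reconstruction = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    # Penalty(s): grows with every non-zero activation, so sparse codes are cheaper
    penalty = sum(abs(v) for v in s)
    return reconstruction + lam * penalty

x = [1.0, 2.0, 2.0]
x_hat = [1.0, 2.0, 0.0]   # reconstruction misses the last component by 2
s = [0.0, 3.0, 0.0, 1.0]  # mostly-zero (sparse) hidden activations
loss = sae_loss(x, x_hat, s, lam=0.5)
print(loss)  # 4.0 = 2.0 (reconstruction) + 0.5 * 4.0 (penalty)
```

A larger λ trades reconstruction accuracy for sparser activations. A smaller λ does the opposite.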

Usage

An SAE is itself a neural network that disentangles other neural networks
An SAE with 8192 neurons can be used to interpret an LLM with just 512 neurons
Has been used to interpret Claude 3 Sonnet
Can be used to find unexpected base64 encodings in small models
Can be used to find hidden goals
Used to create Golden Gate Claude

Details

Encodes the input activations into a larger, sparse encoding
Decodes that encoding back into the original layer's activations
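
The encode and decode steps above can be sketched as a forward pass through a tiny, untrained SAE. This is a minimal illustration and not any particular published implementation. It assumes a single linear encoder followed by ReLU and a single linear decoder. The class `TinySAE`, its attribute names, and the random initialization are all hypothetical. The sizes are scaled down to 4 inputs and 16 hidden units, and the same shape logic applies to 512 and 8192.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

class TinySAE:
    """Hypothetical, untrained SAE: d_in -> d_hidden -> d_in, with d_hidden > d_in."""

    def __init__(self, d_in, d_hidden, seed=0):
        rng = random.Random(seed)
        # Encoder maps the layer's activations up into the larger encoding.
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_in)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        # Decoder maps the encoding back down to the original layer size.
        self.W_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        pre = matvec(self.W_enc, x)
        # ReLU (an assumption here) keeps activations non-negative.
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, s):
        out = matvec(self.W_dec, s)
        return [o + b for o, b in zip(out, self.b_dec)]

sae = TinySAE(d_in=4, d_hidden=16)
x = [0.5, -1.0, 0.25, 2.0]   # stand-in for one layer's activations
s = sae.encode(x)            # larger encoding
x_hat = sae.decode(s)        # reconstruction of the original layer
print(len(x), len(s), len(x_hat))  # 4 16 4
```

Without training, the encoding is not meaningfully sparse and the reconstruction is poor. Training against the objective function is what pushes most entries of `s` toward zero.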


