These are tools used to check whether a specific piece of information (a "feature") is encoded in a model's internal representations. Involves training a Linear Classifier on model activations to test whether a feature is linearly decodable at a given layer.
Commonly used in Natural Language Processing and Mechanistic Interpretability.
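A minimal sketch of the idea: fit a logistic-regression probe on frozen activations and check held-out accuracy. The activations here are synthetic stand-ins (a hypothetical "feature direction" added to random vectors); in practice they would be extracted from a real model at a chosen layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for layer activations: 1000 examples, 64-dim hidden
# states with a known linear feature direction baked in for illustration.
n, d = 1000, 64
direction = rng.normal(size=d)               # assumed feature direction
labels = rng.integers(0, 2, size=n)          # binary labels for the feature
acts = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, direction)

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

# The linear probe: a classifier trained on activations, not on the model.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy suggests the feature is linearly decodable here.
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

Comparing probe accuracy across layers is the usual way to locate where a feature becomes decodable.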
- Should not be relied on when representations are not Linearly Separable (e.g. features mixed across a Residual Connection); a linear probe will miss a feature that is encoded non-linearly
Types
Downsides
- Probes tell us the Prevalence of Condition (whether a feature is decodable), but not the causal effect that feature has on the model's output
- A probe's capacity is independent of the model's: a probe that is too simple may miss features that are present, while one that is too expressive may decode features the model never actually uses
- Probes have a hard time detecting the Emergent Ability of LLMs, since the feature being probed for must be specified in advance
- Probes cannot detect capabilities beyond what humans can identify and label