These are tools used to check whether a model internally represents a specific feature or concept. Involves training a Linear Classifier on model activations to test whether a feature is linearly decodable at a given layer.

Commonly used in Natural Language Processing and Mechanistic Interpretability.
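The idea can be sketched as follows. This is a minimal, hypothetical example: the "activations" are fabricated with a known linear feature direction (a stand-in for activations cached from a real model layer), and the probe is a plain logistic-regression classifier trained with gradient descent.

```python
import numpy as np

# Fabricated stand-in for cached layer activations: a binary concept is
# linearly encoded along a fixed direction, plus Gaussian noise.
rng = np.random.default_rng(0)
d_model, n = 64, 512
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=n)                        # binary concept labels
acts = rng.normal(size=(n, d_model)) + np.outer(2 * y - 1, direction) * 2.0

# Linear probe: logistic regression trained by gradient descent.
w, b, lr = np.zeros(d_model), 0.0, 0.1
for _ in range(200):
    p = 1 / (1 + np.exp(-(acts @ w + b)))             # predicted P(y=1)
    w -= lr * (acts.T @ (p - y) / n)
    b -= lr * (p - y).mean()

# High probe accuracy suggests the feature is linearly decodable here.
acc = ((acts @ w + b > 0).astype(int) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

With real models, `acts` would instead come from a forward pass at the layer of interest, and accuracy would be measured on held-out examples to avoid the probe memorizing the training set.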

Types

Downsides

  • Probes tell us the Prevalence of Condition, but not the effect that condition has on the model's output
  • A probe's capabilities are independent of the model's: the probe could be too simple to decode a feature the model uses, or so expressive that it computes the feature itself rather than reading it out
  • Probes have a hard time detecting Emergent Ability of LLMs
  • Probes cannot detect capabilities beyond what humans can identify and label