We analyze two things:

  • Capabilities (upper bounds on what the model can do)
  • Propensities (the model's behavioral tendencies)

Capability Evaluations

Propensity Evaluations

Issues

  • Differences in formatting can lead to very different results
  • Restructuring an evaluation as multiple choice can lead to different results
  • Results depend heavily on the metric used (e.g. accuracy vs. log-likelihood)
  • Techniques available at the time bias estimates of the LM's true abilities
  • The Elo ranking system used does not ensure reliable or transitive rankings
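
To make the last issue concrete, here is a minimal sketch of sequential Elo updates on three hypothetical models, A, B and C, whose pairwise outcomes form a cycle. The K-factor of 32, the starting rating of 1000 and the match list are illustrative assumptions, not values from any particular leaderboard. The point is only that the resulting ranking changes with the order in which the same matches are processed, and that no single ordering can represent a non-transitive set of outcomes.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k=32.0, start=1000.0):
    """Apply sequential Elo updates. Each match is a (winner, loser) pair.

    k and start are assumed, illustrative defaults.
    """
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain   # winner moves up
        ratings[loser] = r_l - gain    # loser moves down by the same amount
    return ratings


def ranking(ratings):
    """Model names sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical cyclic outcomes: A beats B, B beats C, C beats A.
matches = [("A", "B"), ("B", "C"), ("C", "A")]

forward = run_elo(matches)
backward = run_elo(list(reversed(matches)))

print("forward order: ", ranking(forward))
print("backward order:", ranking(backward))
```

Both runs see exactly the same three results, yet the two rankings disagree, which is one way the lack of reliability and transitivity shows up in practice.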