We analyze two things:
- Capabilities (upper bounds on what the model can do)
- Propensities (the model's behavioral tendencies)
Capability Evaluations
Propensity Evaluations
Issues
- Differences in formatting lead to very different results
- Restructuring an evaluation as multiple choice can lead to different results
- Results depend heavily on the metric used (e.g. accuracy vs. log-likelihood)
- The techniques available at the time of evaluation bias estimates of the LM's true abilities
- The Elo rating system used does not yield reliable or transitive rankings
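
The last issue can be illustrated with a small sketch. It assumes the standard sequential Elo update (base 10, scale 400) with a hypothetical K-factor of 32 and starting rating of 1000, and three made-up models A, B, C; none of these names or values come from the notes. The same set of pairwise results is processed in two different orders, and the final ratings differ. The results also form a cycle (A beats B, B beats C, C beats A), which a single scalar rating cannot represent.

```python
def expected_score(r_a, r_b):
    """Elo-predicted probability that the player rated r_a beats r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k=32.0, start=1000.0):
    """Apply sequential Elo updates to a list of (winner, loser) pairs.

    k and start are illustrative defaults, not values from the notes.
    """
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        # Points transferred shrink as the win becomes more expected.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical pairwise outcomes containing a cycle: A > B > C > A.
matches = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "B")]

forward = run_elo(matches)
backward = run_elo(list(reversed(matches)))

for name in sorted(forward):
    print(name, round(forward[name], 2), round(backward[name], 2))
```

Both runs see identical outcomes, so any gap between the two columns comes purely from processing order. Averaging over many shuffled orders is one common way to reduce this effect, though it does not resolve the non-transitivity.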