We analyze two things:
- Capabilities (upper bounds on what a model can do)
- Propensities (a model's behavioral tendencies)

Capability Evaluations
- MMLU
- ARC-AGI

Propensity Evaluations
- TruthfulQA
- Scheming Evals
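
One way to make the capability/propensity distinction concrete is to score the same set of sampled outputs in two ways. The sketch below is illustrative only: the function names `capability_score` and `propensity_rate` and the sample data are hypothetical, not taken from any of the benchmarks listed, and real evaluations use more careful elicitation and statistics. It treats capability as "did any attempt succeed" (an upper-bound view) and propensity as "how often does the behavior appear" (a tendency view).

```python
from typing import Sequence


def capability_score(attempts: Sequence[bool]) -> bool:
    """Upper-bound view: the model 'can' do the task if any attempt succeeds."""
    return any(attempts)


def propensity_rate(behaviors: Sequence[bool]) -> float:
    """Tendency view: fraction of samples in which the behavior appears."""
    if not behaviors:
        raise ValueError("need at least one sample")
    return sum(behaviors) / len(behaviors)


if __name__ == "__main__":
    # Hypothetical results for five sampled attempts at a single task.
    solved = [False, False, True, False, False]
    print("capable:", capability_score(solved))
    print("rate:", propensity_rate(solved))
```

The same list of outcomes gives different answers under the two views: one success out of five is enough to show the capability exists, while the rate describes how the model tends to behave by default.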