The loss function can be approximated as:

  • With N as the model size
  • With D as the training tokens
  • With C as the computation cost
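
The formulas themselves are not reproduced above, so the following is only a minimal sketch under a stated assumption: that each approximation is a single-variable power law of the form L(x) = (x_c / x)^alpha, a shape commonly used in the scaling-law literature, where x stands for the model size, the training tokens, or the computation cost. The function name `power_law_loss`, the symbols `x_c` and `alpha`, and the constants below are hypothetical placeholders chosen for illustration, not fitted values.

```python
def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Loss under an ASSUMED power law L(x) = (x_c / x) ** alpha.

    x     : the scaled resource (model size, training tokens, or compute)
    x_c   : a reference scale for that resource (hypothetical constant)
    alpha : the power-law exponent (hypothetical constant)
    """
    if x <= 0 or x_c <= 0:
        raise ValueError("x and x_c must be positive")
    return (x_c / x) ** alpha


# Hypothetical constants for the model-size case; not taken from the source.
N_C = 1.0e14
ALPHA_N = 0.08

loss_small = power_law_loss(1.0e8, N_C, ALPHA_N)  # smaller model
loss_large = power_law_loss(1.0e9, N_C, ALPHA_N)  # 10x larger model

# Under this assumed form, a 10x increase in x scales the loss by 10 ** -alpha,
# independent of the reference scale x_c.
ratio = loss_large / loss_small
print(f"loss ratio for 10x model size: {ratio:.4f}")
```

The same function can be reused for the training-token and computation-cost cases by swapping in a different reference scale and exponent, which is why the sketch keeps the resource generic rather than hard-coding one variable.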