The loss function can be approximated as:

$$L(N, D) = 406.4\,N^{-0.34} + 410.7\,D^{-0.28} + 1.69$$

where:

N is the model size
D is the number of training tokens
C is the computation cost
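As a rough illustration, the approximation above can be evaluated directly. This is only a sketch: the function name `approx_loss` and the example inputs are hypothetical, and it assumes N is a parameter count and D is a token count, both given as plain positive numbers.

```python
def approx_loss(n_params: float, n_tokens: float) -> float:
    """Evaluate L(N, D) = 406.4 * N**-0.34 + 410.7 * D**-0.28 + 1.69.

    Assumption: n_params is N (model size) and n_tokens is D (training tokens).
    """
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    model_term = 406.4 * n_params ** -0.34   # shrinks as the model grows
    data_term = 410.7 * n_tokens ** -0.28    # shrinks as training data grows
    irreducible = 1.69                       # constant floor of the approximation
    return model_term + data_term + irreducible


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the text.
    for n, d in [(1e9, 2e10), (1e10, 2e11), (7e10, 1.4e12)]:
        print(f"N={n:.1e}, D={d:.1e} -> L={approx_loss(n, d):.4f}")
```

Since both power-law terms are positive and decrease as N or D grows, the predicted loss falls toward the constant 1.69 but never reaches it.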