Great work! I have a question out of curiosity: what is the advantage of splitting the model across two GPUs during training? (You mentioned using Tensor Parallelism (TP) of 2 and Data Parallelism (DP) of 4 on each 8-GPU node.) I'm guessing the model can fit on a single GPU given that it is small? In that case I would have thought the most efficient setup would be plain DP=8. For concreteness, the two layouts I'm comparing are sketched below.
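
Just to make sure I'm reading the setup right, here is a minimal sketch of the two process-group layouts I have in mind, using PyTorch's `init_device_mesh` (the mesh shapes are my assumption from the TP=2 / DP=4 description in the post, not your actual code):

```python
# Minimal sketch, assuming PyTorch >= 2.2 and launch via torchrun with 8 processes
# on a single 8-GPU node (e.g. `torchrun --nproc_per_node=8 this_script.py`).
from torch.distributed.device_mesh import init_device_mesh

# Layout described in the post: 2-way tensor parallel x 4-way data parallel (4 * 2 = 8 GPUs).
mesh_tp2_dp4 = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

# Layout I would have expected if the model fits on one GPU: pure 8-way data parallel.
mesh_dp8 = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))
```

Is the TP=2 split mainly about activation/optimizer-state memory headroom, or does it actually come out faster than DP=8 in your measurements?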