Thank you so much! Training on one machine works fine for me, but I am not sure about the configuration and how to train on two machines.
We just use torch.distributed.launch to start the training processes; see tools/dist_train.sh for details. If you want to train on two machines, you need to follow the "Multi-Node multi-process distributed training" part of the PyTorch documentation.
Just a reminder: if the nodes are not connected by high-speed interconnects, training will be slow.
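As a rough sketch of what the "Multi-Node multi-process distributed training" setup looks like, the commands below launch a 2-node job with torch.distributed.launch. The master address, port, GPU count, config path, and the `--launcher pytorch` flag are assumptions for illustration; adapt them to your cluster and to the arguments that tools/dist_train.sh actually passes to the training script.

```shell
# Run on node 0 (the master). 192.168.1.1 and the config path are placeholders.
python -m torch.distributed.launch \
    --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr="192.168.1.1" --master_port=29500 \
    tools/train.py configs/your_config.py --launcher pytorch

# Run on node 1, with the same master address/port but node_rank=1.
python -m torch.distributed.launch \
    --nnodes=2 --node_rank=1 --nproc_per_node=8 \
    --master_addr="192.168.1.1" --master_port=29500 \
    tools/train.py configs/your_config.py --launcher pytorch
```

Both commands must be started on their respective machines; the master address must be reachable from every node, and `nnodes * nproc_per_node` determines the total world size.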