I would like to use the training script at references/detection. Single-GPU training works fine (both on master and the v0.3.0 tag). My server has 4 GPUs and I would really like to use them; as far as I understand, I have to set up the torch.distributed package (which I'm not familiar with). First I tried:
RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 python train.py
which returns:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
but the tutorial I was following states:
MASTER_ADDR - required (except for rank 0); address of rank 0 node
i.e. it should not be required for rank 0. In fact, when I run:
RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 MASTER_ADDR=localhost MASTER_PORT=12345 python train.py
the process just hangs.
How do I train Mask R-CNN in a multi-GPU environment? Is that possible? Is there any example/guide/tutorial showing the correct settings?
Here is the command you should use:
python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py
You can find more information about the torch.distributed.launch utility at https://pytorch.org/docs/stable/distributed.html#launch-utility
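For context on why your manual attempt hung: with the env:// rendezvous, init_process_group blocks until WORLD_SIZE processes have joined, so starting a single process with WORLD_SIZE=4 waits forever for the other three. The launcher simply spawns one process per GPU and fills in the rendezvous variables for each of them. A rough manual equivalent (a hypothetical sketch for illustration only, assuming train.py reads RANK/WORLD_SIZE/LOCAL_RANK from the environment, as it does when launched with --use_env) would be:
# one process per GPU, all sharing the same WORLD_SIZE and rendezvous address/port
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=0 LOCAL_RANK=0 python train.py &
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=1 LOCAL_RANK=1 python train.py &
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=2 LOCAL_RANK=2 python train.py &
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=3 LOCAL_RANK=3 python train.py &
In practice you should just use the launcher command above, which handles this bookkeeping for you.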
But this is a good point, I should add further information on how to launch multi-GPU jobs in a new README in each folder. A PR adding some basic instructions would be awesome.
Let me know if you have further questions.