I would like to use the training script at references/detection. Single-GPU training works fine (both on master and the v0.3.0 tag). My server has 4 GPUs and I would really like to use them; as far as I understand, I have to set up the torch.distributed package (which I'm not familiar with). First I tried:
RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 python train.py
which returns:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
but the tutorial I was following states:
MASTER_ADDR - required (except for rank 0); address of rank 0 node
i.e. it should not be required for rank 0. In fact, when I run:
RANK=0 WORLD_SIZE=4 LOCAL_RANK=0 MASTER_ADDR=localhost MASTER_PORT=12345 python train.py
the process just hangs.
How do I train Mask R-CNN in a multi-GPU environment? Is that possible? Is there any example/guide/tutorial showing the correct settings?
Here is the command you should use:
python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py
You can find more information about the torch.distributed.launch utility at https://pytorch.org/docs/stable/distributed.html#launch-utility
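For context on why your manual attempt hung: with the env:// rendezvous, init_process_group blocks until WORLD_SIZE processes have joined, so starting a single process with WORLD_SIZE=4 waits forever for the other three. The launcher simply spawns one process per GPU and fills in the rendezvous variables for each of them. A rough manual equivalent (a hypothetical sketch for illustration only, assuming train.py reads RANK/WORLD_SIZE/LOCAL_RANK from the environment, as it does when launched with --use_env) would be:
# one process per GPU, all sharing the same WORLD_SIZE and rendezvous address/port
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=0 LOCAL_RANK=0 python train.py &
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=1 LOCAL_RANK=1 python train.py &
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=2 LOCAL_RANK=2 python train.py &
MASTER_ADDR=localhost MASTER_PORT=12345 WORLD_SIZE=4 RANK=3 LOCAL_RANK=3 python train.py &
In practice you should just use the launcher command above, which handles this bookkeeping for you.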
But this is a good point, I should add further information on how to launch multi-GPU jobs in a new README in each folder. A PR adding some basic instructions would be awesome.
Let me know if you have further questions.