Maskrcnn-benchmark: how to solve this bug?

Created on 10 Nov 2018 · 3 comments · Source: facebookresearch/maskrcnn-benchmark

🐛 Bug

CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
Traceback (most recent call last):
File "tools/train_net.py", line 170, in
main()
File "tools/train_net.py", line 139, in main
backend="nccl", init_method="env://"
File "/home/zhaoqijie/anaconda3/lib/python3.6/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
group_name, rank)
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:20

Label: question


All 3 comments

@qijiezhao
Multi-GPU training uses the distributed package for synchronization. The distributed package needs a port, and in your case another program is already occupying it. From your command it looks like you are running on two GPUs of an 8-GPU server, so most likely another of your programs, or another user, is using the same port.
Solution: you can specify the port to use and avoid the conflict. See https://pytorch.org/docs/stable/distributed.html
```
python -m torch.distributed.launch --master_port=FREE_PORT_NUMBER
```
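For example, combining the original command with an explicitly chosen port (29501 below is just an assumed free port; pick any port that is not already in use on your machine):

```
# Launch on GPUs 6 and 7 with a non-default master port to avoid the conflict
CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=2 \
    --master_port=29501 tools/train_net.py \
    --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
```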

The solution provided by @chengyangfu is the right one.
I'm closing the issue, but let us know if you still have issues.

In my case: run nvidia-smi to check which processes are occupying the GPUs, and kill any dead ones.
Also unset WORLD_SIZE and RANK, just in case.
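A minimal sketch of that cleanup, assuming the stale process PID reported by nvidia-smi is 12345 (a placeholder):

```
# List the processes currently holding GPU memory
nvidia-smi

# Kill a dead/stale training process by its PID (12345 is a placeholder)
kill -9 12345

# Clear leftover distributed environment variables, just in case
unset WORLD_SIZE
unset RANK
```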
