Maskrcnn-benchmark: how to solve this bug?

Created on 10 Nov 2018 · 3 comments · Source: facebookresearch/maskrcnn-benchmark

🐛 Bug

CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
Traceback (most recent call last):
File "tools/train_net.py", line 170, in
main()
File "tools/train_net.py", line 139, in main
backend="nccl", init_method="env://"
File "/home/zhaoqijie/anaconda3/lib/python3.6/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
group_name, rank)
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:20

Label: question


All 3 comments

@qijiezhao
Multi-GPU training uses the distributed package for synchronization. The distributed package needs a port, and in your case another program is already occupying it. From your command it looks like you are running on two GPUs of an 8-GPU server, so most likely another of your programs, or another user, is using the same port.
Solution: you can specify the port to use and avoid the conflict. See https://pytorch.org/docs/stable/distributed.html
```
python -m torch.distributed.launch --master_port=FREE_PORT_NUMBER
```
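For example, combining the original command with an explicitly chosen port (29501 below is just an assumed free port; pick any port that is not already in use on your machine):

```
# Launch on GPUs 6 and 7 with a non-default master port to avoid the conflict
CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=2 \
    --master_port=29501 tools/train_net.py \
    --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
```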

The solution provided by @chengyangfu is the right one.
I'm closing the issue, but let us know if you still have issues.

In my case: run nvidia-smi to check which processes are occupying the GPUs, and kill any dead ones.
Also unset WORLD_SIZE and RANK, just in case.
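A minimal sketch of that cleanup, assuming the stale process PID reported by nvidia-smi is 12345 (a placeholder):

```
# List the processes currently holding GPU memory
nvidia-smi

# Kill a dead/stale training process by its PID (12345 is a placeholder)
kill -9 12345

# Clear leftover distributed environment variables, just in case
unset WORLD_SIZE
unset RANK
```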
