```
CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 139, in main
    backend="nccl", init_method="env://"
  File "/home/zhaoqijie/anaconda3/lib/python3.6/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
    group_name, rank)
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:20
```
Please copy and paste the output from the PyTorch environment collection script (or fill out the checklist below manually). You can get the script and run it with:
```
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
@qijiezhao
Because multi-GPU training uses the distributed package for synchronization, the distributed package needs a port, and in your case another program is already occupying it. From your description, it seems you are running on two GPUs of an 8-GPU server, so your other programs or other users are probably using the same port.
Solution: you can specify the port you want to use to avoid the conflict. See https://pytorch.org/docs/stable/distributed.html
```
python -m torch.distributed.launch --master_port=FREE_PORT_NUMBER
```
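For example, a sketch of how that might look here, assuming the launcher's default port (29500) is the one in conflict; the `ss` check and port 29501 below are only illustrative, any free port works:
```
# Check whether the default rendezvous port (29500) already has a listener;
# if it does, pick a different one.
ss -tlnp | grep 29500

# Relaunch the training job with an explicit, unused master port
# (29501 is a placeholder).
CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=$NGPUS \
    --master_port=29501 tools/train_net.py \
    --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
```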
The solution provided by @chengyangfu is the right one.
I'm closing the issue, but let us know if you still have issues.
In my case:
Run nvidia-smi to check which processes are occupying the GPUs; if any of them are dead, kill them.
Also unset WORLD_SIZE and RANK, just in case.
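A minimal sketch of that cleanup (the PID 12345 below is a placeholder for whatever nvidia-smi actually reports):
```
# List the processes currently holding GPU memory.
nvidia-smi

# Kill a dead or orphaned process by the PID shown in the nvidia-smi output
# (12345 is a placeholder).
kill -9 12345

# Clear stale distributed-training environment variables left by a previous run.
unset WORLD_SIZE
unset RANK
```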