Maskrcnn-benchmark: Cannot run multiple training processes simultaneously on same GPU server?

Created on 3 Dec 2018 · 9 comments · Source: facebookresearch/maskrcnn-benchmark

If I try to run two training processes, each using 4 GPUs, simultaneously on an 8-GPU server, I get this error:

RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch-nightly_1543051141017/work/torch/lib/THD/process_group/General.cpp:20

I don't know how to solve it.

question

Most helpful comment

Can you try specifying a different master_addr and master_port in torch.distributed.launch?

CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_addr 127.0.0.2 --master_port 29501 tools/train_net.py 

All 9 comments

Hi @xuw080
I had the same problem before, and I solved it by setting CUDA_VISIBLE_DEVICES:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch ...
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch ...

You could check https://github.com/facebookresearch/maskrcnn-benchmark/issues/80 for more detail.

Thank you for your reply. In fact, this is exactly what I did to launch the two multi-GPU processes.

I use this script to launch the first one:
NGPUS=4
GPU_ID=0,1,2,3
CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py \
--config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"

and this one for the second:

NGPUS=4
GPU_ID=4,5,6,7
CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py \
--config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"

When I launch the second one, this error appears:

Traceback (most recent call last):
  File "tools/train_net.py", line 197, in <module>
    main()
  File "tools/train_net.py", line 173, in main
    backend="nccl", init_method="env://"
  File "/home/Xwang/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
    group_name, rank)
RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch-nightly_1543051141017/work/torch/lib/THD/process_group/General.cpp:20

It seems the second process tries to use the same address for launching its process group.

Does anyone know how to solve this? Thank you so much!

Can you try specifying a different master_addr and master_port in torch.distributed.launch?

CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_addr 127.0.0.2 --master_port 29501 tools/train_net.py 
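For example, the two four-GPU jobs could then run side by side with something like this (the port numbers are only examples; 29500 is the launcher's default, and any two distinct free ports will do):

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port 29500 tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"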

Works now, thank you so much!

I am having a similar issue, although instead of Address already in use the error says public_address was not set in config.

I have looked around a bit but am unable to solve this. Any idea what could be wrong/what else to look into?

If I run

python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.2 --master_port 29501 relational_rxn_graphs/detector/train.py --config-file configs/detector/e2e_faster_rcnn_R_101_FPN_1x.yaml

where relational_rxn_graphs/detector/train.py is (with minor modifications) tools/train_net.py running on a custom dataset,

I get the following error message:

Traceback (most recent call last):
  File "relational_rxn_graphs/detector/train.py", line 181, in <module>
    main()
  File "relational_rxn_graphs/detector/train.py", line 151, in main
    backend="nccl", init_method="env://"
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
    group_name, rank)
RuntimeError: public_address was not set in config at /ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/torch/lib/THD/process_group/General.cpp:20

There is probably a problem with your setup: the environment variable MASTER_ADDR is not being set.
If you print os.environ.get('MASTER_ADDR') in the main script, it will probably be empty, which would confirm that the problem is there.

I'm not sure where else to check; there might be other changes in your codebase that are affecting this (maybe you clear all env vars in your script?).
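As a minimal sketch of that check (assuming the standard env:// setup, where torch.distributed.launch exports these variables before spawning the workers), you could add this at the very top of main(), before init_process_group is called:

import os

# torch.distributed.launch exports these for init_method="env://".
# If any of them prints None here, something in this script (or one of its
# imports) is clearing or overwriting the environment before init runs.
print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"))
print("MASTER_PORT =", os.environ.get("MASTER_PORT"))
print("RANK =", os.environ.get("RANK"))
print("WORLD_SIZE =", os.environ.get("WORLD_SIZE"))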

Yes, somehow MASTER_ADDR was being overwritten in my code. Everything is working now! Thank you so much :)

@fmassa Hello, I can now run it on 4 GPUs, but I get another error:
Traceback (most recent call last):
  File "tools/train_net.py", line 176, in <module>
    main()
  File "tools/train_net.py", line 169, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 65, in train
    start_iter=arguments["iteration"],
  File "/home/chh/Code/github_maskscoring_rcnn-master/maskscoring_rcnn/maskrcnn_benchmark/data/build.py", line 158, in make_data_loader
    sampler = make_data_sampler(dataset, shuffle, is_distributed)
  File "/home/chh/Code/github_maskscoring_rcnn-master/maskscoring_rcnn/maskrcnn_benchmark/data/build.py", line 61, in make_data_sampler
    return samplers.DistributedSampler(dataset, shuffle=shuffle)
  File "/home/chh/Code/github_maskscoring_rcnn-master/maskscoring_rcnn/maskrcnn_benchmark/data/samplers/distributed.py", line 29, in __init__
    num_replicas = dist.get_world_size()
  File "/home/chh/anaconda2/envs/maskrcnn_benchmark/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 519, in get_world_size
    if _rank_not_in_group(group):
  File "/home/chh/anaconda2/envs/maskrcnn_benchmark/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 132, in _rank_not_in_group
    default_backend, _ = _pg_map[_get_default_group()]
  File "/home/chh/anaconda2/envs/maskrcnn_benchmark/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 258, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Waiting for your reply, thank you!
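For context: that RuntimeError means the code reached DistributedSampler, whose __init__ calls dist.get_world_size(), before init_process_group ever ran. As a rough sketch, the startup guard in maskrcnn-benchmark's tools/train_net.py looks approximately like the following (the Mask Scoring R-CNN fork in the traceback may differ in detail):

import argparse
import os
import torch

parser = argparse.ArgumentParser()
# --local_rank is supplied automatically by torch.distributed.launch
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# WORLD_SIZE is only set when the script is started via torch.distributed.launch
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
args.distributed = num_gpus > 1

if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")

If the script is started as plain python tools/train_net.py but the distributed flag still ends up True when the data loader is built, the sampler hits exactly this error.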

@maomaochongchh Hi, I met the same problem as you. Have you solved it?
