Maskrcnn-benchmark: Cannot run multiple training processes simultaneously on same GPU server?

Created on 3 Dec 2018 · 9 comments · Source: facebookresearch/maskrcnn-benchmark

If I try to run two training processes, each using 4 GPUs, simultaneously on an 8-GPU server, I get this error:

RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch-nightly_1543051141017/work/torch/lib/THD/process_group/General.cpp:20

I don't know how to solve it.

question

Most helpful comment

Can you try specifying a different master_addr and master_port in torch.distributed.launch?

CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_addr 127.0.0.2 --master_port 29501 tools/train_net.py 

All 9 comments

Hi @xuw080
I had the same problem before, and I solved it by setting CUDA_VISIBLE_DEVICES:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch ...
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch ...

You could check https://github.com/facebookresearch/maskrcnn-benchmark/issues/80 for more detail.

Thank you for your reply. In fact, this is exactly what I did to launch the two multi-GPU processes.

I use this script to launch the first one:
NGPUS=4
GPU_ID=0,1,2,3
CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py \
--config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"

and this one for the second:

NGPUS=4
GPU_ID=4,5,6,7
CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py \
--config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"

When I launch the second one, this error appears:

Traceback (most recent call last):
  File "tools/train_net.py", line 197, in <module>
    main()
  File "tools/train_net.py", line 173, in main
    backend="nccl", init_method="env://"
  File "/home/Xwang/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
    group_name, rank)
RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch-nightly_1543051141017/work/torch/lib/THD/process_group/General.cpp:20

It seems the second process tries to use the same address for launching its process group.

Does anyone know how to solve this? Thank you so much!

Can you try specifying a different master_addr and master_port in torch.distributed.launch?

CUDA_VISIBLE_DEVICES=${GPU_ID} python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_addr 127.0.0.2 --master_port 29501 tools/train_net.py 
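For example, the two four-GPU jobs could then run side by side with something like this (the port numbers are only examples; 29500 is the launcher's default, and any two distinct free ports will do):

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port 29500 tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_C4_1x.yaml"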

Works now, thank you so much!

I am having a similar issue, although instead of Address already in use the error says public_address was not set in config.

I have looked around a bit but am unable to solve this. Any idea what could be wrong/what else to look into?

If I run

python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.2 --master_port 29501 relational_rxn_graphs/detector/train.py --config-file configs/detector/e2e_faster_rcnn_R_101_FPN_1x.yaml

where relational_rxn_graphs/detector/train.py is (with minor modifications) tools/train_net.py running on a custom dataset,

I get the following error message:

Traceback (most recent call last):
  File "relational_rxn_graphs/detector/train.py", line 181, in <module>
    main()
  File "relational_rxn_graphs/detector/train.py", line 151, in main
    backend="nccl", init_method="env://"
  File "/u/ial/.local/deeplearning/pytorch-master/lib64/python3.5/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
    group_name, rank)
RuntimeError: public_address was not set in config at /ibm/gpfs-homes/ial/.local/tmp_compilation/pytorch-master-at10.0/pytorch/torch/lib/THD/process_group/General.cpp:20

There is probably a problem with your setup: the environment variable MASTER_ADDR is not being set.
If you print os.environ.get('MASTER_ADDR') in the main script, it will probably be empty, which would confirm that the problem is there.

I'm not sure where else to check; there might be other changes in your codebase that are affecting this (maybe you clear all env vars in your script?).
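As a minimal sketch of that check (assuming the standard env:// setup, where torch.distributed.launch exports these variables before spawning the workers), you could add this at the very top of main(), before init_process_group is called:

import os

# torch.distributed.launch exports these for init_method="env://".
# If any of them prints None here, something in this script (or one of its
# imports) is clearing or overwriting the environment before init runs.
print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"))
print("MASTER_PORT =", os.environ.get("MASTER_PORT"))
print("RANK =", os.environ.get("RANK"))
print("WORLD_SIZE =", os.environ.get("WORLD_SIZE"))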

Yes, somehow MASTER_ADDR was being overwritten in my code. Everything is working now! Thank you so much :)

@fmassa Hello, I can now run it on 4 GPUs, but I get another error:
Traceback (most recent call last):
  File "tools/train_net.py", line 176, in <module>
    main()
  File "tools/train_net.py", line 169, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 65, in train
    start_iter=arguments["iteration"],
  File "/home/chh/Code/github_maskscoring_rcnn-master/maskscoring_rcnn/maskrcnn_benchmark/data/build.py", line 158, in make_data_loader
    sampler = make_data_sampler(dataset, shuffle, is_distributed)
  File "/home/chh/Code/github_maskscoring_rcnn-master/maskscoring_rcnn/maskrcnn_benchmark/data/build.py", line 61, in make_data_sampler
    return samplers.DistributedSampler(dataset, shuffle=shuffle)
  File "/home/chh/Code/github_maskscoring_rcnn-master/maskscoring_rcnn/maskrcnn_benchmark/data/samplers/distributed.py", line 29, in __init__
    num_replicas = dist.get_world_size()
  File "/home/chh/anaconda2/envs/maskrcnn_benchmark/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 519, in get_world_size
    if _rank_not_in_group(group):
  File "/home/chh/anaconda2/envs/maskrcnn_benchmark/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 132, in _rank_not_in_group
    default_backend, _ = _pg_map[_get_default_group()]
  File "/home/chh/anaconda2/envs/maskrcnn_benchmark/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 258, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Waiting for your reply, thank you!
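For context: that RuntimeError means the code reached DistributedSampler, whose __init__ calls dist.get_world_size(), before init_process_group ever ran. As a rough sketch, the startup guard in maskrcnn-benchmark's tools/train_net.py looks approximately like the following (the Mask Scoring R-CNN fork in the traceback may differ in detail):

import argparse
import os
import torch

parser = argparse.ArgumentParser()
# --local_rank is supplied automatically by torch.distributed.launch
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# WORLD_SIZE is only set when the script is started via torch.distributed.launch
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
args.distributed = num_gpus > 1

if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")

If the script is started as plain python tools/train_net.py but the distributed flag still ends up True when the data loader is built, the sampler hits exactly this error.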

@maomaochongchh Hi, I met the same problem as you. Have you solved it?
