I ran dist_train.py to train a model. When I tried to train another model at the same time by running dist_train.py again, I got a runtime error:
Traceback (most recent call last):
File "./tools/train.py", line 130, in <module>
main()
File "./tools/train.py", line 83, in main
init_dist(args.launcher, **cfg.dist_params)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 15, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
main()
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/disk1/NiCholas/anaconda3/envs/nick_rs/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r50_20191223.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
Please refer to the documentation.
That doesn't solve the problem. I tried specifying a different port:
CUDA_VISIBLE_DEVICES=1 PORT=30000 ./tools/dist_train.sh configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r101_20191223.py 1
Traceback (most recent call last):
File "./tools/train.py", line 130, in <module>
main()
File "./tools/train.py", line 83, in main
init_dist(args.launcher, **cfg.dist_params)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 15, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
main()
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/disk1/NiCholas/anaconda3/envs/nick_rs/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r101_20191223.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
The PORT is probably occupied by another process. You could try another PORT, or stop all running processes first and then launch multiple dist_train.sh jobs at the same time, each with a different PORT, as in the example below.
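For example, assuming dist_train.sh forwards the PORT variable to torch.distributed.launch (see the later comment about --master_port if your copy of the script does not), the two jobs from above could be run on separate GPUs with different rendezvous ports; the port numbers are just arbitrary free ports:
CUDA_VISIBLE_DEVICES=0 PORT=29500 ./tools/dist_train.sh configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r50_20191223.py 1
CUDA_VISIBLE_DEVICES=1 PORT=29501 ./tools/dist_train.sh configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r101_20191223.py 1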
Maybe you can refer to this issue: #261
For me, changing the PORT variable doesn't work either; adding --master_port 29501 in dist_train.sh solves the problem. A sketch of the change is below.
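A minimal sketch of the modified script, assuming a dist_train.sh that wraps torch.distributed.launch the way the stock mmdetection script does (the exact contents may differ between versions; --master_port is a standard flag of torch.distributed.launch, and the 29500 default here is an assumption):
#!/usr/bin/env bash
PYTHON=${PYTHON:-"python"}
CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}
# Pass the rendezvous port explicitly so concurrent launches do not collide on the default port.
$PYTHON -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}
With a change along these lines, setting PORT (e.g. PORT=30000) when calling dist_train.sh actually takes effect for the second job.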