I ran dist_train.py to train a model. When I tried to train another model at the same time by running dist_train.py again, I got a runtime error:
Traceback (most recent call last):
File "./tools/train.py", line 130, in <module>
main()
File "./tools/train.py", line 83, in main
init_dist(args.launcher, **cfg.dist_params)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 15, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
main()
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/disk1/NiCholas/anaconda3/envs/nick_rs/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r50_20191223.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
Please refer to the documentation.
That doesn't solve the problem. I tried specifying a different port:
CUDA_VISIBLE_DEVICES=1 PORT=30000 ./tools/dist_train.sh configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r101_20191223.py 1
Traceback (most recent call last):
File "./tools/train.py", line 130, in <module>
main()
File "./tools/train.py", line 83, in main
init_dist(args.launcher, **cfg.dist_params)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 15, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 29, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in <module>
main()
File "/disk1/NiCholas/anaconda3/envs/nick_rs/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/disk1/NiCholas/anaconda3/envs/nick_rs/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r101_20191223.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
The PORT is probably occupied by another process. You could try another PORT, or stop all running processes first and then launch multiple dist_train.sh jobs at the same time, each with a different PORT, as in the example below.
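For example, assuming dist_train.sh forwards the PORT variable to torch.distributed.launch (see the later comment about --master_port if your copy of the script does not), the two jobs from above could be run on separate GPUs with different rendezvous ports; the port numbers are just arbitrary free ports:
CUDA_VISIBLE_DEVICES=0 PORT=29500 ./tools/dist_train.sh configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r50_20191223.py 1
CUDA_VISIBLE_DEVICES=1 PORT=29501 ./tools/dist_train.sh configs/RS-data/Faster-RCNN/trainval/191223/faster_rcnn_r101_20191223.py 1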
Maybe you can refer to this issue: #261
For me, changing the PORT variable doesn't work either; adding --master_port 29501 in dist_train.sh solves the problem. A sketch of the change is below.
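A minimal sketch of the modified script, assuming a dist_train.sh that wraps torch.distributed.launch the way the stock mmdetection script does (the exact contents may differ between versions; --master_port is a standard flag of torch.distributed.launch, and the 29500 default here is an assumption):
#!/usr/bin/env bash
PYTHON=${PYTHON:-"python"}
CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}
# Pass the rendezvous port explicitly so concurrent launches do not collide on the default port.
$PYTHON -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}
With a change along these lines, setting PORT (e.g. PORT=30000) when calling dist_train.sh actually takes effect for the second job.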