Describe the bug
'tools/train.py' runs without any problems. However, the following error is raised when I run 'tools/dist_train.sh': RuntimeError: all tensors must be on devices[0]
Error traceback
Traceback (most recent call last):
File "./tools/train.py", line 124, in
main()
File "./tools/train.py", line 120, in main
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 125, in train_detector
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 230, in _dist_train
model = MMDistributedDataParallel(model.cuda())
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 305, in __init__
self._ddp_init_helper()
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 323, in _ddp_init_helper
self._module_copies = replicate(self.module, self.device_ids, detach=True)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 67, in _broadcast_coalesced_reshape
return comm.broadcast_coalesced(tensors, devices)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]
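For context, this error typically means that when DDP tries to replicate the model, some parameters are not on the first entry of its device list; on PyTorch 1.4, wrapping with MMDistributedDataParallel(model.cuda()) and no device_ids makes a single process replicate across all visible GPUs. A minimal sketch of the usual per-process pattern, assuming one process per GPU launched via torch.distributed.launch with the process group already initialized (wrap_model and local_rank are illustrative names, not mmdetection's actual code):

import torch
from torch.nn.parallel import DistributedDataParallel

def wrap_model(model, local_rank):
    # Pin this process to its own GPU before moving the model, so that
    # model.cuda() places every parameter on the same device.
    torch.cuda.set_device(local_rank)
    model = model.cuda()
    # Restrict DDP to the current device; omitting device_ids makes it
    # replicate across all visible GPUs and raise "all tensors must be
    # on devices[0]" if any parameter lives elsewhere.
    return DistributedDataParallel(model, device_ids=[local_rank])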
Traceback (most recent call last):
File "./tools/train.py", line 124, in
main()
File "./tools/train.py", line 120, in main
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 125, in train_detector
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 263, in _dist_train
CocoDistEvalmAPHook(val_dataset_cfg, **eval_cfg))
TypeError: __init__() got an unexpected keyword argument 'metric'
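For context, this TypeError is a config/code version mismatch: newer configs put a 'metric' key into the evaluation settings, while the older CocoDistEvalmAPHook.__init__ does not accept that keyword. A minimal, hypothetical workaround sketch that drops the unsupported key instead of upgrading the code (eval_cfg here is a stand-in for the dict taken from the config):

# Hypothetical evaluation settings as a newer config might define them.
eval_cfg = dict(interval=1, metric='bbox')

# Strip the keyword the older hook does not understand before building it.
hook_kwargs = {k: v for k, v in eval_cfg.items() if k != 'metric'}
print(hook_kwargs)  # -> {'interval': 1}

# The hook would then be constructed as before, e.g.:
# runner.register_hook(CocoDistEvalmAPHook(val_dataset_cfg, **hook_kwargs))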
^C
Traceback (most recent call last):
File "/disk1/zzw/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/zzw/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
process.wait()
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 990, in wait
return self._wait(timeout=timeout)
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 1624, in _wait
(pid, sts) = self._try_wait(0)
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 1582, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
Environment
sys.platform: linux
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-9.0
NVCC: Cuda compilation tools, release 9.0, V9.0.176
GPU 0,1,2,3: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.3.1
MMDetection: 1.0.0+923b70a
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 9.0
Please help me fix this bug. Thank you!
Hi @GaryZhu1996
Please update your mmdetection to the latest version.
I have modified the code to fit my requirements, so I would rather not re-download the project; it would not be easy to re-apply my edits to a new version. Could you please tell me the principles and details of the solution?
If it's too hard to merge your code with the current master branch, a hotfix could be to install mmcv==0.2.16.
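For reference, once the pinned version is installed you can confirm which mmcv is actually being imported with a quick check (a minimal sketch):

import mmcv
# The hotfix expects the pinned release to be active in this environment.
print(mmcv.__version__)  # should print 0.2.16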
OK! I will try it later. Thanks for your help.