Describe the bug
'tools/train.py' runs without any problems. However, the following error is raised when I run 'tools/dist_train.sh': RuntimeError: all tensors must be on devices[0]
Error traceback
Traceback (most recent call last):
File "./tools/train.py", line 124, in
main()
File "./tools/train.py", line 120, in main
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 125, in train_detector
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 230, in _dist_train
model = MMDistributedDataParallel(model.cuda())
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 305, in __init__
self._ddp_init_helper()
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 323, in _ddp_init_helper
self._module_copies = replicate(self.module, self.device_ids, detach=True)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 67, in _broadcast_coalesced_reshape
return comm.broadcast_coalesced(tensors, devices)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]
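For context, this error typically means that when DDP tries to replicate the model, some parameters are not on the first entry of its device list; on PyTorch 1.4, wrapping with MMDistributedDataParallel(model.cuda()) and no device_ids makes a single process replicate across all visible GPUs. A minimal sketch of the usual per-process pattern, assuming one process per GPU launched via torch.distributed.launch with the process group already initialized (wrap_model and local_rank are illustrative names, not mmdetection's actual code):

import torch
from torch.nn.parallel import DistributedDataParallel

def wrap_model(model, local_rank):
    # Pin this process to its own GPU before moving the model, so that
    # model.cuda() places every parameter on the same device.
    torch.cuda.set_device(local_rank)
    model = model.cuda()
    # Restrict DDP to the current device; omitting device_ids makes it
    # replicate across all visible GPUs and raise "all tensors must be
    # on devices[0]" if any parameter lives elsewhere.
    return DistributedDataParallel(model, device_ids=[local_rank])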
Traceback (most recent call last):
File "./tools/train.py", line 124, in
main()
File "./tools/train.py", line 120, in main
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 125, in train_detector
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 263, in _dist_train
CocoDistEvalmAPHook(val_dataset_cfg, **eval_cfg))
TypeError: __init__() got an unexpected keyword argument 'metric'
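For context, this TypeError is a config/code version mismatch: newer configs put a 'metric' key into the evaluation settings, while the older CocoDistEvalmAPHook.__init__ does not accept that keyword. A minimal, hypothetical workaround sketch that drops the unsupported key instead of upgrading the code (eval_cfg here is a stand-in for the dict taken from the config):

# Hypothetical evaluation settings as a newer config might define them.
eval_cfg = dict(interval=1, metric='bbox')

# Strip the keyword the older hook does not understand before building it.
hook_kwargs = {k: v for k, v in eval_cfg.items() if k != 'metric'}
print(hook_kwargs)  # -> {'interval': 1}

# The hook would then be constructed as before, e.g.:
# runner.register_hook(CocoDistEvalmAPHook(val_dataset_cfg, **hook_kwargs))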
^C
Traceback (most recent call last):
File "/disk1/zzw/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/zzw/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
process.wait()
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 990, in wait
return self._wait(timeout=timeout)
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 1624, in _wait
(pid, sts) = self._try_wait(0)
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 1582, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
Environment
sys.platform: linux
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-9.0
NVCC: Cuda compilation tools, release 9.0, V9.0.176
GPU 0,1,2,3: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.3.1
MMDetection: 1.0.0+923b70a
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 9.0
Please help me fix this bug. Thank you!
Hi @GaryZhu1996
Please update your mmdetection to the latest version.
I have modified the code to fit my requirements, so I would rather not re-download the project; it would not be easy to re-apply my edits to a new version. Could you please tell me the principles and details of the solution?
If it's too hard to merge your code with the current master branch, a hotfix could be to install mmcv==0.2.16.
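For reference, once the pinned version is installed you can confirm which mmcv is actually being imported with a quick check (a minimal sketch):

import mmcv
# The hotfix expects the pinned release to be active in this environment.
print(mmcv.__version__)  # should print 0.2.16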
OK! I will try it later. Thanks for your help.