Mmdetection: RuntimeError: all tensors must be on devices[0]

Created on 15 Mar 2020 · 4 Comments · Source: open-mmlab/mmdetection

Describe the bug
'tools/train.py' runs without any errors. However, running 'tools/dist_train.py' fails with the following error: RuntimeError: all tensors must be on devices[0]

Error traceback
Traceback (most recent call last):
File "./tools/train.py", line 124, in <module>
main()
File "./tools/train.py", line 120, in main
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 125, in train_detector
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 230, in _dist_train
model = MMDistributedDataParallel(model.cuda())
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 305, in __init__
self._ddp_init_helper()
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 323, in _ddp_init_helper
self._module_copies = replicate(self.module, self.device_ids, detach=True)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 67, in _broadcast_coalesced_reshape
return comm.broadcast_coalesced(tensors, devices)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]
Traceback (most recent call last):
File "./tools/train.py", line 124, in <module>
main()
File "./tools/train.py", line 120, in main
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 125, in train_detector
timestamp=timestamp)
File "/disk1/zzw/works/mmdetection/mmdet/apis/train.py", line 263, in _dist_train
CocoDistEvalmAPHook(val_dataset_cfg, **eval_cfg))
TypeError: __init__() got an unexpected keyword argument 'metric'
^CTraceback (most recent call last):
File "/disk1/zzw/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/zzw/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/disk1/zzw/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
process.wait()
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 990, in wait
return self._wait(timeout=timeout)
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 1624, in _wait
(pid, sts) = self._try_wait(0)
File "/disk1/zzw/anaconda3/lib/python3.7/subprocess.py", line 1582, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

Environment
sys.platform: linux
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-9.0
NVCC: Cuda compilation tools, release 9.0, V9.0.176
GPU 0,1,2,3: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.1
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.3.1
MMDetection: 1.0.0+923b70a
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 9.0

Please help me fix the bug. Thank you!

Most helpful comment

If it's too hard to merge your code with the current master branch, a hotfix would be to install mmcv==0.2.16.

All 4 comments

Hi @GaryZhu1996
Please update your mmdetection to the latest version.

I have modified the code to suit my own requirements, so I would rather not re-download the project; porting my edits to a new version is not an easy task. Could you please explain the cause of the error and the details of the fix?

If it's too hard to merge your code with the current master branch, a hotfix would be to install mmcv==0.2.16.
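
Before relying on the hotfix, it may help to confirm which mmcv version the training environment actually imports (the report above lists MMCV 0.3.1). The snippet below is a minimal sketch of such a check; the helper names `installed_mmcv_version` and `matches_pin` are hypothetical, and it assumes that mmcv exposes its version as `mmcv.__version__`.

```python
PINNED_MMCV = "0.2.16"  # version named in the hotfix suggestion


def matches_pin(installed, pinned=PINNED_MMCV):
    """Return True when the installed version string equals the pinned one."""
    return installed.strip() == pinned


def installed_mmcv_version():
    """Return mmcv's version string, or None if it cannot be determined.

    Assumption: mmcv exposes its version as `mmcv.__version__`.
    """
    try:
        import mmcv
    except ImportError:
        return None
    return getattr(mmcv, "__version__", None)


if __name__ == "__main__":
    version = installed_mmcv_version()
    if version is None:
        print("mmcv version could not be determined in this environment")
    elif matches_pin(version):
        print(f"mmcv {version} matches the pinned hotfix version")
    else:
        print(f"mmcv {version} found; the hotfix suggests {PINNED_MMCV}")
```

If the versions differ, installing the pinned release (for example with `pip install mmcv==0.2.16`) in the same environment that runs the training script is what the hotfix amounts to.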

OK! I will try it later. Thanks for your help.
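
For readers asking about the principle behind the error, one plausible reading of the first traceback is this: `_dist_train` wraps the model with `MMDistributedDataParallel(model.cuda())` and passes no `device_ids`, and the call then reaches `replicate(self.module, self.device_ids, ...)`, which broadcasts parameters from `devices[0]`. If the wrapper falls back to all visible GPUs while each process has placed the model on its own GPU, every process except the first holds tensors that are not on `devices[0]`. The sketch below is an assumption about how to avoid that, not the fix the maintainers applied: it uses PyTorch's own `DistributedDataParallel` rather than the mmcv wrapper, and the helper names are hypothetical.

```python
def single_device_ddp_kwargs(local_device):
    """Keyword arguments that pin a DDP wrapper to one GPU per process.

    Hypothetical helper: with a single entry in `device_ids`, the wrapper
    has no second device to replicate the module onto.
    """
    return {"device_ids": [local_device], "output_device": local_device}


def wrap_for_distributed(model):
    """Move `model` to this process's GPU and wrap it for distributed training.

    Assumes torch.distributed.init_process_group() has already been called
    and torch.cuda.set_device(local_rank) was run by the launcher setup.
    """
    # Imported lazily so the helper above can be used without a GPU setup.
    import torch
    from torch.nn.parallel import DistributedDataParallel

    device = torch.cuda.current_device()
    return DistributedDataParallel(
        model.cuda(device), **single_device_ddp_kwargs(device)
    )
```

In a modified copy of the code, the equivalent change would be made where the model is wrapped in `_dist_train`; whether the mmcv wrapper accepts the same keyword arguments depends on the installed mmcv version and should be checked against its source.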
