Hello, I installed mmdet and it compiled successfully.
I can train the model with 1 GPU.
However, a runtime error occurs when I try to train the model with multiple GPUs.
The error looks like:
Traceback (most recent call last):
File "tools/train.py", line 126, in <module>
main()
File "tools/train.py", line 122, in main
timestamp=timestamp)
File "/home/ubuntu/eff_panoptic/mmdet/apis/train.py", line 125, in train_detector
timestamp=timestamp)
File "/home/ubuntu/eff_panoptic/mmdet/apis/train.py", line 230, in _dist_train
model = MMDistributedDataParallel(model.cuda())
File "/home/ubuntu/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 305, in __init__
self._ddp_init_helper()
File "/home/ubuntu/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 323, in _ddp_init_helper
self._module_copies = replicate(self.module, self.device_ids, detach=True)
File "/home/ubuntu/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/home/ubuntu/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 67, in _broadcast_coalesced_reshape
return comm.broadcast_coalesced(tensors, devices)
File "/home/ubuntu/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]
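The message says that some tensors are not on the first device in the list that DistributedDataParallel was given. One way to investigate, offered only as a rough sketch, is to count which devices the model's parameters sit on just before the model is wrapped. The helper name `summarize_param_devices` is hypothetical. It assumes only that the object exposes `named_parameters()` and that each parameter has a `.device` attribute, as a PyTorch `nn.Module` does.

```python
from collections import Counter


def summarize_param_devices(model):
    """Count parameters per device.

    Hypothetical diagnostic helper: works on any object that exposes
    named_parameters() yielding (name, parameter) pairs, where each
    parameter has a .device attribute (as torch.nn.Module does).
    """
    counts = Counter()
    for _name, param in model.named_parameters():
        counts[str(param.device)] += 1
    return dict(counts)
```

Printing the result in each process, just before the `MMDistributedDataParallel(model.cuda())` line from the traceback, would show whether the parameters sit on the device that process is expected to use.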
My environment looks like:
sys.platform: linux
Python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.105
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-16GB
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.3.1
MMDetection: 1.0rc1+d7f86ee
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 10.1
What is your running script or command?
I used the usual command:
tools/dist_train.sh [config_file] [num_gpus]
Try using mmcv 0.2.15.
You are not using the latest code. Please upgrade to the latest mmdetection.
Thanks, I resolved the issue by upgrading both mmdetection and mmcv to the latest version.
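Since the fix here came down to which mmcv and mmdetection versions were installed together, it may help to print the installed versions before and after upgrading. The snippet below is a minimal sketch. The helper name `report_versions` is hypothetical, and it assumes each package exposes a `__version__` attribute, falling back to `None` when the package or the attribute is missing.

```python
import importlib


def report_versions(package_names):
    """Return {name: version string or None} for each requested package.

    Hypothetical helper: None means the package could not be imported
    or does not define __version__.
    """
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # not installed in this environment
            continue
        versions[name] = getattr(module, "__version__", None)
    return versions
```

For example, `report_versions(["torch", "mmcv", "mmdet"])` returns a dictionary that can be pasted into a report alongside the environment listing.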