I have 4 GPUs on my machine and am running training with
--dataset pascal_voc --net res101 --bs 8 --nw 4 --lr 4e-3 --lr_decay_step 8 --cuda --mGPUs
but I get this error:
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py:24: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 0 which
has less than 75% of the memory or cores of GPU 1. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
/home/user/prj/pytorch-faster-rcnn/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/user/prj/pytorch-faster-rcnn/lib/model/faster_rcnn/faster_rcnn.py:98: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
cls_prob = F.softmax(cls_score)
Traceback (most recent call last):
File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1664, in <module>
main()
File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/user/prj/pytorch-faster-rcnn/trainval_net.py", line 323, in <module>
rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 115, in forward
return self.gather(outputs, self.output_device)
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 127, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
return gather_map(outputs)
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
File "/home/user/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/parallel/_functions.py", line 54, in <lambda>
ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
RuntimeError: dimension specified as 0 but tensor has no dimensions
Are you using PyTorch 0.4? Mine also crashed when using PyTorch 0.4 with multiple GPUs. https://github.com/pytorch/pytorch/issues/5552
From the post you linked, it seems this has been fixed and merged. However, I still get the error with the newest PyTorch.
Same problem here, using PyTorch 0.4. I'm not sure how to fix this yet.
I just fixed this problem by unsqueezing RCNN_loss_cls, RCNN_loss_bbox, rpn_loss_cls, rpn_loss_cls in lib/model/faster_rcnn/faster_rcnn.py. Basically, the scalar (0-dimensional) loss tensors in PyTorch 0.4 cause the error when DataParallel gathers the outputs along dimension 0, so you need to add one more dimension: rpn_loss_cls = torch.unsqueeze(rpn_loss_cls, 0) ... BTW, I compiled PyTorch 0.4 from source, but I think it should also work if you install from conda.
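To illustrate the point about scalar tensors, here is a minimal CPU-only sketch. It is not code from the repository: the loss values are made up, and torch.cat is used only as a stand-in for the concatenation along dimension 0 that the gather step performs on the per-GPU outputs.

```python
import torch

# Hypothetical per-GPU loss values. Each one is a 0-dimensional (scalar)
# tensor, which is what loss functions return in PyTorch 0.4 and later.
per_gpu_losses = [torch.tensor(0.7), torch.tensor(0.9)]
print(per_gpu_losses[0].dim())  # 0: there is no dimension 0 to gather along

# Adding a leading dimension turns each scalar into a tensor of shape (1,).
unsqueezed = [torch.unsqueeze(loss, 0) for loss in per_gpu_losses]

# Stand-in for the gather step: concatenate the per-GPU results along dim 0.
gathered = torch.cat(unsqueezed, dim=0)
print(gathered.shape)  # torch.Size([2])
```

With the extra dimension, each replica contributes one element and the gathered result has one entry per GPU.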
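@wtl-zju thank you, it works. I'm using Python 3 with PyTorch 0.4 in a virtualenv.
There is a slight error in @wtl-zju's comment (rpn_loss_cls is listed twice; the second one is presumably rpn_loss_bbox).
To clarify, add these lines just before the values are returned in lib/model/faster_rcnn/faster_rcnn.py: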
if self.training:
    rpn_loss_cls = torch.unsqueeze(rpn_loss_cls, 0)
    rpn_loss_bbox = torch.unsqueeze(rpn_loss_bbox, 0)
    RCNN_loss_cls = torch.unsqueeze(RCNN_loss_cls, 0)
    RCNN_loss_bbox = torch.unsqueeze(RCNN_loss_bbox, 0)
These lines go inside the if self.training: check because the losses are only computed during training, not when testing or predicting. In that case the loss variables are simply set to 0, which can be seen a few lines above in the code.
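One consequence worth noting: once the losses carry a leading dimension, the gathered result holds one value per GPU, so it has to be reduced back to a scalar before calling backward(). The sketch below shows the idea on CPU. The helper name as_gatherable and the loss values are hypothetical, and torch.cat again only stands in for the gather step.

```python
import torch

def as_gatherable(loss):
    """Hypothetical helper: give a scalar loss tensor a leading dimension."""
    if torch.is_tensor(loss) and loss.dim() == 0:
        return torch.unsqueeze(loss, 0)
    # Plain numbers (such as the 0 used outside training) pass through as-is.
    return loss

# Made-up losses from two replicas, each unsqueezed to shape (1,).
replica_losses = [as_gatherable(torch.tensor(0.5)),
                  as_gatherable(torch.tensor(1.5))]

# Stand-in for the gather step: one entry per GPU along dimension 0.
gathered = torch.cat(replica_losses, dim=0)

# Reduce to a single scalar again before backpropagation.
total_loss = gathered.mean()
print(total_loss.dim())  # 0
```

Averaging is only one possible reduction; summing would also give a scalar, but it scales the gradient with the number of GPUs.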