When I use the pytorch-1.0 branch, training on the Pascal VOC dataset works. But when I interrupt training and resume from a previously saved model, I get a RuntimeError. Has anyone else had this problem? The error is as follows:
Loaded dataset voc_2007_trainval for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/user02/notebook/faster-rcnn.pytorch-pytorch-1.0/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 10022 images...
after filtering, there are 10022 images...
10022 roidb entries
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/pascal_voc/faster_rcnn_1_3_10021.pth
loaded checkpoint models/vgg16/pascal_voc/faster_rcnn_1_3_10021.pth
Traceback (most recent call last):
File "trainval_net.py", line 355, in <module>
optimizer.step()
File "/usr/local/lib/python3.5/dist-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The size of tensor a (512) must match the size of tensor b (18) at non-singleton dimension 0
I haven't encountered this specific issue, but optimizer errors when resuming training usually come down to only a few things.
Could you include the command you're using to train the model? It might shed some light on the issue.
For example, are you resuming training with the same batch size as when you started?
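One way to narrow this down: the traceback fails in the SGD momentum update, where the stored buffer has 512 elements along dimension 0 and the gradient has 18. So it may help to check, right after the optimizer state is restored, whether every momentum buffer still has the same shape as the parameter it is attached to. Below is a minimal sketch, assuming an SGD-with-momentum optimizer as in the traceback. `find_state_shape_mismatches` is a hypothetical helper name, not something from this repo or from PyTorch.

```python
def find_state_shape_mismatches(optimizer):
    """List momentum buffers whose shape differs from their parameter.

    Hypothetical helper (not part of the repo or of PyTorch). It only
    relies on `optimizer.param_groups`, `optimizer.state` and the
    "momentum_buffer" entry that torch.optim.SGD keeps per parameter.
    Returns tuples of (group index, index within group,
    parameter shape, buffer shape).
    """
    mismatches = []
    for group_idx, group in enumerate(optimizer.param_groups):
        for param_idx, param in enumerate(group["params"]):
            # Parameters that have not been stepped yet have no buffer.
            buf = optimizer.state.get(param, {}).get("momentum_buffer")
            if buf is not None and buf.shape != param.shape:
                mismatches.append(
                    (group_idx, param_idx, tuple(param.shape), tuple(buf.shape))
                )
    return mismatches
```

Calling this right after the checkpoint's optimizer state is loaded and printing the result should show whether the restored buffers line up with the parameters. An empty list means the shapes match and the cause is probably elsewhere.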
@AlexanderHustinx Thanks for your reply. I resumed with the same batch size that I used when I started training. The command I used is as follows:
CUDA_VISIBLE_DEVICES=2 python3 trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --nw 0 --lr 0.01 --lr_decay_step 5 --r True --checksession 1 --checkepoch 3 --checkpoint 10021 --use_tfb --cuda
I hope you can help solve this problem.
This might be a stretch, but you could try this:
https://github.com/jwyang/faster-rcnn.pytorch/issues/475#issuecomment-483243293
@AlexanderHustinx I tried your solution and it works. Thanks for your help. I also have some questions. Do you use the pytorch-1.0 branch code? After moving the two lines you mentioned, does the model perform well? Are the results consistent with those listed by the authors?
Great, happy to see it worked for you!
I'm using the pytorch-1.0 branch as well. There is an inconsistency when using the listed trained models, as it appears there was a slight change in PyTorch 1.0 compared to PyTorch 0.4.0. I'm not sure what exactly, but I believe it is listed in one of the Issues in this repo.
When training the model myself the results are rather consistent, but there is a slight difference in performance (0.5~2.0 mAP) for me.