Faster-rcnn.pytorch: resuming training with PyTorch 1.0

Created on 11 Apr 2019 · 5 Comments · Source: jwyang/faster-rcnn.pytorch

When I use the pytorch-1.0 branch, training on the Pascal VOC dataset works. But when I interrupt the training and resume from the previously saved model, I get a RuntimeError. Has anyone else had this problem? The output is as follows:

Loaded dataset voc_2007_trainval for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/user02/notebook/faster-rcnn.pytorch-pytorch-1.0/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 10022 images...
after filtering, there are 10022 images...
10022 roidb entries
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/pascal_voc/faster_rcnn_1_3_10021.pth
loaded checkpoint models/vgg16/pascal_voc/faster_rcnn_1_3_10021.pth
Traceback (most recent call last):
  File "trainval_net.py", line 355, in <module>
    optimizer.step()
  File "/usr/local/lib/python3.5/dist-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The size of tensor a (512) must match the size of tensor b (18) at non-singleton dimension 0

Codermay
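
For readers hitting the same traceback: the failing line in `sgd.py` combines a stored momentum buffer with the current gradient, so the message suggests that a buffer of one size (512) is being paired with a gradient of another (18). A minimal sketch of how one might check for that after loading a checkpoint is shown below. The helper name `find_momentum_mismatches` and the toy two-layer model are hypothetical and only stand in for the real network; the mismatch is simulated by hand, since the actual cause in this thread is not established here.

```python
import torch
from torch import nn


def find_momentum_mismatches(optimizer):
    """Return (group_index, param_index, param_shape, buffer_shape) for every
    parameter whose SGD momentum buffer has a different shape than the parameter."""
    mismatches = []
    for g, group in enumerate(optimizer.param_groups):
        for i, p in enumerate(group["params"]):
            buf = optimizer.state.get(p, {}).get("momentum_buffer")
            if buf is not None and buf.shape != p.shape:
                mismatches.append((g, i, tuple(p.shape), tuple(buf.shape)))
    return mismatches


# Toy stand-in for a real network (hypothetical, not Faster R-CNN).
model = nn.Sequential(nn.Linear(4, 512), nn.Linear(512, 18))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One training step so that SGD creates a momentum buffer per parameter.
model(torch.randn(2, 4)).sum().backward()
optimizer.step()
healthy = find_momentum_mismatches(optimizer)

# Simulate a misaligned restore by hand: the 512-element bias of the first
# layer receives a buffer sized for the 18-element bias of the second layer.
params = list(model.parameters())
optimizer.state[params[1]]["momentum_buffer"] = torch.zeros(18)
broken = find_momentum_mismatches(optimizer)

print("before corruption:", healthy)
print("after corruption:", broken)
```

Calling such a helper right after `optimizer.load_state_dict(...)` and before the first `optimizer.step()` would show which parameter is paired with the wrong buffer, instead of failing inside the optimizer.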

All 5 comments

I haven't encountered this specific issue, but optimizer errors when resuming training usually come down to only a few causes.

Could you post the command you're using to train the model? It might shed some light on the issue.
For example, are you resuming training with the same batch size you started with?

AlexanderHustinx on 17 Apr 2019

@AlexanderHustinx Thanks for your reply. I resumed with the same batch size I used when I started training. The command I used is as follows:
CUDA_VISIBLE_DEVICES=2 python3 trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --nw 0 --lr 0.01 --lr_decay_step 5 --r True --checksession 1 --checkepoch 3 --checkpoint 10021 --use_tfb --cuda
I hope you can help me solve this problem.

Codermay on 17 Apr 2019

This might be a stretch, but you could try this:
https://github.com/jwyang/faster-rcnn.pytorch/issues/475#issuecomment-483243293

AlexanderHustinx on 17 Apr 2019

@AlexanderHustinx I tried your solution and it works. Thanks for your help. I also have a few questions. Are you using the pytorch-1.0 branch code? After moving the two lines you mentioned, does the model perform well? Are the results consistent with those listed by the authors?

Codermay on 17 Apr 2019

Great, happy to see it worked for you!

I'm using the PyTorch-1.0 branch as well. There is an inconsistency when using the listed trained models, apparently because of a slight change in PyTorch 1.0 compared to PyTorch 0.4.0. I'm not sure what exactly changed, but I believe it is covered in one of the Issues in this repo.
When I train the model myself the results are fairly consistent, but I see a slight difference in performance (0.5~2.0 mAP).

AlexanderHustinx on 17 Apr 2019
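
The linked comment is not reproduced in this thread, so the following is only a generic sketch of a save-and-resume round trip in PyTorch, not the fix described there. It assumes a toy `nn.Linear` model and a hypothetical `build` helper, and it runs on CPU with an in-memory buffer in place of a `.pth` file. The ordering it uses (move the model to its device first, then create the optimizer, then restore the model state and the optimizer state) follows the advice in the `torch.optim` documentation to move a model to the GPU before constructing optimizers for it.

```python
import io

import torch
from torch import nn


def build(device):
    # Hypothetical helper: place the model on its device BEFORE the optimizer
    # is constructed, so the optimizer tracks the final parameter tensors.
    model = nn.Linear(8, 3).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    return model, optimizer


device = torch.device("cpu")  # assumption: CPU only, so the sketch runs anywhere

# Initial run: one step, so the optimizer has momentum buffers worth saving.
model, optimizer = build(device)
model(torch.randn(4, 8)).sum().backward()
optimizer.step()

# Save model and optimizer state together (in memory instead of a .pth file).
buffer = io.BytesIO()
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, buffer)
buffer.seek(0)

# Resume: rebuild in the same order, then restore both state dicts.
resumed_model, resumed_optimizer = build(device)
checkpoint = torch.load(buffer, map_location=device)
resumed_model.load_state_dict(checkpoint["model"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])

# Check that the weights round-tripped before training continues.
restored_ok = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), resumed_model.state_dict().values())
)

# Continue training; this step would raise if a buffer and gradient disagreed in shape.
resumed_optimizer.zero_grad()
resumed_model(torch.randn(4, 8)).sum().backward()
resumed_optimizer.step()
print("weights restored:", restored_ok)
```

Because the optimizer state is matched to parameters by position, rebuilding the model and parameter groups in exactly the same way before loading is what keeps each momentum buffer aligned with its parameter.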
