Faster-rcnn.pytorch: Attempted to restart training on a COCO dataset after 2 epochs... failed with runtime error

Created on 13 Mar 2019 · 9 Comments · Source: jwyang/faster-rcnn.pytorch

I was training on COCO. Everything went smoothly; I got into the 3rd epoch and paused the training. When I tried to resume, I got the errors below. I am unsure how to proceed with debugging or cleaning up.

CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset coco --net res101 --bs 1 --nw 1 --lr .001 --lr_decay_step 10 --cuda --r true --checksession 1 --checkepoch 2 --checkpoint 234531 --use_tfb
234532 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
loading checkpoint models/res101/coco/faster_rcnn_1_2_234531.pth
loaded checkpoint models/res101/coco/faster_rcnn_1_2_234531.pth
Traceback (most recent call last):
  File "trainval_net.py", line 339, in <module>
    optimizer.step()
  File "/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
(pytorch1_py36) emcp@k:faster-rcnn.pytorch$ 
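
For debugging a traceback like this one, it can help to compare the device of each optimizer state tensor with the device of the parameter it belongs to. The following is only a minimal sketch: `find_device_mismatches` is a hypothetical helper (it is not part of faster-rcnn.pytorch), and the small `nn.Linear` model is a stand-in for the detector.

```python
import torch
import torch.nn as nn


def find_device_mismatches(optimizer):
    """List optimizer state tensors that live on a different device than their parameter."""
    mismatches = []
    for group in optimizer.param_groups:
        for param in group["params"]:
            # .get() avoids creating empty state entries for untouched parameters.
            for key, value in optimizer.state.get(param, {}).items():
                # Skip "step": some optimizers (e.g. Adam) may keep it on the CPU on purpose.
                if key == "step" or not torch.is_tensor(value):
                    continue
                if value.device != param.device:
                    mismatches.append((key, str(value.device), str(param.device)))
    return mismatches


# Hypothetical stand-in for the detector; any nn.Module works the same way.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()  # the first step creates one momentum buffer per parameter

print(find_device_mismatches(optimizer))  # an empty list means state and parameters agree
```

In a resume script, calling such a helper just before the first `optimizer.step()` would show whether the momentum buffers restored from the checkpoint ended up on the CPU while the parameters are on the GPU, which is what the error message suggests.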

All 9 comments

I guess you need to convert cpu model to gpu

I ran everything with CUDA on, according to the arguments I passed at the terminal. I will double-check, though.

That model came from this call, so I am a little lost as to why it is being treated as a CPU model:

$ CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset coco --net res101 --bs 1 --nw 1 --lr .001 --lr_decay_step 10 --cuda

I had paused it manually at epoch 3. Could that have caused the issue?

Full output of the call that was eventually paused:

(pytorch1_py36) emcp@k:/faster-rcnn.private$ CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset coco --net res101 --bs 1 --nw 1 --lr .001 --lr_decay_step 10 --cuda
Called with args:
Namespace(batch_size=1, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='coco', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=10, mGPUs=False, max_epochs=20, net='res101', num_workers=1, optimizer='sgd', resume=False, save_dir='models', session=1, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [4, 8, 16, 32],
 'CROP_RESIZE_WITH_MAX_POOL': False,
 'CUDA': False,
 'DATA_DIR': '/faster-rcnn.private/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'res101',
 'FEAT_STRIDE': [16],
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'MAX_NUM_GT_BOXES': 50,
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
               'FIXED_LAYERS': 5,
               'REGU_DEPTH': False,
               'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
 'POOLING_MODE': 'align',
 'POOLING_SIZE': 7,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/faster-rcnn.private',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt',
          'RPN_MIN_SIZE': 16,
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 128,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'BN_TRAIN': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': False,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_MIN_SIZE': 8,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
           'STEPSIZE': [30000],
           'SUMMARY_INTERVAL': 180,
           'TRIM_HEIGHT': 600,
           'TRIM_WIDTH': 600,
           'TRUNCATED': False,
           'USE_ALL_GT': True,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0001},
 'USE_GPU_NMS': True}
loading annotations into memory...
Done (t=8.68s)
creating index...
index created!
Loaded dataset `coco_2014_train` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
coco_2014_train gt roidb loaded from /faster-rcnn.private/data/cache/coco_2014_train_gt_roidb.pkl
done
Preparing training data...
done
loading annotations into memory...
Done (t=4.52s)
creating index...
index created!
Loaded dataset `coco_2014_valminusminival` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
coco_2014_valminusminival gt roidb loaded from /faster-rcnn.private/data/cache/coco_2014_valminusminival_gt_roidb.pkl
done
Preparing training data...
done
loading annotations into memory...
Done (t=3.64s)
creating index...
index created!
before filtering, there are 236574 images...
after filtering, there are 234532 images...
234532 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch  1][iter    0/234532] loss: 6.2209, lr: 1.00e-03
            fg/bg=(32/96), time cost: 0.561955
            rpn_cls: 0.7366, rpn_box: 0.4397, rcnn_cls: 4.5196, rcnn_box 0.5251
[session 1][epoch  1][iter  100/234532] loss: 2.2908, lr: 1.00e-03
            fg/bg=(32/96), time cost: 32.818758
            rpn_cls: 0.2705, rpn_box: 0.1998, rcnn_cls: 0.6918, rcnn_box 0.5717
[session 1][epoch  1][iter  200/234532] loss: 2.0520, lr: 1.00e-03
            fg/bg=(32/96), time cost: 33.719465
            rpn_cls: 0.2980, rpn_box: 0.1646, rcnn_cls: 1.1626, rcnn_box 0.6135

I'm experiencing the same error when resuming training and will try to debug it.
I also trained my model with the --cuda flag, on a custom dataset. I am unable to continue training at the moment because of the same error.
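
One possible stopgap, if the training script cannot be edited easily, is to move the restored optimizer state onto the model's device by hand after the checkpoint is loaded. This is only a sketch under assumptions: `move_optimizer_state` is a hypothetical helper, and the `nn.Linear` model stands in for the detector.

```python
import torch
import torch.nn as nn


def move_optimizer_state(optimizer, device):
    """Move tensors held in the optimizer state (e.g. momentum buffers) to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            # Leave "step" alone: some optimizers may expect it to stay on the CPU.
            if key != "step" and torch.is_tensor(value):
                state[key] = value.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2).to(device)  # hypothetical stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model(torch.randn(3, 4, device=device)).sum().backward()
optimizer.step()

# In a resume script this call would go right after optimizer.load_state_dict(...).
move_optimizer_state(optimizer, device)
optimizer.step()  # momentum buffers and parameters now share a device
```

This treats the symptom rather than the ordering of model and optimizer setup, so it is best seen as a temporary measure.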

I ran into a similar problem. For me, the issue was solved by moving the lines:

  if args.cuda:
    fasterRCNN.cuda()

above the assignment of the optimizer, i.e. above:

  if args.optimizer == "adam":
    lr = lr * 0.1
    optimizer = torch.optim.Adam(params)

  elif args.optimizer == "sgd":
    optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

According to the PyTorch documentation for torch.optim, a model that needs to be moved to the GPU with .cuda() should be moved before the optimizer for it is constructed.
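
The ordering described in this comment can be sketched as follows. This is a minimal illustration, not the actual code of trainval_net.py: `make_checkpoint` and `build_and_resume` are hypothetical helpers, the `nn.Linear` model stands in for `fasterRCNN`, and the checkpoint is kept in memory instead of being read from a `.pth` file. A likely reason the order matters is that `optimizer.load_state_dict` casts the restored momentum buffers to the device the parameters are on at that moment, so the model should already be on its final device when it is called. That would also explain why a fresh run works: without a checkpoint, SGD only creates its momentum buffers during the first `step()`, by which time the parameters are already on the GPU.

```python
import torch
import torch.nn as nn


def make_checkpoint():
    """Train one step on the CPU and return model and optimizer state dicts."""
    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    model(torch.randn(3, 4)).sum().backward()
    optimizer.step()  # populates the momentum buffers that get saved
    return {"model": model.state_dict(), "optimizer": optimizer.state_dict()}


def build_and_resume(checkpoint, use_cuda):
    """Rebuild model and optimizer, moving the model to its device first."""
    device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")
    model = nn.Linear(4, 2)
    model.to(device)  # 1. move the model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # 2. build the optimizer
    model.load_state_dict(checkpoint["model"])  # 3. restore the weights
    optimizer.load_state_dict(checkpoint["optimizer"])  # 4. restore state onto the params' device
    return model, optimizer, device


checkpoint = make_checkpoint()
model, optimizer, device = build_and_resume(checkpoint, use_cuda=True)
model(torch.randn(3, 4, device=device)).sum().backward()
optimizer.step()  # buffers and gradients are on the same device
print("resumed on", device)
```

The example falls back to the CPU when no GPU is available, so it only demonstrates the ordering; the device mismatch itself needs a CUDA machine to reproduce.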

@AlexanderHustinx please send a PR with that, great work!

When I kept the training running, my system restarted (with a large dataset). Please help.

Where is the pytorch1.0 branch?
