I was training on COCO. Everything went smoothly, and I got into the 3rd epoch before pausing the training. When I went to restart, I got errors. I'm unsure how to proceed with debugging or cleaning up.
CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset coco --net res101 --bs 1 --nw 1 --lr .001 --lr_decay_step 10 --cuda --r true --checksession 1 --checkepoch 2 --checkpoint 234531 --use_tfb
234532 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
loading checkpoint models/res101/coco/faster_rcnn_1_2_234531.pth
loaded checkpoint models/res101/coco/faster_rcnn_1_2_234531.pth
Traceback (most recent call last):
  File "trainval_net.py", line 339, in <module>
    optimizer.step()
  File "/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
(pytorch1_py36) emcp@k:faster-rcnn.pytorch$
I guess you need to convert the CPU model to GPU.
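The traceback suggests the mismatch sits in the optimizer rather than in the model weights: `buf` in `sgd.py` is the SGD momentum buffer, and it appears to be on the CPU while the gradient `d_p` is on the GPU. One possible workaround, sketched here under the assumption of plain SGD with momentum, is to move the optimizer's state tensors to the model's device after the checkpoint has been loaded. `move_optimizer_state` is a hypothetical helper name, not part of faster-rcnn.pytorch or PyTorch, and the small `nn.Linear` model only stands in for `fasterRCNN`.

```python
import torch
import torch.nn as nn


def move_optimizer_state(optimizer, device):
    """Move every tensor held in the optimizer's per-parameter state to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)


# Tiny stand-in model (assumption); the real script would use fasterRCNN here.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# One training step so that the momentum buffers exist.
model(torch.randn(3, 4, device=device)).sum().backward()
optimizer.step()

# Assumed call site: after optimizer.load_state_dict(...) and after the
# model has been moved to its final device.
move_optimizer_state(optimizer, device)
```

This only patches the symptom; it does not change the order in which the script builds the model and the optimizer.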
I ran everything with CUDA on, according to my inputs at the terminal. I will double-check, though.
That model came from this call, so I am a little lost as to why it is being treated as a CPU model:
$ CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset coco --net res101 --bs 1 --nw 1 --lr .001 --lr_decay_step 10 --cuda
I had paused it manually at epoch 3. Did that perhaps cause an issue?
Full output of the call that was eventually paused:
(pytorch1_py36) emcp@k:/faster-rcnn.private$ CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset coco --net res101 --bs 1 --nw 1 --lr .001 --lr_decay_step 10 --cuda
Called with args:
Namespace(batch_size=1, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='coco', disp_interval=100, large_scale=False, lr=0.001, lr_decay_gamma=0.1, lr_decay_step=10, mGPUs=False, max_epochs=20, net='res101', num_workers=1, optimizer='sgd', resume=False, save_dir='models', session=1, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [4, 8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/faster-rcnn.private/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'res101',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 50,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/faster-rcnn.private',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 128,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 20,
'DOUBLE_BIAS': False,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.001,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0001},
'USE_GPU_NMS': True}
loading annotations into memory...
Done (t=8.68s)
creating index...
index created!
Loaded dataset `coco_2014_train` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
coco_2014_train gt roidb loaded from /faster-rcnn.private/data/cache/coco_2014_train_gt_roidb.pkl
done
Preparing training data...
done
loading annotations into memory...
Done (t=4.52s)
creating index...
index created!
Loaded dataset `coco_2014_valminusminival` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
coco_2014_valminusminival gt roidb loaded from /faster-rcnn.private/data/cache/coco_2014_valminusminival_gt_roidb.pkl
done
Preparing training data...
done
loading annotations into memory...
Done (t=3.64s)
creating index...
index created!
before filtering, there are 236574 images...
after filtering, there are 234532 images...
234532 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch 1][iter 0/234532] loss: 6.2209, lr: 1.00e-03
fg/bg=(32/96), time cost: 0.561955
rpn_cls: 0.7366, rpn_box: 0.4397, rcnn_cls: 4.5196, rcnn_box 0.5251
[session 1][epoch 1][iter 100/234532] loss: 2.2908, lr: 1.00e-03
fg/bg=(32/96), time cost: 32.818758
rpn_cls: 0.2705, rpn_box: 0.1998, rcnn_cls: 0.6918, rcnn_box 0.5717
[session 1][epoch 1][iter 200/234532] loss: 2.0520, lr: 1.00e-03
fg/bg=(32/96), time cost: 33.719465
rpn_cls: 0.2980, rpn_box: 0.1646, rcnn_cls: 1.1626, rcnn_box 0.6135
I'm experiencing the same error when resuming training and will try to debug it.
I also trained my model with the --cuda flag, on a custom dataset. I am unable to retrain at the moment because of the same error.
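For anyone debugging this, one quick check is to compare the devices of the model parameters with the devices of the tensors the optimizer keeps in its state. The following is a minimal sketch; `device_report` is a hypothetical helper, and the small `nn.Linear` model is only a stand-in for `fasterRCNN`.

```python
import torch
import torch.nn as nn


def device_report(model, optimizer):
    """Return (parameter devices, optimizer-state devices) as sets of strings."""
    param_devices = {str(p.device) for p in model.parameters()}
    state_devices = {
        str(value.device)
        for state in optimizer.state.values()
        for value in state.values()
        if torch.is_tensor(value)
    }
    return param_devices, state_devices


# Stand-in model and optimizer (assumptions for illustration only).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()  # creates the momentum buffers

param_devices, state_devices = device_report(model, optimizer)
# If the two sets differ right before optimizer.step(), a mixed-device
# error like the one reported in this thread is likely.
print(param_devices, state_devices)
```

In the training script, the call would go just before the first `optimizer.step()` after resuming.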
I ran into a similar problem. For me, the issue was solved by moving the lines:
    if args.cuda:
        fasterRCNN.cuda()
above the assignment of the optimizer, i.e. above:
    if args.optimizer == "adam":
        lr = lr * 0.1
        optimizer = torch.optim.Adam(params)
    elif args.optimizer == "sgd":
        optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)
According to the PyTorch documentation, it is best practice to move the model to the GPU before constructing the optimizer.
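The ordering described in this comment can be sketched in a self-contained form: move the model to its device first, then construct the optimizer, then load the checkpoint. This is an illustration, not the actual `trainval_net.py` code. `build_and_resume` and `train_step` are hypothetical names, the `nn.Linear` model stands in for `fasterRCNN`, and the checkpoint keys `"model"` and `"optimizer"` are assumed.

```python
import torch
import torch.nn as nn


def build_and_resume(checkpoint=None, use_cuda=False):
    """Build a model and optimizer in the recommended order, optionally resuming."""
    device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")

    model = nn.Linear(4, 2)  # stand-in for fasterRCNN (assumption)
    model.to(device)         # 1. move the model to its device first

    # 2. only then construct the optimizer over the model's parameters
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    # 3. finally restore the saved state; the optimizer state is cast to
    #    the device of the parameters it belongs to
    if checkpoint is not None:
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
    return model, optimizer, device


def train_step(model, optimizer, device):
    """Run one dummy optimization step on random input."""
    inputs = torch.randn(3, 4, device=device)
    loss = model(inputs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


if __name__ == "__main__":
    # First session: train briefly and capture a checkpoint in memory.
    model, optimizer, device = build_and_resume(use_cuda=True)
    train_step(model, optimizer, device)
    checkpoint = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

    # Resumed session: same order, then keep training.
    model, optimizer, device = build_and_resume(checkpoint, use_cuda=True)
    train_step(model, optimizer, device)
```

With this order, the momentum buffers restored from the checkpoint end up on the same device as the parameters, so the resumed `optimizer.step()` does not mix CPU and GPU tensors.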
@AlexanderHustinx please send a PR with that, great work!
When I kept the training running, my system restarted (with large data). Please help.
Where is the pytorch1.0 branch?