Hi, I am training a Faster-RCNN R50-FPN 3x model on a custom dataset using code similar to the one shown in the Colab notebook. I am trying to take one of the saved models and continue training it with a different learning rate by changing the config. However, I tried changing the BASE_LR and STEPS variables in both the config yaml file and the training code, with no success. Training always seems to continue with the schedule it was first initialized with. Is it possible to continue training a saved model with a different config? Thank you for your help.
The current iteration number is saved as part of the checkpoint. If you'd like it to start from iteration zero, you can remove it from the checkpoint with torch.save and torch.load.
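A minimal sketch of that suggestion, assuming the checkpoint file is a plain dict with top-level keys such as "model", "optimizer", "scheduler" and "iteration" (print the keys of your own file to confirm). The helper names and the paths in the example are hypothetical.

```python
def drop_keys(checkpoint, keys):
    """Return a shallow copy of `checkpoint` without the given top-level keys."""
    return {k: v for k, v in checkpoint.items() if k not in keys}


def rewrite_checkpoint(src_path, dst_path, keys=("iteration",)):
    """Load a checkpoint, drop `keys`, and save the result to `dst_path`."""
    import torch  # imported lazily so drop_keys can be used without PyTorch

    checkpoint = torch.load(src_path, map_location="cpu")
    print("keys before:", sorted(checkpoint))  # check what the file really contains
    torch.save(drop_keys(checkpoint, keys), dst_path)


# Example (hypothetical paths):
# rewrite_checkpoint("output/model_0049999.pth", "output/model_restart.pth")
```

As far as I understand, with resume=True the trainer loads the file named in OUTPUT_DIR/last_checkpoint rather than cfg.MODEL.WEIGHTS, so the rewritten file has to be the one referenced there.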
Thank you for your answer. By removing 'optimizer' and 'scheduler' from the checkpoint dictionary I was able to set a new learning rate. However, I still can't make the steps work: the iteration number passes the step, but the learning rate does not drop.
Posting your full logs would help.
The current iteration number is saved as part of the checkpoint. If you'd like it to start from iteration zero, you can remove it from the checkpoint with torch.save and torch.load.
Can you explain this in more detail? Should I do this?
torch.save(trainer.model.state_dict(), "my_Path")
cfg.merge_from_file("one_of_the_default_models.yaml")
cfg.MODEL.WEIGHTS = torch.load_state_dict(torch.load("my_Path"))
and then change the lr and other parameters?
Closing as the requested information is not provided.
I am also facing the same problem. I did what @ppwwyyxx suggested: I saved only the model and the iteration number and resumed training with a new learning rate. Training started with the new learning rate, but as @onatsahin mentioned, the learning rate didn't drop after the specified steps. These are my configuration settings:
import os
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file("configs/COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("fashion_train",)  # trailing comma makes this a one-element tuple
cfg.DATASETS.TEST = () # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 8
cfg.MODEL.WEIGHTS = "models/model_final_68b088.pkl" # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 1 # Number of images processed in one iteration
cfg.SOLVER.CHECKPOINT_PERIOD = 50000 # Save a checkpoint every 50000 iterations
cfg.SOLVER.BASE_LR = 0.0001
cfg.SOLVER.MAX_ITER = 5500000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.SOLVER.WARMUP_ITERS = 10
cfg.SOLVER.STEPS = (1950119,2012239,2112239,2212239,2412239,2812239)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 23
# cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST = 0.2
cfg.MODEL.MASK_ON = False # No segmentation
cfg.OUTPUT_DIR = 'snapshots_v3'
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=True)
trainer.train()
And this is the output:
[01/06 16:13:15 d2.engine.train_loop]: Starting training from iteration 1950000
[01/06 16:14:01 d2.utils.events]: eta: 97 days, 17:40:04 iter: 1950019 total_loss: 0.693 loss_cls: 0.280 loss_box_reg: 0.365 loss_rpn_cls: 0.004 loss_rpn_loc: 0.022 time: 2.4198 data_time: 0.0009 lr: 0.000100 max_mem: 3747M
[01/06 16:14:49 d2.utils.events]: eta: 97 days, 17:39:17 iter: 1950039 total_loss: 0.628 loss_cls: 0.239 loss_box_reg: 0.317 loss_rpn_cls: 0.004 loss_rpn_loc: 0.020 time: 2.3958 data_time: 0.0012 lr: 0.000100 max_mem: 3747M
[01/06 16:15:38 d2.utils.events]: eta: 98 days, 2:39:44 iter: 1950059 total_loss: 0.601 loss_cls: 0.311 loss_box_reg: 0.305 loss_rpn_cls: 0.014 loss_rpn_loc: 0.011 time: 2.4087 data_time: 0.0013 lr: 0.000100 max_mem: 3747M
[01/06 16:16:26 d2.utils.events]: eta: 98 days, 15:02:55 iter: 1950079 total_loss: 0.521 loss_cls: 0.187 loss_box_reg: 0.330 loss_rpn_cls: 0.001 loss_rpn_loc: 0.014 time: 2.4068 data_time: 0.0013 lr: 0.000100 max_mem: 3747M
[01/06 16:17:13 d2.utils.events]: eta: 98 days, 6:15:08 iter: 1950099 total_loss: 0.582 loss_cls: 0.178 loss_box_reg: 0.365 loss_rpn_cls: 0.002 loss_rpn_loc: 0.017 time: 2.4021 data_time: 0.0014 lr: 0.000100 max_mem: 3747M
[01/06 16:18:01 d2.utils.events]: eta: 98 days, 6:14:20 iter: 1950119 total_loss: 0.618 loss_cls: 0.219 loss_box_reg: 0.347 loss_rpn_cls: 0.004 loss_rpn_loc: 0.016 time: 2.3980 data_time: 0.0012 lr: 0.000100 max_mem: 3747M
[01/06 16:18:49 d2.utils.events]: eta: 98 days, 10:33:12 iter: 1950139 total_loss: 0.529 loss_cls: 0.168 loss_box_reg: 0.325 loss_rpn_cls: 0.003 loss_rpn_loc: 0.011 time: 2.3970 data_time: 0.0012 lr: 0.000100 max_mem: 3747M
It started with the correct iteration number (1950000) and the learning rate I specified, but I was expecting it to reduce the learning rate after iteration 1950119, as defined in STEPS, and it didn't.
What am I missing?
You also need to keep "scheduler" in the checkpoint, otherwise the scheduler starts from 0.
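To see why the scheduler's own counter matters, here is a toy calculation, not detectron2 code. It assumes a plain multi-step schedule (the LR is multiplied by gamma=0.1 at each milestone, warmup ignored) and reuses the iteration numbers from the log above. The function name is made up for illustration.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, step, gamma=0.1):
    """LR of a multi-step schedule after `step` scheduler steps."""
    return base_lr * gamma ** bisect_right(sorted(milestones), step)


milestones = (1950119, 2012239)
trainer_iter = 1950139  # an iteration shown in the log above

# Scheduler state dropped from the checkpoint: its counter restarts at 0,
# so after 139 resumed iterations it has only taken 139 steps.
fresh_counter = trainer_iter - 1950000
lr_without_state = multistep_lr(1e-4, milestones, fresh_counter)

# Scheduler state kept: the counter matches the trainer iteration,
# so the first milestone has already been passed.
lr_with_state = multistep_lr(1e-4, milestones, trainer_iter)

print(lr_without_state, lr_with_state)
```

Under these assumptions the first value stays at the base LR, which matches the log, while the second has already been scaled down once.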
Thank you for your response @ppwwyyxx. When we add "scheduler" to the checkpoint, training starts from the previous learning rate (not the one we now define as BASE_LR), and again, after passing the steps, it does not decrease the lr.
PS: I also tried adding the new steps as milestones and the new learning rates to the "scheduler". In that case it starts with the new lr, but it still does not drop the lr.
Has anyone solved this issue with steps?
Save the model this way:
torch.save(trainer.model.state_dict(), "your path")
This saves only the model weights, without the optimizer, scheduler, or iteration number.
Has anyone solved this issue with steps?
Because the checkpoint saves trainer.scheduler.milestones,
trainer.resume_or_load(resume=True) will take the old milestones from the checkpoint and override the scheduler config.
This can be fixed by overriding the milestones again (after calling resume_or_load) with
trainer.scheduler.milestones = cfg.SOLVER.STEPS
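The mechanism can be illustrated with a toy scheduler. This is a stand-in written for this thread, not the detectron2 class; it assumes the scheduler's state_dict contains all of its attributes (milestones included) and that load_state_dict simply writes them back, which is how the explanation above describes it. The milestone values for the "old" run are invented.

```python
from bisect import bisect_right


class ToyMultiStepLR:
    """Toy stand-in for a multi-step scheduler; not the detectron2 class."""

    def __init__(self, base_lr, milestones, gamma=0.1):
        self.base_lr = base_lr
        self.milestones = tuple(milestones)  # must be sorted for bisect
        self.gamma = gamma
        self.last_epoch = 0

    def state_dict(self):
        # Everything, including the milestones, goes into the checkpoint.
        return dict(self.__dict__)

    def load_state_dict(self, state):
        # Restoring overwrites whatever the constructor was given.
        self.__dict__.update(state)

    def get_lr(self):
        passed = bisect_right(self.milestones, self.last_epoch)
        return self.base_lr * self.gamma ** passed


# Old run: a single, far-away milestone (invented value).
old = ToyMultiStepLR(1e-4, milestones=(3000000,))
old.last_epoch = 1950139
saved = old.state_dict()

new_steps = (1950119, 2012239)
resumed = ToyMultiStepLR(1e-4, milestones=new_steps)  # built from the new config
resumed.load_state_dict(saved)                        # what resuming does
lr_before_fix = resumed.get_lr()                      # still uses (3000000,)

resumed.milestones = new_steps                        # the override suggested above
lr_after_fix = resumed.get_lr()

print(lr_before_fix, lr_after_fix)
```

In this toy the base LR is restored from the saved state as well, which would be consistent with the earlier report that training resumes at the previous learning rate when "scheduler" is kept in the checkpoint.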