Detectron2: Continuing training with a different config

Created on 29 Nov 2019 · 11 Comments · Source: facebookresearch/detectron2

Hi, I am training a Faster-RCNN R50-FPN 3x model on a custom dataset using code similar to that shown in the Colab notebook. I am trying to take one of the saved models and continue training it with a different learning rate by changing the config. However, I tried changing the BASE_LR and STEPS variables in both the config YAML file and the training code, with no success. Training always seems to continue with the schedule it was first initialized with. Is it possible to continue training a saved model with a different config? Thank you for your help.

onatsahin

👍6

Most helpful comment

Has anyone solved this issue with steps?

Because the checkpoint saves trainer.scheduler.milestones, trainer.resume_or_load(resume=True) takes the old milestones from the checkpoint and overrides the scheduler config.
This can be fixed by overriding the milestones again, after calling resume_or_load, with:
trainer.scheduler.milestones = cfg.SOLVER.STEPS

koonyook on 8 Jun 2020

👍3 🎉1
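The mechanism described in this comment can be illustrated with a small toy model. The sketch below is not Detectron2 code: `ToyMultiStepScheduler` and its attributes are hypothetical stand-ins, and the only assumption carried over from the comment is that the milestones are stored in the scheduler state and therefore come back when that state is restored.

```python
from bisect import bisect_right


class ToyMultiStepScheduler:
    """Toy stand-in for a multi-step LR scheduler (hypothetical, not Detectron2's class)."""

    def __init__(self, base_lr, milestones, gamma=0.1):
        self.base_lr = base_lr
        self.milestones = list(milestones)
        self.gamma = gamma
        self.last_iter = 0

    def state_dict(self):
        # In this toy, every attribute is saved, milestones included.
        return {k: (list(v) if isinstance(v, list) else v) for k, v in self.__dict__.items()}

    def load_state_dict(self, state):
        # Restoring overwrites whatever the constructor received from the new config.
        self.__dict__.update(state)

    def get_lr(self):
        # One decay by `gamma` for every milestone already reached.
        return self.base_lr * self.gamma ** bisect_right(self.milestones, self.last_iter)


# First run: old schedule, checkpointed at iteration 1000.
old = ToyMultiStepScheduler(base_lr=0.01, milestones=(5000, 8000))
old.last_iter = 1000
checkpoint = {"scheduler": old.state_dict()}

# Resumed run: built from the new config, then restored from the checkpoint.
new_steps = (1100, 1200)
resumed = ToyMultiStepScheduler(base_lr=0.01, milestones=new_steps)
resumed.load_state_dict(checkpoint["scheduler"])
milestones_after_resume = list(resumed.milestones)  # the old milestones are back

# Re-apply the new milestones after resuming, as the comment suggests.
resumed.milestones = list(new_steps)
resumed.last_iter = 1150
lr_after_fix = resumed.get_lr()  # decayed once: the first new milestone has passed
```

In the toy, the restored state silently replaces the new milestones, and reassigning them after the restore is what makes the new schedule take effect. Whether other restored fields (for example the base learning rate) also need to be reset in a real run is not something this sketch settles.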

All 11 comments

The current iteration number is saved as part of the checkpoint. If you'd like it to start from iteration zero, you can remove it from the checkpoint with torch.save and torch.load.

ppwwyyxx on 29 Nov 2019

👍1
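As a rough sketch of what removing entries from a checkpoint could look like, the helper below works on a plain dictionary. The function name `strip_checkpoint_state` is hypothetical, and the key names ("model", "optimizer", "scheduler", "iteration") are assumptions based on what is mentioned in this thread; a real checkpoint should be inspected before relying on them.

```python
def strip_checkpoint_state(checkpoint, keys=("iteration",)):
    """Return a shallow copy of `checkpoint` without the given top-level keys."""
    return {k: v for k, v in checkpoint.items() if k not in keys}


# Stand-in for the dict that torch.load would return for a saved checkpoint.
checkpoint = {
    "model": {"weight": [0.1, 0.2]},
    "optimizer": {"lr": 0.02},
    "scheduler": {"milestones": [210000, 250000]},
    "iteration": 269999,
}

# Drop only the iteration counter, so training would start from iteration zero.
restart_from_zero = strip_checkpoint_state(checkpoint)

# Drop optimizer and scheduler state as well, keeping only the model weights.
weights_only = strip_checkpoint_state(
    checkpoint, keys=("iteration", "optimizer", "scheduler")
)
```

With a real file, the dictionary would come from `torch.load(path, map_location="cpu")` and the stripped copy would be written back with `torch.save(stripped, new_path)`, as suggested in the comment above.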

Thank you for your answer. By removing 'optimizer' and 'scheduler' from the model dictionary, I was able to set a new learning rate. However, I still can't make the steps work: the iteration number passes the step, but the learning rate does not drop.

onatsahin on 29 Nov 2019

Posting your full logs would help.

ppwwyyxx on 30 Nov 2019

The current iteration number is saved as part of the checkpoint. If you'd like it to start from iteration zero, you can remove it from the checkpoint with torch.save and torch.load.

Can you explain this in more detail? Should I do this?
torch.save(trainer.model.state_dict(), "my_Path")
cfg.merge_from_file("one_of_the_default_models.yaml")
cfg.MODEL.WEIGHTS = torch.load_state_dict(torch.load("my_Path"))
and then change the lr and other parameters?

GiovanniPasq on 30 Nov 2019

Closing as the requested information is not provided.

ppwwyyxx on 12 Dec 2019

I am also facing the same problem. I did what @ppwwyyxx mentioned: I saved only the model and the iteration number and resumed training with a new learning rate. It started with the new learning rate but, as @onatsahin mentioned, the learning rate didn't drop after the specified steps. These are my configuration settings.

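import os
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file("configs/COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("fashion_train",)  # trailing comma makes this a one-element tuple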
cfg.DATASETS.TEST = ()   # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 8
cfg.MODEL.WEIGHTS = "models/model_final_68b088.pkl"  # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 1    # Number of images processed in one iteration
cfg.SOLVER.CHECKPOINT_PERIOD = 50000  # Save a checkpoint every 50000 iterations
cfg.SOLVER.BASE_LR = 0.0001
cfg.SOLVER.MAX_ITER = 5500000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.SOLVER.WARMUP_ITERS = 10
cfg.SOLVER.STEPS = (1950119,2012239,2112239,2212239,2412239,2812239)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 23
# cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST = 0.2
cfg.MODEL.MASK_ON=False    # No Segmentation
cfg.OUTPUT_DIR ='snapshots_v3'
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=True)
trainer.train()

And this is the output:

[01/06 16:13:15 d2.engine.train_loop]: Starting training from iteration 1950000
[01/06 16:14:01 d2.utils.events]: eta: 97 days, 17:40:04  iter: 1950019  total_loss: 0.693  loss_cls: 0.280  loss_box_reg: 0.365  loss_rpn_cls: 0.004  loss_rpn_loc: 0.022  time: 2.4198  data_time: 0.0009  lr: 0.000100  max_mem: 3747M
[01/06 16:14:49 d2.utils.events]: eta: 97 days, 17:39:17  iter: 1950039  total_loss: 0.628  loss_cls: 0.239  loss_box_reg: 0.317  loss_rpn_cls: 0.004  loss_rpn_loc: 0.020  time: 2.3958  data_time: 0.0012  lr: 0.000100  max_mem: 3747M
[01/06 16:15:38 d2.utils.events]: eta: 98 days, 2:39:44  iter: 1950059  total_loss: 0.601  loss_cls: 0.311  loss_box_reg: 0.305  loss_rpn_cls: 0.014  loss_rpn_loc: 0.011  time: 2.4087  data_time: 0.0013  lr: 0.000100  max_mem: 3747M
[01/06 16:16:26 d2.utils.events]: eta: 98 days, 15:02:55  iter: 1950079  total_loss: 0.521  loss_cls: 0.187  loss_box_reg: 0.330  loss_rpn_cls: 0.001  loss_rpn_loc: 0.014  time: 2.4068  data_time: 0.0013  lr: 0.000100  max_mem: 3747M
[01/06 16:17:13 d2.utils.events]: eta: 98 days, 6:15:08  iter: 1950099  total_loss: 0.582  loss_cls: 0.178  loss_box_reg: 0.365  loss_rpn_cls: 0.002  loss_rpn_loc: 0.017  time: 2.4021  data_time: 0.0014  lr: 0.000100  max_mem: 3747M
[01/06 16:18:01 d2.utils.events]: eta: 98 days, 6:14:20  iter: 1950119  total_loss: 0.618  loss_cls: 0.219  loss_box_reg: 0.347  loss_rpn_cls: 0.004  loss_rpn_loc: 0.016  time: 2.3980  data_time: 0.0012  lr: 0.000100  max_mem: 3747M
[01/06 16:18:49 d2.utils.events]: eta: 98 days, 10:33:12  iter: 1950139  total_loss: 0.529  loss_cls: 0.168  loss_box_reg: 0.325  loss_rpn_cls: 0.003  loss_rpn_loc: 0.011  time: 2.3970  data_time: 0.0012  lr: 0.000100  max_mem: 3747M

It started with the correct iteration number (1950000) and the learning rate I specified, but I was expecting it to reduce the learning rate after 1950119 iterations, as defined in STEPS. It didn't.
What am I missing?

rsadiq on 6 Jan 2020

You also need to keep "scheduler" in the checkpoint, otherwise the scheduler starts from 0.

ppwwyyxx on 6 Jan 2020
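One plausible reading of the log above, following this reply, is that the scheduler's own step counter restarted at 0 while the trainer's iteration counter resumed at 1950000. The sketch below works through that arithmetic with a generic multi-step rule. The function `multistep_lr` is hypothetical, warmup is omitted, and the decay factor `gamma=0.1` is an assumption, since it is not shown in the posted config.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, scheduler_iter, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone the scheduler has reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), scheduler_iter)


# Values taken from the config and log posted above.
steps = (1950119, 2012239, 2112239, 2212239, 2412239, 2812239)
base_lr = 0.0001
resume_iter = 1950000
trainer_iter = 1950139  # last iteration shown in the log

# Scheduler state kept in the checkpoint: its counter matches the trainer's.
lr_synced = multistep_lr(base_lr, steps, trainer_iter)

# Scheduler state dropped: its counter restarted at 0 and has advanced 139 steps.
lr_reset = multistep_lr(base_lr, steps, trainer_iter - resume_iter)
```

Under these assumptions, `lr_synced` has decayed once (about 0.00001) while `lr_reset` is still 0.0001, which matches the unchanged `lr: 0.000100` in the log. A scheduler counting from 0 would not reach the first milestone until 1950119 further iterations had run.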

Thank you for your response, @ppwwyyxx. When we add "scheduler" to the checkpoint, training starts from the previous learning rate (not the one we now define as BASE_LR), and again, after passing the steps, it does not decrease the lr.
PS: I also tried adding the new steps as milestones and the new learning rates to the "scheduler". In that case it starts with the new lr, but it still does not drop the lr.

rsadiq on 6 Jan 2020

👍1

Has anyone solved this issue with steps?