Maskrcnn-benchmark: How to resume training with a different learning rate

Created on 19 Feb 2019  ·  7 Comments  ·  Source: facebookresearch/maskrcnn-benchmark

❓ Questions and Help

I wonder if there is a way to change the learning rate when resuming training from a checkpoint. I want to do this because my loss becomes 'nan' during training, and since I do not want to retrain from scratch, changing the learning rate and then resuming from the latest checkpoint seems like a reasonable solution. (my guess)

BTW, I have tried changing SOLVER.BASE_LR in the config file, but it does not work.

question


All 7 comments

Hi,

The learning rate is stored in the checkpoint; that's why changing SOLVER.BASE_LR doesn't work in your case.

I'd say at least you'd need to load the checkpoint in an interpreter, and remove / change the learning rate scheduler.

You might also want to remove the optimizer state, as it might also contain corrupted data.
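For example, something along these lines (an untested sketch; the checkpoint filename is just a placeholder, and the "optimizer"/"scheduler" keys are the ones this repo's checkpointer saves):

    import torch

    # Load the latest checkpoint on CPU (the filename is just an example).
    checkpoint = torch.load("model_0012500.pth", map_location="cpu")

    # Drop the stored optimizer and scheduler states so that, when training
    # resumes, the optimizer/scheduler are rebuilt from the config instead.
    checkpoint.pop("optimizer", None)
    checkpoint.pop("scheduler", None)

    # Save the stripped checkpoint and resume training from it.
    torch.save(checkpoint, "model_0012500_stripped.pth")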

Let me know if you have further questions.

Hi,

Thanks for your kind suggestion.

I wonder, could a small learning rate also be a reason for the loss to become "nan"? I found something weird: when I trained my model with a learning rate of 0.002, the loss became "nan" after 12500+ iterations. However, after I changed it to 0.0015, a smaller learning rate as suggested in #33, the loss became "nan" after only 9000+ iterations.

FYI, I was using different GPUs at the same time: 2 Titan X (Pascal) and 1 Titan Xp. Could this be a possible reason? My IMS_PER_BATCH is 6 and I was using the Faster R-CNN X-101 model. The learning rate should be 0.00375 if I follow the rule here.
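For context, here is the arithmetic I followed for that number (the reference values below are my assumption of what the reference config uses, not values I checked in the repo):

    # Linear scaling rule: the learning rate scales with the global batch size.
    reference_lr = 0.01       # assumed BASE_LR of the reference config
    reference_batch = 16      # assumed reference batch size (8 GPUs x 2 images)
    my_batch = 6              # my IMS_PER_BATCH
    print(reference_lr * my_batch / reference_batch)  # 0.00375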

This is weird. Can you try using only 2 GPUs? There might be some weird mix happening somewhere.
Also, is this one of the standard models, or does it have something new?

I have tried a similar configuration on 2 RTX 2070s, and it seems to work pretty well. I have not tried it on the Titan X yet; I will tell you once I get the result.

Well, I finally found that the problem was not that I was using different GPUs at the same time; it was that I had more than one image per GPU per batch. It ran perfectly on the RTX 2070s because the 2070's memory is not enough for two images (so it was effectively one image per GPU). Now I can run the model in parallel on 4 different GPUs as long as I keep 1 image per GPU per batch, but the 'nan' problem still exists if I have more than one image per GPU per batch. I tried to run the same model on 2 Titan X with 2 images per GPU per batch, and the result would still be 'nan' after several thousand iterations regardless of the learning rate.
By the way, the models I used were the standard models under the "config/" directory.

If you haven't changed anything in the original configs and have followed the learning rate adaptation rules, then this is weird to me and I haven't faced it. It might be some weirdness in your version of cuDNN or PyTorch, maybe?
Anyway, glad to see that you managed to fix your problem.

Hi, you can comment out self.optimizer.load_state_dict(checkpoint.pop("optimizer")) and self.scheduler.load_state_dict(checkpoint.pop("scheduler")) in maskrcnn_benchmark/utils/checkpoint.py like this:

        if "optimizer" in checkpoint and self.optimizer:
            self.logger.info("Loading optimizer from {}".format(f))
            # self.optimizer.load_state_dict(checkpoint.pop("optimizer"))
        if "scheduler" in checkpoint and self.scheduler:
            self.logger.info("Loading scheduler from {}".format(f))
            # self.scheduler.load_state_dict(checkpoint.pop("scheduler"))

Then it will use the optimizer and scheduler that you defined in train_net.py, so that SOLVER.BASE_LR will take effect.
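For reference, a simplified sketch of why this works (the config path and the new LR value below are just illustrative): train_net.py rebuilds the optimizer and scheduler from the config, so once the saved states are no longer loaded on top of them, the learning rate comes from SOLVER.BASE_LR:

    from maskrcnn_benchmark.config import cfg
    from maskrcnn_benchmark.modeling.detector import build_detection_model
    from maskrcnn_benchmark.solver import make_lr_scheduler, make_optimizer

    # Rough equivalent of what tools/train_net.py does at startup.
    cfg.merge_from_file("configs/e2e_faster_rcnn_X_101_32x8d_FPN_1x.yaml")
    cfg.merge_from_list(["SOLVER.BASE_LR", 0.0025])  # the new LR to resume with

    model = build_detection_model(cfg)
    optimizer = make_optimizer(cfg, model)
    print(optimizer.param_groups[0]["lr"])         # taken from SOLVER.BASE_LR
    scheduler = make_lr_scheduler(cfg, optimizer)  # warmup/decay steps still follow the config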
