I have trained a model on my own dataset (4 classes + background) starting from catalog://ImageNetPretrained/MSRA/R-50.
Now I want to use the output model from that training run, model_final.pth, as the new starting point (instead of R-50) to train on new labeled data.
I set MODEL.WEIGHT in my config to the path of "model_final.pth", and it is able to load the model with the right number of layers and dimensions. However, when running training (specifically model.train() inside the do_train(...) function in engine/trainer.py), it skips all training and immediately finishes, as shown here:
2019-01-23 00:13:49,014 maskrcnn_benchmark.trainer INFO: Start training
2019-01-23 00:13:49,082 maskrcnn_benchmark.trainer INFO: Total training time: 0:00:00.067745 (0.0001 s / it)
Am I missing a step? Since I am fine-tuning with the same number of classes, I don't think I need to remove any layers (e.g. cls_score); even when I do remove those layers, the model still does not train.
Thanks for the help in advance!
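For reference, here is roughly how I override the config for the second run (a minimal sketch; the config file name and paths are placeholders):

from maskrcnn_benchmark.config import cfg

# hypothetical config file for the fine-tuning run
cfg.merge_from_file("configs/my_finetune_config.yaml")
# point MODEL.WEIGHT at the previous run's output instead of the R-50 catalog entry
cfg.merge_from_list(["MODEL.WEIGHT", "path/to/first_run/model_final.pth"])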
Hi,
What is probably happening is that your OUTPUT_DIR is the same as the previous OUTPUT_DIR from your other training run, and there is a last_checkpoint file in it. maskrcnn-benchmark identifies this file and loads the checkpoint from there.
You might want to change the OUTPUT_DIR to point to a different folder, and I believe this will fix the issue for you.
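Roughly, the resume check works like this (a sketch of the behaviour, not the exact code in utils/checkpoint.py):

import os

def has_checkpoint(save_dir):
    # a previous run leaves a "last_checkpoint" marker file in OUTPUT_DIR
    return os.path.exists(os.path.join(save_dir, "last_checkpoint"))

def get_checkpoint_file(save_dir):
    # the marker simply contains the path of the most recent .pth checkpoint
    with open(os.path.join(save_dir, "last_checkpoint"), "r") as f:
        return f.read().strip()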
Let me know if you have further questions
@fmassa thanks for the quick response. Unfortunately, the last_checkpoint being loaded is not the issue. I forgot to mention that I had already tried training into a new output folder.
I double-checked, and utils/checkpoint.py returns False from the has_checkpoint function.
Is it possible that training is blocked because model_final.pth was trained with the default _C.SOLVER.MAX_ITER = 40000, or because some of the solver configs I am using are the same as in the first training run?
Thank you in advance for your help!
@adrifloresm
Is it possible that training is blocked because model_final.pth was trained with the default _C.SOLVER.MAX_ITER = 40000, or because some of the solver configs I am using are the same as in the first training run?
Hmm, that's interesting. Can you try setting SOLVER.MAX_ITER to something like 50000 and see if it starts training? If so, then this might be unexpected behavior that should be fixed, I think.
@fmassa yes, that was exactly the issue! The system does not train unless SOLVER.MAX_ITER is higher than the iteration count from the initial training run.
Do you have any ideas on how to avoid this? I would like to be able to train for only a few epochs every time new labeled data arrives.
Also note that the training log shows training starting right after the previous maximum iteration; it only trains from the prior MAX_ITER to the new MAX_ITER.
For example:
2019-01-23 19:12:15,924 maskrcnn_benchmark.trainer INFO: eta: 0:10:28 iter: **40260** loss: 0.0200 (0.0217) loss_box_reg: 0.0018 (0.0025) loss_classifier: 0.0092 (0.0117) loss_objectness: 0.0012 (0.0059) loss_rpn_box_reg: 0.0004 (0.0015) time: 0.8148 (0.8487) data: 0.0020 (0.0414) lr: 0.000100 max mem: 3817
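If I understand the trainer correctly, what happens is roughly this (a simplified, self-contained sketch, not the actual engine/trainer.py code):

def do_train_sketch(batches, arguments, max_iter):
    # the checkpoint restores the old iteration count into arguments
    start_iter = arguments["iteration"]              # e.g. 40000 from model_final.pth
    for iteration in range(start_iter, max_iter):    # empty range when start_iter >= max_iter
        batch = batches[iteration % len(batches)]
        # forward pass, loss, backward pass and optimizer step would go here
        arguments["iteration"] = iteration + 1

# with MAX_ITER == 40000 and a checkpoint saved at iteration 40000 the loop body
# never runs, which matches the "Total training time: 0:00:00" log above
do_train_sketch(batches=[0, 1, 2], arguments={"iteration": 40000}, max_iter=40000)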
I think the problem is that when you load pretrained weights, you load not only the model weights but also the optimizer state and scheduler state, which is the default behavior in this project. This works well if your training stops unexpectedly and you want to continue with the optimizer and scheduler state intact, but it is probably not what you want if you want to start another training schedule from an already well-trained model.
An easy solution is to save only the model weights, and to load only the model weights when loading the pretrained model.
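Roughly, the default load behaves like this (a simplified sketch, not the exact Checkpointer code):

import torch

def load_full_checkpoint(path, model, optimizer, scheduler):
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint.pop("model"))
    # by default the optimizer and scheduler states are restored as well
    if "optimizer" in checkpoint:
        optimizer.load_state_dict(checkpoint.pop("optimizer"))
    if "scheduler" in checkpoint:
        scheduler.load_state_dict(checkpoint.pop("scheduler"))
    # whatever is left (e.g. {"iteration": 40000}) is returned and merged into
    # the training arguments, so the run "resumes" at the old iteration
    return checkpoint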
Thanks for the confirmation.
So indeed, the problem is the difference between restarting a job (in which case we want to resume from the exact point of the last checkpoint) and starting a new training job from a full checkpoint.
A few questions:
- do you want to use the same optimizer / learning rate schedules as the ones just after the initial training was over?
I think we might need to add an extra config option for this particular use-case (load everything, but ignore the iteration counts). Thoughts?
Thank you for the reply!
@fmassa I don't want to use the same optimizer or learning rate scheduler. I think the solution I am looking for is just loading the model weights, without the optimizer and scheduler state, as @txytju mentioned.
How can I save just the model weights and how do I load just the model weights?
Can I do this for the model I have already trained (model_final.pth)?
Thanks in advance for your help.
Hi @adrifloresm ,
In order to do that, all you need to do is open a Python interpreter, load the checkpoint, and save a new dict with only the model state_dict in it.
Something like:
import torch
# load the full training checkpoint (model, optimizer, scheduler, iteration)
original = torch.load('path/to/your/checkpoint.pth')
# keep only the model weights
new = {"model": original["model"]}
torch.save(new, 'path/to/new/checkpoint.pth')
Let me know if it solves your problem.
@fmassa Thank you so much for your reply and example code! It worked like a charm! I appreciate all your help!
What do you think of having an option LOAD_ONLY_WEIGHTS? Maybe this is the most common use-case.
@LeviViana adding this as an option is a possibility, but I'm not sure it's the best approach here.
For example, if you also want to remove the last layer to train on a new dataset, should we add another config option, REMOVE_LAST_LAYER_WEIGHTS, or leave this to the user to do?
Given how the config could potentially become more complicated, I'm inclined towards just letting the user perform the model surgery themselves, as it's more general and does not require many manual steps anyway.
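For concreteness, the kind of model surgery I mean looks roughly like this (a sketch; the cls_score / bbox_pred / mask_fcn_logits names below assume the usual predictor naming, and the paths are placeholders):

import torch

ckpt = torch.load("path/to/model_final.pth", map_location="cpu")
weights = ckpt["model"]
# drop the class-dependent heads so a dataset with a different number of
# classes can relearn them from scratch
for key in [k for k in weights
            if "cls_score" in k or "bbox_pred" in k or "mask_fcn_logits" in k]:
    del weights[key]
torch.save({"model": weights}, "path/to/trimmed_checkpoint.pth")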
But I might be missing something here, so do let me know if you think otherwise.
I agree. My idea was basically to see whether we could add just a couple of config options that address the majority of use cases. This would only marginally increase the complexity of the config options while bringing some extra convenience. But I don't know what those config options should be.
Regarding the model surgery, I think this is a great feature of this library. There is some documentation available explaining how to do it properly, but maybe some users would be interested in an example of fine-tuning, a sort of tutorial. What do you think?
A tutorial on how to finetune a model for a new dataset would be great!
I'd like to hold off a bit on adding yet more config options for now. I think that once we have the tutorial, we will be able to see what the biggest pain points are and have a better idea of how to address them.
What about adding the following code in train_net.py? Below the line:
arguments.update(extra_checkpoint_data)
add:
arguments["iteration"] = 1