I am trying to train a pre-trained resnet50 with a small network on top of it: first with the resnet50 frozen, then unfreezing and training the whole network.
My questions are:
1 - Where is the most appropriate place in the framework to create parameter groups?
2 - Does it make sense to add options to freeze/unfreeze to support selectively freezing groups?
Regarding the current implementation of freeze/unfreeze, it has the side effect of setting the state of the model to eval/train. This seems inappropriate. If this is of interest, I am happy to make a pull request.
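For reference, the behavior I'm referring to is roughly the following; this is a sketch based on the description above, not the actual Lightning source:
```
def freeze(self) -> None:
    # disable gradients for every parameter...
    for param in self.parameters():
        param.requires_grad = False
    # ...and, as a side effect, also switch the whole module to eval mode
    self.eval()

def unfreeze(self) -> None:
    for param in self.parameters():
        param.requires_grad = True
    # symmetric side effect: the module is put back into train mode
    self.train()
```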
Great questions.
```
def __init__(self, ...):
    self.pretrained_model = SomeModel.load_from_...()
    self.pretrained_model.freeze()
    self.finetune_model = ...

def configure_optimizers(self):
    return Adam(self.pretrained_model.parameters(), ...)
```
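A possible way to wire the two pieces together in `forward` (this method is my own sketch, not part of the snippet above):
```
def forward(self, x):
    # features come from the frozen pre-trained backbone
    features = self.pretrained_model(x)
    # only the small model on top is actually trained
    return self.finetune_model(features)
```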
Hey William. I am a bit confused by that response. Let me be clearer about the process.
I have defined my model as
```
def __build_model(self):
    self.layer_groups = OrderedDict()

    self.resnet = nn.Sequential(
        *list(models.resnet50(pretrained=True).children())[:-2]
    )
    self.layer_groups["resnet"] = self.resnet

    self.classifier_head = nn.Sequential(
        *[
            nn.AdaptiveAvgPool2d(output_size=1),
            nn.AdaptiveMaxPool2d(output_size=1),
            nn.Flatten(),
            nn.BatchNorm1d(
                2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
            ),
            nn.Dropout(p=0.25),
            nn.Linear(2048, 512, bias=True),
            nn.ReLU(),
            nn.BatchNorm1d(
                512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
            ),
            nn.Dropout(p=0.5),
            nn.Linear(512, 2, bias=True),
        ]
    )
    self.layer_groups["classifier_head"] = self.classifier_head
```
I would like to train this full network with self.resnet frozen for a few epochs. Then unfreeze this network and train the whole model some more.
To accomplish this, I added the following methods:
```
def freeze_to(self, n: int, exclude_types=(nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)) -> None:
    """Freeze layer groups up to group n.

    Look at each group and freeze each parameter, except for the excluded types.
    """
    print("freezing", f"freeze_to model called, level requested {n}")

    def set_requires_grad_for_module(module: nn.Module, requires_grad: bool):
        "Sets each parameter in the module to the `requires_grad` value"
        params = list(module.parameters())
        for param in params:
            param.requires_grad = requires_grad

    # layer_groups is an ordered dict; get the keys up to the nth group
    for group_key in list(self.layer_groups)[:n]:
        group = self.layer_groups[group_key]
        for layer in group:
            if not isinstance(layer, exclude_types):
                set_requires_grad_for_module(layer, False)

    # everything from group n onwards stays trainable
    for group_key in list(self.layer_groups)[n:]:
        group = self.layer_groups[group_key]
        set_requires_grad_for_module(group, True)

def freeze(self) -> None:
    self.freeze_to(len(self.layer_groups))

def unfreeze(self):
    self.freeze_to(0)
```
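For completeness, one way this could be driven during training is sketched below. The hook name (`on_train_epoch_start`), the warm-up epoch count, and the learning rate are assumptions on my part, not part of the code above:
```
# assumes `import torch` at the top of the module
def __init__(self):
    super().__init__()
    self.__build_model()  # the method defined above; these sketches live in the same class
    # start with only the resnet group frozen; the classifier head stays trainable
    self.freeze_to(1)

def configure_optimizers(self):
    # simplest option: hand every parameter to the optimizer; frozen parameters
    # (requires_grad=False) never get a gradient, so they are effectively skipped
    return torch.optim.Adam(self.parameters(), lr=1e-3)

def on_train_epoch_start(self):
    # hypothetical schedule: unfreeze everything after 5 warm-up epochs
    if self.current_epoch == 5:
        self.freeze_to(0)
```
An alternative is to build the optimizer from only the currently trainable parameters and add the newly unfrozen ones later via `add_param_group`, which is what the discussion further down covers.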
It seems to me that, in the example you have given, the model will be stuck in eval mode when you try to train, because of the call to model.eval(). I think you may be assuming that the pre-trained model is a LightningModule that has already been trained. Is this the case? If so, that means pre-trained networks need to be wrapped in a LightningModule, which I think goes against the idea of "just PyTorch."
Hi, I too would be interested in a step-by-step 'tutorial' for doing transfer learning with Pytorch-Lightning. Is this something that might be added to the docs?
@LucFrachon would you be interested in creating such a tutorial?
@jeremyjordan @awaelchli pls ^^
I'm not sure I am fully qualified for this :-)
I believe such a tutorial should at least cover:

- loading a pre-trained backbone and adding a new head on top of it,
- freezing the backbone and training only the head for a few epochs,
- unfreezing the backbone and continuing training, ideally with different learning rates per parameter group.

I can see how to do these things individually, but I struggle to see how to integrate them elegantly in a LightningModule...
@Borda @LucFrachon I was working on something very similar these last few days.
What I want is to have a pre-trained feature extractor (say ResNet50) and be able to do the following (part of an ongoing work to reproduce the results from a research paper):
@jbschiratti nice example, what dataset are you using?
https://gist.github.com/jbschiratti/e93f1ff9cc518a93769101044160d64d#file-fine_tuning-py-L244-L260
`from torchvision.datasets import FakeData`. I used this fake dataset just as a proof of concept.
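For anyone curious, FakeData drops in like any other torchvision dataset; the sizes below are arbitrary:
```
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import FakeData

# random images shaped like ImageNet inputs, with two classes to match the head above
dataset = FakeData(size=256, image_size=(3, 224, 224), num_classes=2,
                   transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```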
cool, mind sending PR as a lightning example?
just thinking it would be nice to have it run on a real (small) dataset...
cc: @PyTorchLightning/core-contributors @williamFalcon
Sure! But I think it would be more relevant with some real images (the dataset used in this example, for instance). Don't you think?
well, real-world examples would be nice, but we still need to stay in kind of minimal mode; we do not want a user to download a couple of GB of data just for an example... BUT your Ants-Bees dataset looks good
Hi, I created a similar example using fastai and pytorch-lightning. It might be useful for someone: https://github.com/sairahul/mlexperiments/blob/master/pytorch-lightning/fine_tuning_example.py
@Borda I eventually made a PR with a slightly modified version of the example I proposed.
Thank you for the tutorial! This is much needed. I have a small question: is there a reason why the frozen parameters cannot be added to the optimizer from the beginning and instead have to be added at specific epochs? AFAIK, if .requires_grad is False, the optimizer will ignore the parameter.
@lizhitwo Sure! I added the parameters separately to emphasize the idea that these parameters were not trained before a given epoch. AFAIK, what you're proposing would work as well!
@jbschiratti Thanks for the explanation!
@jbschiratti (and maybe someone else), is there a big advantage to either approach? I understand you did it this way to emphasize the approach, but in general, which would be more optimal?
I came across a discussion on the PyTorch forum (https://discuss.pytorch.org/t/passing-a-subset-of-the-parameters-to-an-optimizer-equivalent-to-setting-requires-grad-of-subset-only-to-true/42866) where it is suggested that passing all the parameters to the optimizer and then marking the frozen ones with requires_grad=False prevents the gradients from being computed and so saves some memory. Not sure if this is still relevant, as the discussion is a year old...
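For concreteness, a minimal sketch of that "pass everything up front" variant; `model` here stands for the module defined earlier in the thread, and the optimizer and learning rate are placeholders:
```
import torch

# freeze the backbone up front; the optimizer still sees every parameter
for param in model.resnet.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# later, unfreezing is just flipping the flag back; no optimizer changes are needed
for param in model.resnet.parameters():
    param.requires_grad = True
```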
@hcjghr The point of having parameter groups (and adding the parameters sequentially as we unfreeze them) is to allow for different learning rates. I updated the example in #1564. If you had a single parameter group in your optimizer, you might not be able to use such training strategies.
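In other words, something along these lines; `model` again stands for the module defined earlier, and the learning rates are made-up values chosen only to illustrate the discriminative setting:
```
import torch

# start by optimizing only the classifier head
optimizer = torch.optim.Adam(model.classifier_head.parameters(), lr=1e-3)

# ...later, when the backbone is unfrozen, add it as a second
# parameter group with a smaller learning rate
for param in model.resnet.parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": model.resnet.parameters(), "lr": 1e-5})
```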
@jbschiratti I completely understand it now and I definitely agree this is the best way to have it in the example to show different possibilities. Thanks for the quick explanation.
Hey, I'm trying to do something similar with some categorical data, where I want to freeze the pre-trained model for the initial training and then train the full network slightly. I struggled to find the example, as it's not mentioned in the docs but hidden away in the GitHub repo. The README.md for the parent directory doesn't mention this example either, so it's not very visible. That said, I think it is a super useful example for transfer learning!