I am trying to train a pre-trained resnet50 with a small network on top of it: first with the resnet50 frozen, then unfreezing and training the whole network.
My questions are:
1 - Where is the most appropriate place in the framework to create parameter groups?
2 - Does it make sense to add options to freeze/unfreeze to support selectively freezing groups?
Regarding the current implementation of freeze/unfreeze, it has the side effect of setting the state of the model to eval/train. This seems inappropriate. If this is of interest, I am happy to make a pull request.
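For reference, the behavior I'm referring to is roughly the following; this is a sketch based on the description above, not the actual Lightning source:
```
def freeze(self) -> None:
    # disable gradients for every parameter...
    for param in self.parameters():
        param.requires_grad = False
    # ...and, as a side effect, also switch the whole module to eval mode
    self.eval()

def unfreeze(self) -> None:
    for param in self.parameters():
        param.requires_grad = True
    # symmetric side effect: the module is put back into train mode
    self.train()
```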
Great questions.
```
def __init__(self, ...):
    self.pretrained_model = SomeModel.load_from_...()
    self.pretrained_model.freeze()
    self.finetune_model = ...

def configure_optimizers(self):
    return Adam(self.pretrained_model.parameters(), ...)
```
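A possible way to wire the two pieces together in `forward` (this method is my own sketch, not part of the snippet above):
```
def forward(self, x):
    # features come from the frozen pre-trained backbone
    features = self.pretrained_model(x)
    # only the small model on top is actually trained
    return self.finetune_model(features)
```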
Hey William. I am a bit confused by that response. Let me be clearer about the process.
I have defined my model as
```
def __build_model(self):
    self.layer_groups = OrderedDict()

    self.resnet = nn.Sequential(
        *list(models.resnet50(pretrained=True).children())[:-2]
    )
    self.layer_groups["resnet"] = self.resnet

    self.classifier_head = nn.Sequential(
        *[
            nn.AdaptiveAvgPool2d(output_size=1),
            nn.AdaptiveMaxPool2d(output_size=1),
            nn.Flatten(),
            nn.BatchNorm1d(
                2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
            ),
            nn.Dropout(p=0.25),
            nn.Linear(2048, 512, bias=True),
            nn.ReLU(),
            nn.BatchNorm1d(
                512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
            ),
            nn.Dropout(p=0.5),
            nn.Linear(512, 2, bias=True),
        ]
    )
    self.layer_groups["classifier_head"] = self.classifier_head
```
I would like to train this full network with self.resnet frozen for a few epochs. Then unfreeze this network and train the whole model some more.
To accomplish this, I added the following methods:
```
def freeze_to(self, n: int, exclude_types=(nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)) -> None:
    """Freeze layer groups up to group n.

    Look at each group and freeze each parameter, except for the excluded types.
    """
    print("freezing", f"freeze_to model called, level requested {n}")

    def set_requires_grad_for_module(module: nn.Module, requires_grad: bool):
        "Sets each parameter in the module to the `requires_grad` value"
        params = list(module.parameters())
        for param in params:
            param.requires_grad = requires_grad

    # layer_groups is an ordered dict; get the keys up to the nth group
    for group_key in list(self.layer_groups)[:n]:
        group = self.layer_groups[group_key]
        for layer in group:
            if not isinstance(layer, exclude_types):
                set_requires_grad_for_module(layer, False)

    # everything from group n onwards stays trainable
    for group_key in list(self.layer_groups)[n:]:
        group = self.layer_groups[group_key]
        set_requires_grad_for_module(group, True)

def freeze(self) -> None:
    self.freeze_to(len(self.layer_groups))

def unfreeze(self):
    self.freeze_to(0)
```
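For completeness, one way this could be driven during training is sketched below. The hook name (`on_train_epoch_start`), the warm-up epoch count, and the learning rate are assumptions on my part, not part of the code above:
```
# assumes `import torch` at the top of the module
def __init__(self):
    super().__init__()
    self.__build_model()  # the method defined above; these sketches live in the same class
    # start with only the resnet group frozen; the classifier head stays trainable
    self.freeze_to(1)

def configure_optimizers(self):
    # simplest option: hand every parameter to the optimizer; frozen parameters
    # (requires_grad=False) never get a gradient, so they are effectively skipped
    return torch.optim.Adam(self.parameters(), lr=1e-3)

def on_train_epoch_start(self):
    # hypothetical schedule: unfreeze everything after 5 warm-up epochs
    if self.current_epoch == 5:
        self.freeze_to(0)
```
An alternative is to build the optimizer from only the currently trainable parameters and add the newly unfrozen ones later via `add_param_group`, which is what the discussion further down covers.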
It seems to me that, in the example you have given, the model will be stuck in eval mode when you try to train, because of the call to model.eval(). I think you may be assuming that the pre-trained model is a LightningModule that has already been trained. Is this the case? If so, that means pre-trained networks need to be wrapped in a LightningModule, which I think goes against the idea of "just PyTorch."
Hi, I too would be interested in a step-by-step 'tutorial' for doing transfer learning with Pytorch-Lightning. Is this something that might be added to the docs?
@LucFrachon would you be interested in creating such a tutorial?
@jeremyjordan @awaelchli pls ^^
I'm not sure I am fully qualified for this :-)
I believe such a tutorial should at least cover:

- loading a pre-trained backbone and adding a new head on top of it,
- freezing the backbone and training only the head for a few epochs,
- unfreezing the backbone and continuing training, ideally with different learning rates per parameter group.

I can see how to do these things individually, but I struggle to see how to integrate them elegantly in a LightningModule...
@Borda @LucFrachon I was working on something very similar these last few days.
What I want is to have a pre-trained feature extractor (say ResNet50) and be able to do the following (part of an ongoing work to reproduce the results from a research paper):
@jbschiratti nice example, what dataset are you using?
https://gist.github.com/jbschiratti/e93f1ff9cc518a93769101044160d64d#file-fine_tuning-py-L244-L260
`from torchvision.datasets import FakeData`. I used this fake dataset just as a proof of concept.
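For anyone curious, FakeData drops in like any other torchvision dataset; the sizes below are arbitrary:
```
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import FakeData

# random images shaped like ImageNet inputs, with two classes to match the head above
dataset = FakeData(size=256, image_size=(3, 224, 224), num_classes=2,
                   transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```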
cool, mind sending PR as a lightning example?
just thinking it would be nice to have it run on a real (small) dataset...
cc: @PyTorchLightning/core-contributors @williamFalcon
Sure! But I think it would be more relevant with some real images (the dataset used in this example, for instance). Don't you think?
well, real-world examples would be nice, but we still need to stay in kind of minimal mode; we do not want a user to download a couple of GB of data just for an example... BUT your Ants-Bees dataset looks good
Hi, I created a similar example using fastai and pytorch-lightning. It might be useful for someone: https://github.com/sairahul/mlexperiments/blob/master/pytorch-lightning/fine_tuning_example.py
@Borda I eventually made a PR with a slightly modified version of the example I proposed.
Thank you for the tutorial! This is much needed. I have a small question: is there a reason why the frozen parameters cannot be added to the optimizer from the beginning and instead have to be added at specific epochs? AFAIK, if .requires_grad is False, the optimizer will ignore the parameter.
@lizhitwo Sure! I added the parameters separately to emphasize the idea that these parameters were not trained before a given epoch. AFAIK, what you're proposing would work as well!
@jbschiratti Thanks for the explanation!
@jbschiratti (and maybe someone else), is there a big advantage to either approach? I understand you did it this way to emphasize the approach, but in general, which would be more optimal?
I came across a discussion on the PyTorch forum (https://discuss.pytorch.org/t/passing-a-subset-of-the-parameters-to-an-optimizer-equivalent-to-setting-requires-grad-of-subset-only-to-true/42866) where it is suggested that passing all the parameters to the optimizer and then marking the frozen ones with requires_grad=False prevents the gradients from being computed and so saves some memory. Not sure if this is still relevant, as the discussion is a year old...
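For concreteness, a minimal sketch of that "pass everything up front" variant; `model` here stands for the module defined earlier in the thread, and the optimizer and learning rate are placeholders:
```
import torch

# freeze the backbone up front; the optimizer still sees every parameter
for param in model.resnet.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# later, unfreezing is just flipping the flag back; no optimizer changes are needed
for param in model.resnet.parameters():
    param.requires_grad = True
```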
@hcjghr The point of having parameter groups (and adding the parameters sequentially as we unfreeze them) is to allow for different learning rates. I updated the example in #1564. If you had a single parameter group in your optimizer, you might not be able to use such training strategies.
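In other words, something along these lines; `model` again stands for the module defined earlier, and the learning rates are made-up values chosen only to illustrate the discriminative setting:
```
import torch

# start by optimizing only the classifier head
optimizer = torch.optim.Adam(model.classifier_head.parameters(), lr=1e-3)

# ...later, when the backbone is unfrozen, add it as a second
# parameter group with a smaller learning rate
for param in model.resnet.parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": model.resnet.parameters(), "lr": 1e-5})
```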
@jbschiratti I completely understand it now and I definitely agree this is the best way to have it in the example to show different possibilities. Thanks for the quick explanation.
Hey, I'm trying to do something similar with some categorical data, where I want to freeze the pre-trained model for the initial training and then train the full network slightly. I struggled to find the example, as it's not mentioned in the docs but hidden away in the GitHub repo. The README.md for the parent directory doesn't mention this example either, so it's not very visible. That said, I think it is a super useful example for transfer learning!