Pytorch-lightning: Differential learning rates and parameter groups

Created on 29 May 2020 · 10 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Implement a class (possibly a ModuleMixin) that makes it easy for the user to define parameter groups (PGs) for their module.
Once we have the PGs, we can start semi-automatically handling the differential learning rates (it is left to the user to define their values).

Motivation

Improve the transfer learning workflow

Pitch

The use of differential learning rates is essential when working with transfer learning.

Additional context

Once implemented, differential learning rates should look something like this:
[Image: per-parameter-group learning-rate schedules]

In the above image, each parameter group follows the OneCycleLR schedule, but with a different max_lr. Earlier layers train with a lower learning rate than the new ones.
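For context, a minimal sketch of what this looks like at the plain PyTorch level today, since OneCycleLR already accepts one max_lr per parameter group (the backbone/head split and the learning-rate values below are illustrative assumptions):

import torch
from torch import nn

# Hypothetical two-part model: a pretrained backbone and a freshly initialized head.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
backbone, head = model[:2], model[2:]

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},  # earlier layers: lower lr
    {"params": head.parameters(), "lr": 1e-3},      # new layers: higher lr
])

# OneCycleLR takes one max_lr per parameter group, so every group follows the
# same schedule shape but peaks at its own maximum learning rate.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=[1e-5, 1e-3], total_steps=1000
)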

Labels: enhancement, help wanted, won't fix

All 10 comments

Differential learning rates can easily be implemented within configure_optimizers using param_groups. Why is a new class required?

@rohitgr7 can you give an example?

Differential learning rates are such a common necessity when doing transfer learning that it would be great to have a helper class for them.

This class would only require the user to define the param_groups; it would then automatically handle a lot of common functionality, like freezing up to a specific layer group or defining a schedule with different parameters for each group.

This will also be very helpful when we are switching phases (e.g. finished training the head -> start training the entire model), since in that scenario we normally need to reset the schedulers and redefine the learning rates.

It's true that the user can always do all of this by hand, but this, together with some other classes (that I'll point out soon), is going to eliminate a lot of boilerplate.
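To make the idea concrete, here is a minimal sketch of what such a mixin could look like; the class name ParamGroupsMixin and the methods param_groups, freeze_to and grouped_parameters are all hypothetical names, not an existing Lightning API:

class ParamGroupsMixin:
    # Hypothetical mixin: the user defines the parameter groups once and the
    # mixin derives freezing and per-group learning rates from that definition.

    def param_groups(self):
        # Overridden by the user to return an ordered list of nn.Modules,
        # from the earliest layers (backbone) to the latest (head).
        raise NotImplementedError

    def freeze_to(self, n):
        # Freeze every parameter group before index n, unfreeze the rest.
        for i, group in enumerate(self.param_groups()):
            for p in group.parameters():
                p.requires_grad = i >= n

    def grouped_parameters(self, lrs):
        # Build optimizer param_groups with one learning rate per group.
        return [
            {"params": g.parameters(), "lr": lr}
            for g, lr in zip(self.param_groups(), lrs)
        ]

With something like this, configure_optimizers could reduce to a single call such as torch.optim.AdamW(self.grouped_parameters([1e-5, 1e-4, 1e-3])).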

This is the code I used once while implementing differential learning rates for a BERT model.

def configure_optimizers(self):
    # self is the LightningModule, so we split its own parameters by name:
    # anything whose name contains 'bert' belongs to the pretrained backbone.
    params = list(self.named_parameters())

    def is_backbone(name):
        return 'bert' in name

    # The backbone trains at the base lr, the freshly initialized head at 100x that.
    # args is the script-level argparse namespace holding the hyperparameters.
    grouped_parameters = [
        {"params": [p for n, p in params if is_backbone(n)], "lr": args.lr},
        {"params": [p for n, p in params if not is_backbone(n)], "lr": args.lr * 100},
    ]

    optimizer = torch.optim.AdamW(
        grouped_parameters, lr=args.lr, weight_decay=0
    )

    return optimizer

Now I think a class would be a better option if we want freezing and unfreezing functionality for param groups. A method such as update_group_params_hyperparams within that class could also be a good way to update all the hyperparameters of each param_group, not just the learning rate.
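A minimal sketch of what such an update could look like, operating directly on the optimizer's param_groups (the helper name and the group index used in the example call are assumptions):

def update_group_hyperparams(optimizer, group_idx, **hyperparams):
    # Overwrite hyperparameters (lr, betas, weight_decay, ...) of a single param group.
    # optimizer.param_groups is just a list of dicts, so an in-place update is enough.
    optimizer.param_groups[group_idx].update(hyperparams)

# Example: lower the backbone lr and change its betas before fine-tuning,
# assuming group 0 is the backbone group built in configure_optimizers above.
update_group_hyperparams(optimizer, 0, lr=1e-5, betas=(0.9, 0.99))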

@rohitgr7
thoughts on this?

https://github.com/PyTorchLightning/pytorch-lightning/pull/2007#issuecomment-636271555

@lgvaz let's flesh out the full API and some examples here before making code changes/a PR, since this needs to be thought out more.

@williamFalcon I think freeze and unfreeze should be methods on the Trainer or the LightningModule (or a ModuleMixin) instead of a scheduler, because that gives more fine-grained control: I can freeze or unfreeze different param groups as needed, and I can decide how many epochs to fine-tune the model for after analyzing the results of training the head first. I think in the case of a scheduler you are expecting to call trainer.fit only once, so that the backbone gets unfrozen once training goes past the epoch we specified in the FineTuneScheduler. But I was thinking of doing transfer learning and fine-tuning in two different phases (sketched in the code after the list). Something like:

  1. Trainer.fit in phase 1 to train the head.
  2. freeze or unfreeze using the suggested methods and update the hyperparams (lr, betas).
  3. Trainer.fit again to fine-tune in phase 2.
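A rough sketch of that two-phase flow with today's API (model is assumed to be a LightningModule; freeze_backbone/unfreeze_backbone are hypothetical helpers, and the epoch counts and learning rate are placeholders):

import pytorch_lightning as pl

# Phase 1: train only the new head while the backbone is frozen.
model.freeze_backbone()             # hypothetical helper: sets requires_grad=False on the backbone
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)

# Phase 2: unfreeze everything, lower the learning rate, and fine-tune.
model.unfreeze_backbone()           # hypothetical helper: sets requires_grad=True again
model.lr = 1e-5                     # assumes configure_optimizers reads self.lr
trainer = pl.Trainer(max_epochs=5)  # a fresh Trainer rebuilds optimizers and schedulers
trainer.fit(model)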

Fine-tuning gives better results, but not always. If we specify the epoch and pass a FineTuneScheduler, it will do the fine-tuning even when it's not required. Although we are saving the model after each epoch, so we won't lose the weights, training the head and fine-tuning in separate phases will still be a bit easier to work with, I think.

Also, the freeze and unfreeze methods should take either an integer or a list of integers. For example, in the case of freeze: if it's a positive integer, say 2, freeze the last 2 param_groups; if it's a negative integer, say -2, freeze the first 2 param_groups; and if it's a list of integers, freeze the corresponding param_groups (a minimal sketch of that signature follows below).
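The sketch below assumes param_groups is an ordered list of nn.Modules, as in the mixin sketched earlier; the standalone freeze helper is hypothetical:

def freeze(param_groups, which):
    # Sign convention suggested above:
    #   positive int n  -> freeze the last n param groups
    #   negative int -n -> freeze the first n param groups
    #   list of ints    -> freeze exactly those param groups
    if isinstance(which, int) and which > 0:
        indices = range(len(param_groups) - which, len(param_groups))
    elif isinstance(which, int):
        indices = range(-which)
    else:
        indices = which
    for i in indices:
        for p in param_groups[i].parameters():
            p.requires_grad = False

# freeze(groups, 2)      -> freeze the last two groups
# freeze(groups, -2)     -> freeze the first two groups
# freeze(groups, [0, 3]) -> freeze groups 0 and 3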

Also, we could have a method to update the hyperparams of all param_groups when moving from training the head to training the full model.

The exact workflow of this approach depends on where we want to put these freeze and unfreeze methods: on the Trainer or on the LightningModule.

I agree with @rohitgr7 on this one; I think giving control of freezing/unfreezing to a scheduler mixes responsibilities.

I also think the freeze and unfreeze methods should take an integer, but for me it is more natural to use the signs the other way around: to freeze the last two layer groups we would say freeze(-2), which is more similar to how you index elements in a list.

Currently the freeze method also calls .eval(); this feels unnatural to me. Any thoughts on changing that?

Now, on to the actual training loop, or "how to do the phase switches" (as I like to think of it).

I'm heavily biased toward doing the training as @rohitgr7 suggested, with separate calls to fit; the flow most comfortable to me is:

model.freeze()
trainer.fit_one_cycle(model, n_epochs=2, lr=1e-3, pct_start=0.9)
model.unfreeze()
trainer.fit_one_cycle(model, n_epochs=5, lr=slice(5e-6, 5e-4), pct_start=0.2)

This is exactly the flow in fastai; this way of training models is excellent for iterative training, like in a notebook or a REPL.

fit_one_cycle assumes that we are using the OneCycleLR scheduler, that each call is a continuation of the last, and that we want to reset our schedule on each call.

When we pass a slice to lr, we are asking for an interpolation of values across the trainable layer groups.
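A minimal sketch of how such a slice could be expanded into one learning rate per trainable group, spreading the values multiplicatively between the endpoints (the helper name lrs_from_slice is hypothetical, and this only approximates what fastai does):

import numpy as np

def lrs_from_slice(lr, n_groups):
    # A plain float gives every group the same lr; a slice(start, stop) is
    # interpolated geometrically from the earliest group to the latest one.
    if isinstance(lr, slice):
        return list(np.geomspace(lr.start, lr.stop, num=n_groups))
    return [lr] * n_groups

# lrs_from_slice(slice(5e-6, 5e-4), 3) -> approximately [5e-06, 5e-05, 5e-04]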

Can we move the discussion about the training loop/phase switches to #2006? I'll add some more examples there.

I would like this one to be about parameter groups and differential learning rates only; it got confusing because of the example I previously included in #2007, which dealt with the loop. Heh, sorry 😅

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
