Pytorch-lightning: Differential learning rates and parameter groups

Created on 29 May 2020 · 10 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Implement a class (possibly a ModuleMixin) that makes it easy for the user to define parameter groups (PGs) for their module.
Once we have the PGs, we can start semi-automatically handling the differential learning rates (it is left to the user to define their values).

Motivation

Improve the transfer learning workflow

Pitch

The use of differential learning rates is essential when working with transfer learning.

Additional context

Once implemented, differential learning rates should look something like this:
[Image: per-parameter-group learning-rate schedules]

In the above image, each parameter group follows the OneCycleLR schedule, but with a different max_lr. Earlier layers train with a lower learning rate than the new ones.
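For context, a minimal sketch of what this looks like at the plain PyTorch level today, since OneCycleLR already accepts one max_lr per parameter group (the backbone/head split and the learning-rate values below are illustrative assumptions):

import torch
from torch import nn

# Hypothetical two-part model: a pretrained backbone and a freshly initialized head.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
backbone, head = model[:2], model[2:]

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},  # earlier layers: lower lr
    {"params": head.parameters(), "lr": 1e-3},      # new layers: higher lr
])

# OneCycleLR takes one max_lr per parameter group, so every group follows the
# same schedule shape but peaks at its own maximum learning rate.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=[1e-5, 1e-3], total_steps=1000
)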

Labels: enhancement, help wanted, won't fix

All 10 comments

Differential learning rates can easily be implemented within configure_optimizers using param_groups. Why is a new class required?

@rohitgr7 can you give an example?

Differential learning rates are such a common necessity when doing transfer learning that it would be great to have a helper class for them.

This class would only require the user to define the param_groups; it would then automatically handle a lot of common functionality, like freezing up to a specific layer group or defining a schedule with different parameters for each group.

This will also be very helpful when we are switching phases (e.g. finished training the head -> start training the entire model), since in that scenario we normally need to reset the schedulers and redefine the learning rates.

It's true that the user can always do all of this by hand, but this, together with some other classes (that I'll point out soon), is going to eliminate a lot of boilerplate.
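To make the idea concrete, here is a minimal sketch of what such a mixin could look like; the class name ParamGroupsMixin and the methods param_groups, freeze_to and grouped_parameters are all hypothetical names, not an existing Lightning API:

class ParamGroupsMixin:
    # Hypothetical mixin: the user defines the parameter groups once and the
    # mixin derives freezing and per-group learning rates from that definition.

    def param_groups(self):
        # Overridden by the user to return an ordered list of nn.Modules,
        # from the earliest layers (backbone) to the latest (head).
        raise NotImplementedError

    def freeze_to(self, n):
        # Freeze every parameter group before index n, unfreeze the rest.
        for i, group in enumerate(self.param_groups()):
            for p in group.parameters():
                p.requires_grad = i >= n

    def grouped_parameters(self, lrs):
        # Build optimizer param_groups with one learning rate per group.
        return [
            {"params": g.parameters(), "lr": lr}
            for g, lr in zip(self.param_groups(), lrs)
        ]

With something like this, configure_optimizers could reduce to a single call such as torch.optim.AdamW(self.grouped_parameters([1e-5, 1e-4, 1e-3])).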

This is the code I used once while implementing differential learning rates for a BERT model.

def configure_optimizers(self):
    # self is the LightningModule, so we split its own parameters by name:
    # anything whose name contains 'bert' belongs to the pretrained backbone.
    params = list(self.named_parameters())

    def is_backbone(name):
        return 'bert' in name

    # The backbone trains at the base lr, the freshly initialized head at 100x that.
    # args is the script-level argparse namespace holding the hyperparameters.
    grouped_parameters = [
        {"params": [p for n, p in params if is_backbone(n)], "lr": args.lr},
        {"params": [p for n, p in params if not is_backbone(n)], "lr": args.lr * 100},
    ]

    optimizer = torch.optim.AdamW(
        grouped_parameters, lr=args.lr, weight_decay=0
    )

    return optimizer

Now I think a class would be a better option if we want freezing and unfreezing functionality for param groups. A method such as update_group_params_hyperparams within that class could also be a good way to update all the hyperparameters of each param_group, not just the learning rate.
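A minimal sketch of what such an update could look like, operating directly on the optimizer's param_groups (the helper name and the group index used in the example call are assumptions):

def update_group_hyperparams(optimizer, group_idx, **hyperparams):
    # Overwrite hyperparameters (lr, betas, weight_decay, ...) of a single param group.
    # optimizer.param_groups is just a list of dicts, so an in-place update is enough.
    optimizer.param_groups[group_idx].update(hyperparams)

# Example: lower the backbone lr and change its betas before fine-tuning,
# assuming group 0 is the backbone group built in configure_optimizers above.
update_group_hyperparams(optimizer, 0, lr=1e-5, betas=(0.9, 0.99))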

@rohitgr7
thoughts on this?

https://github.com/PyTorchLightning/pytorch-lightning/pull/2007#issuecomment-636271555

@lgvaz let's flesh out the full API and some examples here before making code changes/a PR, since this needs to be thought out more.

@williamFalcon I think freeze and unfreeze should be methods on the Trainer or the LightningModule (or a ModuleMixin) instead of a scheduler, because that gives more fine-grained control: I can freeze or unfreeze different param groups as needed, and I can decide how many epochs to fine-tune the model for after analyzing the results of training the head first. I think in the case of a scheduler you are expecting to call trainer.fit only once, so that the backbone gets unfrozen once training goes past the epoch we specified in the FineTuneScheduler. But I was thinking of doing transfer learning and fine-tuning in two different phases (sketched in the code after the list). Something like:

  1. Trainer.fit in phase 1 to train the head.
  2. freeze or unfreeze using the suggested methods and update the hyperparams (lr, betas).
  3. Trainer.fit again to fine-tune in phase 2.
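A rough sketch of that two-phase flow with today's API (model is assumed to be a LightningModule; freeze_backbone/unfreeze_backbone are hypothetical helpers, and the epoch counts and learning rate are placeholders):

import pytorch_lightning as pl

# Phase 1: train only the new head while the backbone is frozen.
model.freeze_backbone()             # hypothetical helper: sets requires_grad=False on the backbone
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)

# Phase 2: unfreeze everything, lower the learning rate, and fine-tune.
model.unfreeze_backbone()           # hypothetical helper: sets requires_grad=True again
model.lr = 1e-5                     # assumes configure_optimizers reads self.lr
trainer = pl.Trainer(max_epochs=5)  # a fresh Trainer rebuilds optimizers and schedulers
trainer.fit(model)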

Fine-tuning gives better results, but not always. If we specify the epoch and pass a FineTuneScheduler, it will do the fine-tuning even when it's not required. Although we are saving the model after each epoch, so we won't lose the weights, training the head and fine-tuning in separate phases will still be a bit easier to work with, I think.

Also, the freeze and unfreeze methods should take either an integer or a list of integers. For example, in the case of freeze: if it's a positive integer, say 2, freeze the last 2 param_groups; if it's a negative integer, say -2, freeze the first 2 param_groups; and if it's a list of integers, freeze the corresponding param_groups (a minimal sketch of that signature follows below).
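The sketch below assumes param_groups is an ordered list of nn.Modules, as in the mixin sketched earlier; the standalone freeze helper is hypothetical:

def freeze(param_groups, which):
    # Sign convention suggested above:
    #   positive int n  -> freeze the last n param groups
    #   negative int -n -> freeze the first n param groups
    #   list of ints    -> freeze exactly those param groups
    if isinstance(which, int) and which > 0:
        indices = range(len(param_groups) - which, len(param_groups))
    elif isinstance(which, int):
        indices = range(-which)
    else:
        indices = which
    for i in indices:
        for p in param_groups[i].parameters():
            p.requires_grad = False

# freeze(groups, 2)      -> freeze the last two groups
# freeze(groups, -2)     -> freeze the first two groups
# freeze(groups, [0, 3]) -> freeze groups 0 and 3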

Also, we could have a method to update the hyperparams of all param_groups when moving from training the head to training the full model.

The exact workflow of this approach depends on where we want to put these freeze and unfreeze methods: on the Trainer or on the LightningModule.

I agree with @rohitgr7 on this one; I think giving control of freezing/unfreezing to a scheduler mixes responsibilities.

I also think the freeze and unfreeze methods should take an integer, but for me it is more natural to use the signs the other way around: to freeze the last two layer groups we would say freeze(-2), which is more similar to how you index elements in a list.

Currently the freeze method also calls .eval(); this feels unnatural to me. Any thoughts on changing that?

Now, on to the actual training loop, or "how to do the phase switches" (as I like to think of it).

I'm heavily biased toward doing the training as @rohitgr7 suggested, with separate calls to fit; the flow most comfortable to me is:

model.freeze()
trainer.fit_one_cycle(model, n_epochs=2, lr=1e-3, pct_start=0.9)
model.unfreeze()
trainer.fit_one_cycle(model, n_epochs=5, lr=slice(5e-6, 5e-4), pct_start=0.2)

This is exactly the flow in fastai; this way of training models is excellent for iterative training, like in a notebook or a REPL.

fit_one_cycle assumes that we are using the OneCycleLR scheduler, that each call is a continuation of the last, and that we want to reset our schedule on each call.

When we pass a slice to lr, we are asking for an interpolation of values across the trainable layer groups.
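A minimal sketch of how such a slice could be expanded into one learning rate per trainable group, spreading the values multiplicatively between the endpoints (the helper name lrs_from_slice is hypothetical, and this only approximates what fastai does):

import numpy as np

def lrs_from_slice(lr, n_groups):
    # A plain float gives every group the same lr; a slice(start, stop) is
    # interpolated geometrically from the earliest group to the latest one.
    if isinstance(lr, slice):
        return list(np.geomspace(lr.start, lr.stop, num=n_groups))
    return [lr] * n_groups

# lrs_from_slice(slice(5e-6, 5e-4), 3) -> approximately [5e-06, 5e-05, 5e-04]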

Can we move the discussion about the training loop/phase switches to #2006? I'll add some more examples there.

I would like this one to be about parameter groups and differential learning rates only; it got confusing because of the example I previously included in #2007, which dealt with the loop. Heh, sorry 😅

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
