Is your feature request related to a problem? Please describe.
Often during training, the loss changes rapidly in the first few epochs, so at that point we don't really need gradient accumulation. And sometimes we want to schedule changes to the accumulation factor, for example increasing it every 10 epochs.
Describe the solution you'd like
Let's define a scheduler that will control how the accumulation factor changes:
schedule = {6:2, 11:4}
accumulator = GradientAccumulationScheduler(schedule)
According to this schedule, we fit our model for the first 5 epochs with a factor of 1, the next 5 with a factor of 2, and from then on with a factor of 4.
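A minimal sketch of what such a callback could look like; the accumulate_grad_batches attribute and the on_epoch_begin(epoch, trainer) hook are assumptions for illustration, not an existing API:

# Sketch only: attribute name and hook signature below are assumptions.
class GradientAccumulationScheduler:
    def __init__(self, scheduling: dict):
        # scheduling maps {epoch: accumulation_factor}, e.g. {6: 2, 11: 4}
        if not scheduling:
            raise ValueError("Empty schedule")
        if any(epoch < 1 or factor < 1 for epoch, factor in scheduling.items()):
            raise ValueError("Epochs and factors must be positive integers")
        self.scheduling = scheduling
        self.epochs = sorted(scheduling)

    def on_epoch_begin(self, epoch, trainer):
        # use the factor from the latest scheduled epoch that has been reached
        for e in reversed(self.epochs):
            if epoch >= e:
                trainer.accumulate_grad_batches = self.scheduling[e]
                break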
Describe alternatives you've considered
1) We can interrupt model fitting and manually change the factor, but it's really not user-friendly.
2) We can override on_epoch_begin in the pl_model and change the factor that way (a rough sketch is below).
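For reference, alternative (2) might look roughly like this; the on_epoch_begin hook name and the self.trainer / self.current_epoch / accumulate_grad_batches attributes are assumptions here, not a confirmed API:

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def on_epoch_begin(self):
        # hypothetical hand-written schedule: {epoch: accumulation_factor}
        schedule = {6: 2, 11: 4}
        if self.current_epoch in schedule:
            self.trainer.accumulate_grad_batches = schedule[self.current_epoch]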
Additional context
I have heard about this technique at ML Training from one of the competition winners, so it could be a useful feature
@stas6626 thanks for the PR.
I'm not sure we need a callback for this. What is the advantage over allowing the trainer flag to take on that format?
Trainer(accumulated_gradients='1:2, ...')
Also, I haven't heard of anyone doing it this way. Can you point to examples? Not sure this is a best practice yet.
If you're not sure about the callback, let's discuss the implementation further. We could pass a dict to accumulated_gradients in the format {epoch: accumulation_factor} and control the factor changes with a Trainer method? Any other suggestions?
@stas6626 that paper just shows that decreasing the learning rate and increasing batch size seem to be equivalent (seem to be because the openreview comments aren't great).
In practice most people use LR scheduling (not batch size increase), since we're usually already maxing out the batch size against memory limits.
The accumulated gradients option in Lightning is meant for cases when you're running up against RAM requirements but need an effectively large batch size. It's not meant as an alternative to LR scheduling.
Edit
Spoke with labmates working on BERT and the latest NLP models, and it seems accumulated gradient scheduling is super useful there...
Let's do what you suggested, pass a dict to accumulated_gradients
Trainer(accumulated_gradients={5: 2, 10: 4})
So, what about implementation details? 😂
And define a method for changing the accumulation factor, right?
@stas6626 lol... edited my answer haha... for implementation maybe just use the callback you wrote internally?
class Trainer:
    def __init__(self, accumulated_gradients):
        if isinstance(accumulated_gradients, dict):
            # reuse the scheduler callback internally
            self.accumulated_callback = GradientAccumulationScheduler(accumulated_gradients)
So the external API stays the same, the accumulated_gradients param just gets smarter, and internally we use the scheduler. If it ever gets weird with the dict, it's an easy fix to pull the scheduler out.
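A slightly fuller sketch of that wiring, assuming the trainer calls the callback's on_epoch_begin at the start of every epoch and also accepts a plain int for a constant factor; the method name and hook are illustrative, not the merged implementation, and GradientAccumulationScheduler refers to the callback discussed above:

class Trainer:
    def __init__(self, accumulated_gradients=1):
        if isinstance(accumulated_gradients, dict):
            # scheduled factors: {epoch: accumulation_factor}
            self.accumulated_callback = GradientAccumulationScheduler(accumulated_gradients)
        elif isinstance(accumulated_gradients, int):
            # constant factor: treat it as a one-entry schedule starting at epoch 1
            self.accumulated_callback = GradientAccumulationScheduler({1: accumulated_gradients})
        else:
            raise TypeError("accumulated_gradients must be an int or a dict")

    def _begin_epoch(self, epoch):
        # the callback updates the current accumulation factor before the epoch runs
        self.accumulated_callback.on_epoch_begin(epoch, self)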
OK, got it. I'll do it by the end of this week 😊
great addition! thanks for adding.
@stas6626 awesome work. Merged!