Is your feature request related to a problem? Please describe.
Often during training, the loss changes rapidly in the first few epochs, so at that point we don't really need gradient accumulation. And sometimes we want to schedule changes to the accumulation factor, for example increasing it every 10 epochs.
Describe the solution you'd like
Let's define a scheduler that will control how the accumulation factor changes:
schedule = {6:2, 11:4}
accumulator = GradientAccumulationScheduler(schedule)
According to this schedule, we fit our model for the first 5 epochs with a factor of 1, the next 5 with a factor of 2, and from then on with a factor of 4.
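A minimal sketch of what such a callback could look like; the accumulate_grad_batches attribute and the on_epoch_begin(epoch, trainer) hook are assumptions for illustration, not an existing API:

# Sketch only: attribute name and hook signature below are assumptions.
class GradientAccumulationScheduler:
    def __init__(self, scheduling: dict):
        # scheduling maps {epoch: accumulation_factor}, e.g. {6: 2, 11: 4}
        if not scheduling:
            raise ValueError("Empty schedule")
        if any(epoch < 1 or factor < 1 for epoch, factor in scheduling.items()):
            raise ValueError("Epochs and factors must be positive integers")
        self.scheduling = scheduling
        self.epochs = sorted(scheduling)

    def on_epoch_begin(self, epoch, trainer):
        # use the factor from the latest scheduled epoch that has been reached
        for e in reversed(self.epochs):
            if epoch >= e:
                trainer.accumulate_grad_batches = self.scheduling[e]
                break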
Describe alternatives you've considered
1) We can interrupt model fitting and manually change the factor, but it's really not user-friendly.
2) We can override on_epoch_begin in the pl_model and change the factor that way (a rough sketch is below).
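For reference, alternative (2) might look roughly like this; the on_epoch_begin hook name and the self.trainer / self.current_epoch / accumulate_grad_batches attributes are assumptions here, not a confirmed API:

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def on_epoch_begin(self):
        # hypothetical hand-written schedule: {epoch: accumulation_factor}
        schedule = {6: 2, 11: 4}
        if self.current_epoch in schedule:
            self.trainer.accumulate_grad_batches = schedule[self.current_epoch]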
Additional context
I have heard about this technique at ML Training from one of the competition winners, so it could be a useful feature
@stas6626 thanks for the PR.
I'm not sure we need a callback for this. What is the advantage over allowing the trainer flag to take on that format?
Trainer(accumulated_gradients='1:2, ...')
Also, I haven't heard of anyone doing it this way. Can you point to examples? Not sure this is a best practice yet.
If you're not sure about the callback, let's discuss the implementation further. We could pass a dict to accumulated_gradients in the format {epoch: accumulation_factor} and control the factor changes with a Trainer method? Any other suggestions?
@stas6626 that paper just shows that decreasing the learning rate and increasing batch size seem to be equivalent (seem to be because the openreview comments aren't great).
In practice most people use LR scheduling (not batch size increase), since we're usually already maxing out the batch size against memory limits.
The accumulated gradients option in Lightning is meant for cases when you're running up against RAM requirements but need an effectively large batch size. It's not meant as an alternative to LR scheduling.
Edit
Spoke with labmates working on BERT and the latest NLP models, and it seems accumulated gradient scheduling is super useful there...
Let's do what you suggested, pass a dict to accumulated_gradients
Trainer(accumulated_gradients={5: 2, 10: 4})
So, what about implementation details? 😂
And define a method for changing the accumulation factor, right?
@stas6626 lol... edited my answer haha... for implementation maybe just use the callback you wrote internally?
class Trainer:
    def __init__(self, accumulated_gradients):
        if isinstance(accumulated_gradients, dict):
            # reuse the scheduler callback internally
            self.accumulated_callback = GradientAccumulationScheduler(accumulated_gradients)
So the external API stays the same, the accumulated_gradients param just gets smarter, and internally we use the scheduler. If it ever gets weird with the dict, it's an easy fix to pull the scheduler out.
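A slightly fuller sketch of that wiring, assuming the trainer calls the callback's on_epoch_begin at the start of every epoch and also accepts a plain int for a constant factor; the method name and hook are illustrative, not the merged implementation, and GradientAccumulationScheduler refers to the callback discussed above:

class Trainer:
    def __init__(self, accumulated_gradients=1):
        if isinstance(accumulated_gradients, dict):
            # scheduled factors: {epoch: accumulation_factor}
            self.accumulated_callback = GradientAccumulationScheduler(accumulated_gradients)
        elif isinstance(accumulated_gradients, int):
            # constant factor: treat it as a one-entry schedule starting at epoch 1
            self.accumulated_callback = GradientAccumulationScheduler({1: accumulated_gradients})
        else:
            raise TypeError("accumulated_gradients must be an int or a dict")

    def _begin_epoch(self, epoch):
        # the callback updates the current accumulation factor before the epoch runs
        self.accumulated_callback.on_epoch_begin(epoch, self)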
OK, got it. I'll do it by the end of this week 😊
great addition! thanks for adding.
@stas6626 awesome work. Merged!