Pytorch-lightning: Gradient clipping: norm is always 2, device might be undefined.

Created on 31 Jul 2020 · 8 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I just re-skimmed the gradient clipping operation and found a couple of inconsistencies:
(1) device is only defined for non-'inf' norms, see here. While fixing this, I realized that
(2) norm_type is always 2.
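
For context, the clipping code in question follows roughly this shape (a paraphrased sketch of the pattern described above, not the exact Lightning source; names are approximate):

```python
import math
import torch

def clip_gradients(parameters, gradient_clip_val, eps=1e-6):
    # Paraphrased sketch of the pattern described above, not the actual Lightning code.
    norm_type = 2.0  # (2) hard-coded: the norm is always 2, regardless of what the user wants
    parameters = [p for p in parameters if p.grad is not None]
    max_norm = float(gradient_clip_val)

    if norm_type == math.inf:
        total_norm = max(p.grad.data.abs().max() for p in parameters)
    else:
        device = parameters[0].grad.device  # (1) `device` is only defined in this branch
        total_norm = torch.zeros([], device=device)
        for p in parameters:
            total_norm.add_(p.grad.data.pow(norm_type).sum())
        total_norm = total_norm ** (1.0 / norm_type)

    # `device` is used unconditionally here, so the inf branch would raise a NameError
    clip_coef = torch.tensor(max_norm, device=device) / (total_norm + eps)
    if clip_coef < 1:
        for p in parameters:
            p.grad.data.mul_(clip_coef.to(p.grad.device))
```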

So the two options I see are:
(1) Keep the existing behavior and remove the inf branch --> always apply 2-norm.
(2) Expose norm_type as a trainer argument.
Which option do you prefer?
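
For illustration, option 2 could look something like this on the user side (the `gradient_clip_norm_type` name here is hypothetical, not an existing Trainer argument):

```python
from pytorch_lightning import Trainer

# hypothetical: expose the norm type next to the existing gradient_clip_val argument
trainer = Trainer(gradient_clip_val=0.5, gradient_clip_norm_type=float("inf"))
```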

I had also assumed that test_grad_norm.py tests gradient clipping, but it actually tests gradient norm tracking, so I'd add a corresponding test case to the PR that fixes the issue above ;)

(Skipping reproducibility and version info as this is easy to observe from looking at the code :))

Labels: API / design, discussion, enhancement, help wanted, won't fix

All 8 comments

I would personally take option 2, but let's talk about it a bit more, as option 2 means an API change...
@PyTorchLightning/core-contributors

In my opinion, gradient clipping, gradient norm tracking, and similar things should be callbacks. Like the ModelCheckpoint and EarlyStopping callbacks, they can take various arguments for the advanced user, e.g. the norm type here, but I don't like exposing such options as Trainer args at all.

@PhilJd mind sending a PR implementing it as a callback?

I'll look into it!

Okay, after taking a look at how Callbacks are implemented, I'm not sure what exactly you had in mind; could you outline it in more detail?
More specifically, as far as I understand it, gradient clipping needs to happen after the gradients are potentially unscaled in mixed-precision training, but before the optimizer step. I don't see a corresponding hook for this in callbacks: Callback.on_batch_end is too late, since the optimizer step has already happened. The situation for norm tracking is similar (zero_grad is called before on_batch_end is triggered).
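
For reference, the ordering constraint with native AMP looks like this in plain PyTorch (standard torch.cuda.amp usage, independent of Lightning):

```python
import torch

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    x = torch.randn(8, 10, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()

    # gradients must be unscaled before clipping ...
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)

    # ... and clipping must happen before the optimizer step,
    # which is why a hook at on_batch_end comes too late
    scaler.step(optimizer)
    scaler.update()
```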

Okay, I see. Sorry, I hadn't thought it through carefully and let you discover that; I just thought it would be very clean to implement it like this.

Maybe adding an on_before_backward/on_after_backward callback method would make sense (for general use by callbacks)? Then this could be used to do the clipping.
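
To make that concrete, a clipping callback built on such a hook might look roughly like this (the on_after_backward callback hook is the proposal under discussion here, not an API that exists yet):

```python
import torch
from pytorch_lightning.callbacks import Callback

class GradientClipping(Callback):
    """Hypothetical callback relying on the proposed on_after_backward hook."""

    def __init__(self, clip_val: float, norm_type: float = 2.0):
        self.clip_val = clip_val
        self.norm_type = norm_type

    def on_after_backward(self, trainer, pl_module):
        # would have to run after AMP unscaling and before optimizer.step()
        torch.nn.utils.clip_grad_norm_(
            pl_module.parameters(), self.clip_val, norm_type=self.norm_type
        )
```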

@williamFalcon

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
