Pytorch-lightning: Problems with automatic_optimization=False

Created on 22 Oct 2020  ·  3 comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug


When automatic_optimization = False and terminate_on_nan = True, an exception is raised when checking for nan values. This is due to None being passed in as the value for loss to self.detect_nan_tensors. It looks like the code on master has already changed from what I'm seeing in 1.0.3, so I don't know if this has somehow been fixed or not. The problem seems to be that the AttributeDict returned from train_step has loss=None.
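For illustration, here is a minimal stand-alone sketch of the failure mode. `detect_nan` is a simplified stand-in for the trainer's `detect_nan_tensors` (using `math.isnan` on a plain float rather than `torch.isnan` on a tensor), but the crash mechanism is the same: the check receives `None` instead of a number.

```python
import math

def detect_nan(loss):
    # Simplified stand-in for Trainer.detect_nan_tensors under
    # terminate_on_nan=True: inspect the loss returned by training_step.
    if math.isnan(loss):
        raise ValueError("The loss returned in training_step is nan")

# With automatic_optimization=False, the AttributeDict returned from
# the training step carries loss=None, so the check itself blows up
# with a TypeError before it ever tests for nan:
try:
    detect_nan(None)
except TypeError as exc:
    print("crashed before checking:", exc)
```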

Please reproduce using the BoringModel and post here


https://colab.research.google.com/drive/1qQmP6BwQk--rBXC7W45y0mn6QK39IPcc

Expected behavior

Training should not crash when automatic_optimization = False and terminate_on_nan = True.

Labels: logger, bug / fix, help wanted

All 3 comments

Hi! Thanks for your contribution, great first issue!

I discovered this because the loss was showing up as nan in the progress bar, and I was trying to figure out why I was getting nan. I dug some more, and it looks like this is itself a bug. I've inspected the loss and network parameters over several steps, and there are no nans. So there seems to be a problem in the logging somewhere, that if you're using automatic_optimization=False you get nan being logged as the loss in the progress bar.

Thanks @catalys1, you are correct. However, recent changes on master should have resolved this issue, since the nan check now only runs when using automatic optimization:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L779-L789
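A hedged sketch of that gating (class and method names here are simplified stand-ins, not the actual training-loop code linked above): the nan check is skipped entirely under manual optimization, where training_step is not required to return a loss.

```python
import math

class TrainingLoopSketch:
    # Simplified model of the fixed behavior: the finite-loss check
    # only fires when both terminate_on_nan and automatic
    # optimization are enabled.
    def __init__(self, terminate_on_nan, automatic_optimization):
        self.terminate_on_nan = terminate_on_nan
        self.automatic_optimization = automatic_optimization

    def check_finite(self, loss):
        if self.terminate_on_nan and self.automatic_optimization:
            if loss is None or math.isnan(loss):
                raise ValueError(
                    "The loss returned in training_step is nan"
                )

# Manual optimization: loss=None no longer crashes the check.
TrainingLoopSketch(True, False).check_finite(None)
```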

In #4204 we'll make it clearer in the docs that you should report values yourself within the training step :)
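To illustrate that guidance: under manual optimization, report the loss yourself so the progress bar shows a real number instead of nan. `LoggerStub` below is a mock for illustration only; its `log` method mirrors the shape of `LightningModule.log(name, value, prog_bar=...)` but none of this is Lightning's actual implementation.

```python
class LoggerStub:
    # Toy stand-in for Lightning's logging machinery.
    def __init__(self):
        self.progress_bar_metrics = {}

    def log(self, name, value, prog_bar=False):
        # Values logged with prog_bar=True are routed to the bar.
        if prog_bar:
            self.progress_bar_metrics[name] = value


class ManualModule(LoggerStub):
    def training_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)  # placeholder loss computation
        # ... manual_backward / optimizer.step() would go here ...
        self.log("loss", loss, prog_bar=True)  # report it explicitly


module = ManualModule()
module.training_step([1.0, 3.0], 0)
print(module.progress_bar_metrics)  # prints {'loss': 2.0}
```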

