Pytorch-lightning: Problems with automatic_optimization=False

Created on 22 Oct 2020 · 3 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug


When automatic_optimization = False and terminate_on_nan = True, an exception is raised while checking for NaN values, because None is passed as the loss value to self.detect_nan_tensors. The code on master has already diverged from what I'm seeing in 1.0.3, so I don't know whether this has been fixed in the meantime. The problem seems to be that the AttributeDict returned from the training step has loss=None.
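To make the failure mode concrete, here is a minimal plain-Python stand-in for the NaN check (the function name and body are illustrative, not the actual Lightning source): it assumes loss is a number, so a None loss crashes with a TypeError instead of being skipped.

```python
import math

def detect_nan_values(loss):
    # Illustrative stand-in for the trainer's NaN check: it assumes
    # `loss` is numeric, so passing None raises TypeError rather than
    # being handled gracefully.
    if math.isnan(loss):
        raise ValueError("Detected nan loss")

# With automatic_optimization=False, the training-step output can carry
# loss=None, which reproduces the crash:
try:
    detect_nan_values(None)
except TypeError as exc:
    print(f"Crashes as described: {exc}")
```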

Please reproduce using the BoringModel and post here


https://colab.research.google.com/drive/1qQmP6BwQk--rBXC7W45y0mn6QK39IPcc

Expected behavior

Don't crash when automatic_optimization = False and terminate_on_nan = True.

Logger bug / fix help wanted

All 3 comments

Hi! Thanks for your contribution, great first issue!

I discovered this because the loss was showing up as nan in the progress bar, and I was trying to figure out why. I dug some more, and it looks like this is itself a bug: I've inspected the loss and the network parameters over several steps, and there are no NaNs. So there seems to be a problem somewhere in the logging, such that if you're using automatic_optimization=False, nan gets logged as the loss in the progress bar.

Thanks @catalys1, you are correct. However, recent changes should have resolved this issue, since the NaN check now only runs when using automatic optimization:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L779-L789
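A minimal sketch of the guard described above (illustrative names and structure, not the actual training_loop.py code): the check is skipped entirely unless automatic optimization is on and a loss was actually returned.

```python
import math

def maybe_check_nan(loss, automatic_optimization, terminate_on_nan):
    # Illustrative guard: the NaN check only fires under automatic
    # optimization, so a manual-optimization step that yields loss=None
    # no longer crashes. Names here are hypothetical, not Lightning's own.
    if terminate_on_nan and automatic_optimization and loss is not None:
        if math.isnan(loss):
            raise ValueError("Detected nan loss")

# Manual optimization with loss=None: no exception is raised.
maybe_check_nan(None, automatic_optimization=False, terminate_on_nan=True)
print("ok")
```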

In #4204 we'll make it clearer that you should report values within the training step via the docs :)
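As a sketch of that recommendation (plain Python with a stub standing in for LightningModule, so nothing below is the actual Lightning API): with manual optimization, log the loss explicitly inside training_step rather than relying on the returned value, so the progress bar shows a real number instead of nan.

```python
class ModuleStub:
    """Toy stand-in for a LightningModule; `log` mimics self.log()."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value, prog_bar=False):
        # Stand-in for LightningModule.log: records what would reach
        # the progress bar / logger.
        self.logged[name] = value

    def training_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)          # toy loss computation
        # manual_backward(...) / optimizer.step() would go here
        self.log("loss", loss, prog_bar=True)   # report the value explicitly

m = ModuleStub()
m.training_step([1.0, 2.0, 3.0], 0)
print(m.logged["loss"])  # 2.0
```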

