Pytorch-lightning: Problems with automatic_optimization=False

Created on 22 Oct 2020  ·  3 comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug


When automatic_optimization = False and terminate_on_nan = True, an exception is raised when checking for nan values. This is due to None being passed in as the value for loss to self.detect_nan_tensors. It looks like the code on master has already changed from what I'm seeing in 1.0.3, so I don't know if this has somehow been fixed or not. The problem seems to be that the AttributeDict returned from train_step has loss=None.
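For illustration, here is a minimal stand-alone sketch of the failure mode. `detect_nan` is a simplified stand-in for the trainer's `detect_nan_tensors` (using `math.isnan` on a plain float rather than `torch.isnan` on a tensor), but the crash mechanism is the same: the check receives `None` instead of a number.

```python
import math

def detect_nan(loss):
    # Simplified stand-in for Trainer.detect_nan_tensors under
    # terminate_on_nan=True: inspect the loss returned by training_step.
    if math.isnan(loss):
        raise ValueError("The loss returned in training_step is nan")

# With automatic_optimization=False, the AttributeDict returned from
# the training step carries loss=None, so the check itself blows up
# with a TypeError before it ever tests for nan:
try:
    detect_nan(None)
except TypeError as exc:
    print("crashed before checking:", exc)
```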

Please reproduce using the BoringModel and post here


https://colab.research.google.com/drive/1qQmP6BwQk--rBXC7W45y0mn6QK39IPcc

Expected behavior

Training should not crash when automatic_optimization = False and terminate_on_nan = True.

Labels: logger, bug / fix, help wanted

All 3 comments

Hi! Thanks for your contribution, great first issue!

I discovered this because the loss was showing up as nan in the progress bar, and I was trying to figure out why I was getting nan. I dug some more, and it looks like this is itself a bug. I've inspected the loss and network parameters over several steps, and there are no nans. So there seems to be a problem in the logging somewhere, that if you're using automatic_optimization=False you get nan being logged as the loss in the progress bar.

Thanks @catalys1, you are correct. However, recent changes on master should have resolved this issue, since the nan check now only runs when using automatic optimization:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L779-L789
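A hedged sketch of that gating (class and method names here are simplified stand-ins, not the actual training-loop code linked above): the nan check is skipped entirely under manual optimization, where training_step is not required to return a loss.

```python
import math

class TrainingLoopSketch:
    # Simplified model of the fixed behavior: the finite-loss check
    # only fires when both terminate_on_nan and automatic
    # optimization are enabled.
    def __init__(self, terminate_on_nan, automatic_optimization):
        self.terminate_on_nan = terminate_on_nan
        self.automatic_optimization = automatic_optimization

    def check_finite(self, loss):
        if self.terminate_on_nan and self.automatic_optimization:
            if loss is None or math.isnan(loss):
                raise ValueError(
                    "The loss returned in training_step is nan"
                )

# Manual optimization: loss=None no longer crashes the check.
TrainingLoopSketch(True, False).check_finite(None)
```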

In #4204 we'll make it clearer in the docs that you should report values yourself within the training step :)
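To illustrate that guidance: under manual optimization, report the loss yourself so the progress bar shows a real number instead of nan. `LoggerStub` below is a mock for illustration only; its `log` method mirrors the shape of `LightningModule.log(name, value, prog_bar=...)` but none of this is Lightning's actual implementation.

```python
class LoggerStub:
    # Toy stand-in for Lightning's logging machinery.
    def __init__(self):
        self.progress_bar_metrics = {}

    def log(self, name, value, prog_bar=False):
        # Values logged with prog_bar=True are routed to the bar.
        if prog_bar:
            self.progress_bar_metrics[name] = value


class ManualModule(LoggerStub):
    def training_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)  # placeholder loss computation
        # ... manual_backward / optimizer.step() would go here ...
        self.log("loss", loss, prog_bar=True)  # report it explicitly


module = ManualModule()
module.training_step([1.0, 3.0], 0)
print(module.progress_bar_metrics)  # prints {'loss': 2.0}
```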

