When `automatic_optimization=False` and `terminate_on_nan=True`, an exception is raised during the NaN check. This is because `None` is passed as the `loss` value to `self.detect_nan_tensors`. The code on master has already changed from what I'm seeing in 1.0.3, so I don't know whether this has somehow been fixed already. The root of the problem seems to be that the `AttributeDict` returned from `training_step` has `loss=None`.
https://colab.research.google.com/drive/1qQmP6BwQk--rBXC7W45y0mn6QK39IPcc
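For reference, here is a minimal sketch of the setup described above (the linked Colab is the original reproduction). The model and data are placeholders, and it assumes the 1.0.x API where `automatic_optimization` and `terminate_on_nan` are Trainer flags and manual optimization uses `self.optimizers()` / `self.manual_backward()`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # Manual optimization: step the optimizer ourselves and deliberately
        # return nothing, which leaves the internally tracked loss as None.
        opt = self.optimizers()
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        self.manual_backward(loss, opt)
        opt.step()
        opt.zero_grad()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


ds = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
trainer = pl.Trainer(
    max_steps=5,
    automatic_optimization=False,  # manual optimization (Trainer flag in 1.0.x)
    terminate_on_nan=True,         # NaN check then receives loss=None and raises
)
trainer.fit(ManualOptModel(), DataLoader(ds, batch_size=8))
```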
Don't crash when automatic_optimization = False and terminate_on_nan = True
Hi! Thanks for your contribution, great first issue!
I discovered this because the loss was showing up as nan in the progress bar, and I was trying to figure out why I was getting nan. I dug some more, and it looks like this is itself a bug. I've inspected the loss and network parameters over several steps, and there are no nans. So there seems to be a problem in the logging somewhere: if you're using `automatic_optimization=False`, nan gets logged as the loss in the progress bar.
Thanks @catalys1, you are correct. However, recent changes should have resolved this issue, since the NaN check now only runs when using automatic optimization.
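For illustration only, a sketch of the kind of guard this amounts to; the function name and exact condition here are assumptions, not the actual code on master:

```python
import torch


def check_loss_for_nan(loss, automatic_optimization: bool, terminate_on_nan: bool):
    # Sketch: only inspect the loss when Lightning produced it (automatic
    # optimization). Under manual optimization the returned loss may be None,
    # so the check is skipped instead of crashing on a None value.
    if not (terminate_on_nan and automatic_optimization):
        return
    if not torch.isfinite(loss).all():
        raise ValueError(f"The loss returned in `training_step` is {loss}.")
```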
In #4204 we'll make it clearer in the docs that you should report values within the training step :)
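For example, something along these lines inside a manual-optimization `training_step` (the metric name is arbitrary); `self.log(..., prog_bar=True)` puts the reported value in the progress bar instead of the missing automatic loss:

```python
def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    x, y = batch
    loss = torch.nn.functional.mse_loss(self(x), y)
    self.manual_backward(loss, opt)
    opt.step()
    opt.zero_grad()
    # Report the loss explicitly so the progress bar shows a real value
    # rather than nan when using manual optimization.
    self.log("train_loss", loss, prog_bar=True)
```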