Hey,
In particular, it's this line of code: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L305
After running the attention scores through softmax, some (but not all) scores get set to NaN. I followed the Amp guide pretty much letter for letter when setting up the model, but I can post any code you think is relevant. I had this code training successfully (on a different machine) with the old API, so maybe this has something to do with how softmax is being patched in the new implementation?
Thanks!
O2 doesn't do any patching at all, and it does not correspond to the thing called "amp" in the old API. O1 is what corresponds to the old "amp", and O1 should be patching softmax to run in FP32. Do you still see NaNs with O1?
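If NaNs do persist, one thing worth ruling out is that the scores already contain inf before softmax runs (for instance from an FP16 overflow upstream), since no softmax implementation can recover from that. Below is a minimal NumPy sketch of that failure mode. It is not apex or the BERT code, and the score values are made up purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

# Hypothetical attention scores; 7e4 exceeds the largest finite FP16 value (65504).
scores_fp32 = np.array([[1.0, 2.0, 3.0],
                        [7.0e4, 1.0, 2.0]], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    # Casting to FP16 turns 7e4 into inf, so the second row contains inf.
    scores_fp16 = scores_fp32.astype(np.float16)
    # inf - inf = NaN inside the max subtraction, which poisons that row only.
    probs_fp16 = softmax(scores_fp16)

# The same scores kept in FP32 stay finite and give a valid distribution.
probs_fp32 = softmax(scores_fp32)

print(np.isnan(probs_fp16).any(axis=-1))  # which FP16 rows contain NaN
print(np.isnan(probs_fp32).any(axis=-1))  # which FP32 rows contain NaN
```

Only the row that overflowed turns into NaN, which would match "some but not all" scores being affected. In a PyTorch model, the analogous check would be to test the tensor passed to softmax for inf before blaming softmax itself.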
The softmax forward pass should be benign, since it produces values between 0 and 1 and the underlying PyTorch implementation is pretty smart. The backward pass may have dynamic range issues, because the gradient with respect to each input element requires a reduction over the whole row.
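To make the backward-pass concern concrete: for y = softmax(x) and upstream gradient g, the input gradient is y_i * (g_i - sum_j g_j * y_j), and the sum is the reduction in question. The NumPy sketch below is an illustration only, not the PyTorch kernel. The values are contrived (imagine upstream gradients that are large because of loss scaling) so that an intermediate overflows in FP16 even though the final gradient would fit comfortably.

```python
import numpy as np

def softmax_backward(y, g):
    # Gradient of softmax w.r.t. its input, given output y and upstream grad g.
    # The sum over the last axis is the per-row reduction.
    dot = np.sum(g * y, axis=-1, keepdims=True)
    return y * (g - dot)

# Hypothetical values: a peaked softmax output and large upstream gradients.
y = np.array([0.001, 0.999])
g = np.array([60000.0, -60000.0])

grad_fp32 = softmax_backward(y.astype(np.float32), g.astype(np.float32))

with np.errstate(over="ignore", invalid="ignore"):
    # In FP16, g[0] - dot is about 120000, past the FP16 maximum of 65504,
    # so the intermediate becomes inf before it is scaled down by y[0].
    grad_fp16 = softmax_backward(y.astype(np.float16), g.astype(np.float16))

print(grad_fp32)  # finite, roughly +120 and -120
print(grad_fp16)  # first element overflows to inf
```

Every input, and the true result, is representable in FP16 here; only the intermediate difference is not, which is the kind of issue that running the op in FP32 avoids.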
Ah okay, the patching makes sense. I'm not certain, but for now I no longer seem to be getting NaN output.
@dave-epstein I ran into a similar problem. How did you solve the NaN issue?