Apex: NaNs after torch.nn.Softmax with Amp O2 model?

Created on 27 Mar 2019 · 3 Comments · Source: NVIDIA/apex

Hey,

The line of code in question is: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L305

After running attention scores through softmax, some (but not all) scores get set to NaN. I followed the Amp guide pretty much to the letter when setting up the model, but I can post any code you think is relevant. This code trained successfully (on a different machine) with the old API, so maybe this has something to do with how Softmax is patched in the new implementation?

Thanks!

All 3 comments

O2 doesn't do any patching at all, and it does not correspond to what was called "amp" in the old API. O1 is what corresponds to the old "amp", and O1 should patch softmax to run in FP32. Do you still see NaNs with O1?

The softmax forward pass should be benign, since it produces values between 0 and 1 and the underlying PyTorch implementation is pretty smart. The backward pass may have dynamic range issues because the gradient with respect to each input element requires a reduction.
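To make the forward/backward distinction concrete, here is a rough standard-library simulation. It is not PyTorch's kernel: half precision is emulated by rounding through `struct`'s `'e'` format, all function names are made up for illustration, and the max-subtraction step is only assumed to be the kind of safeguard meant by "pretty smart". The backward function uses the standard softmax derivative, where `sum_j g_j * y_j` is the reduction mentioned above.

```python
import math
import struct


def to_fp16(x):
    """Round a float to the nearest half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses magnitudes beyond the FP16 range; treat them as +/-inf.
        return math.copysign(math.inf, x)


def softmax_fp16_naive(scores):
    """Softmax with FP16-rounded intermediates and no max-subtraction."""
    exps = [to_fp16(math.exp(s)) for s in scores]
    total = to_fp16(sum(exps))
    # If an exponential overflowed, this divides inf by inf and yields NaN.
    return [to_fp16(e / total) for e in exps]


def softmax_fp16_stable(scores):
    """Same, but subtracting the max first so every exponential is <= 1."""
    m = max(scores)
    exps = [to_fp16(math.exp(to_fp16(s - m))) for s in scores]
    total = to_fp16(sum(exps))
    return [to_fp16(e / total) for e in exps]


def softmax_backward(y, g):
    """Gradient wrt the inputs: y_i * (g_i - sum_j g_j * y_j), in full precision."""
    dot = sum(gj * yj for gj, yj in zip(g, y))  # the reduction over all elements
    return [yi * (gi - dot) for yi, gi in zip(y, g)]


scores = [12.0, 1.0, 0.0]
print(softmax_fp16_naive(scores))   # exp(12) exceeds the FP16 maximum of 65504
print(softmax_fp16_stable(scores))
print(softmax_backward(softmax_fp16_stable(scores), [1.0, -2.0, 0.5]))
```

In the naive version only the element whose exponential overflows turns into NaN, which at least resembles the "some but not all" symptom, but that is an illustration of one possible mechanism rather than a diagnosis of this issue. The backward reduction is kept in full precision here; it marks the place where FP16 range or rounding could matter.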

Ah okay, the patching makes sense. I'm not sure, but it seems that for now I'm not getting NaN output anymore.

@dave-epstein I ran into a similar problem. How did you solve the NaN issue?
