Transformers: T5Model in fp16 still yields nan with more complex examples

Created on 26 May 2020 · 7 comments · Source: huggingface/transformers

🐛 Bug

Hello, thank you for the recent PR with the fp16 fixes. It seems to work well with short inputs, but once the model is fed more complex data it still yields nans.

Information

Model I am using: T5

Language I am using the model on: English

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Run the code:

from transformers import T5Model
import torch

model = T5Model.from_pretrained("t5-base").cuda().half().eval()
inputs = torch.tensor([[37,423,215,1504,13,8,1186,10670,11,10449,49,1152,11363,15465,1514,5,4433,399,7863,24766,15,17,965,594,5386,14286,28,8,6,5,755,5781,32099,993,3744,21,8,2367,18,458,53,16616,32098,16,32097,7660,16409,77,19,3,107,13164,1054,32096,993,1970,9368,948,147,8,15465,5861,87,25481,788,12,8,32095,1300,61,37,423,215,1504,13,3,24151,40,3,19668,594,5386,14286,28,8,3,115,13164]]).cuda()
decoder_input_ids = torch.tensor([[21820, 296, 55]]).cuda()

out = model(input_ids=inputs, decoder_input_ids=decoder_input_ids)
# encoder outputs
out[2][:,:2]

output:

tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward>)

Expected behavior

Output with non-nan values.

Environment info

  • transformers version: 2.10.0
  • Platform: Linux-4.15.0-88-generic-x86_64-with-debian-buster-sid
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no
Label: wontfix

Most helpful comment

@patrickvonplaten Even with O1 I tried fine-tuning T5-base, and in fewer than 100 iterations it converges to nan values. It seems the stability of this model is poor; perhaps the first few iterations of fine-tuning need to run in FP32.

All 7 comments

I got the same issue; it seems to happen with the larger models (t5-small is fine).

I can reproduce the error - will investigate :-)

Okay, this took me quite some time to figure out...

So here is what happens. When all modules are set to half precision, as in the code snippet above, at some point in this line:
https://github.com/huggingface/transformers/blob/acaa2e6267ebfda9814795fa00b6ad86c35ea5d6/src/transformers/modeling_t5.py#L188
the tensor layer_output contains inf values, and then later in:
https://github.com/huggingface/transformers/blob/acaa2e6267ebfda9814795fa00b6ad86c35ea5d6/src/transformers/modeling_t5.py#L156
nan values enter the game...
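
In case it helps with reproducing this kind of diagnosis, here is a minimal sketch (not part of the original investigation; the hook logic is my own assumption) that registers forward hooks on every submodule and reports the ones whose outputs contain inf/nan, which is one way to narrow the problem down to the lines linked above:

from transformers import T5Model
import torch

model = T5Model.from_pretrained("t5-base").cuda().half().eval()

def make_hook(name):
    def hook(module, inputs, output):
        # T5 submodules often return tuples; check every tensor in the output
        tensors = output if isinstance(output, tuple) else (output,)
        for t in tensors:
            if torch.is_tensor(t) and not torch.isfinite(t).all():
                print("non-finite values in output of:", name)
    return hook

for name, module in model.named_modules():
    module.register_forward_hook(make_hook(name))

# then run the forward pass from the snippet above and watch which
# submodules are the first to report non-finite outputs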

I don't really think this is a bug in T5; it's just due to T5's rather unstable architecture in half precision. model.half() essentially corresponds to apex opt level O3 (https://nvidia.github.io/apex/amp.html#o3-fp16-training), which in itself tends to become unstable...

So, using your code above but with the apex package instead of calling half() on the model, you can observe the following. This snippet, which is essentially the same as yours:

from transformers import T5Model
from apex import amp
import torch

model = T5Model.from_pretrained("t5-base").cuda().eval()
model = amp.initialize(model, opt_level="O3") 

inputs = torch.tensor([[37,423,215,1504,13,8,1186,10670,11,10449,49,1152,11363,15465,1514,5,4433,399,7863,24766,15,17,965,594,5386,14286,28,8,6,5,755,5781,32099,993,3744,21,8,2367,18,458,53,16616,32098,16,32097,7660,16409,77,19,3,107,13164,1054,32096,993,1970,9368,948,147,8,15465,5861,87,25481,788,12,8,32095,1300,61,37,423,215,1504,13,3,24151,40,3,19668,594,5386,14286,28,8,3,115,13164]]).cuda()
decoder_input_ids = torch.tensor([[21820, 296, 55]]).cuda()

out = model(input_ids=inputs, decoder_input_ids=decoder_input_ids)
# encoder outputs
out[2][:,:2]  # nan output

yields the same output consisting of nan values. The same happens for opt_level O2.
Using the recommended O1 level of optimization:

from transformers import T5Model
from apex import amp
import torch

model = T5Model.from_pretrained("t5-base").cuda().eval()
model = amp.initialize(model, opt_level="O1") 

inputs = torch.tensor([[37,423,215,1504,13,8,1186,10670,11,10449,49,1152,11363,15465,1514,5,4433,399,7863,24766,15,17,965,594,5386,14286,28,8,6,5,755,5781,32099,993,3744,21,8,2367,18,458,53,16616,32098,16,32097,7660,16409,77,19,3,107,13164,1054,32096,993,1970,9368,948,147,8,15465,5861,87,25481,788,12,8,32095,1300,61,37,423,215,1504,13,3,24151,40,3,19668,594,5386,14286,28,8,3,115,13164]]).cuda()
decoder_input_ids = torch.tensor([[21820, 296, 55]]).cuda()

out = model(input_ids=inputs, decoder_input_ids=decoder_input_ids)
# encoder outputs
out[2][:,:2]  # valid output

however, does not produce any nan values. As far as I know, O1 is also the recommended setting (https://nvidia.github.io/apex/amp.html#o1-mixed-precision-recommended-for-typical-use), and it can already speed up your calculations considerably and save quite a bit of memory, so I would recommend going with it.
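
For reference, a minimal O1 fine-tuning loop might look like the sketch below. The optimizer, learning rate, and dataloader are placeholders I made up, and note that in transformers 2.10.x the label argument of T5ForConditionalGeneration is lm_labels (newer releases renamed it to labels):

import torch
from apex import amp
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base").cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder choice
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for input_ids, labels in dataloader:  # placeholder dataloader
    optimizer.zero_grad()
    loss = model(input_ids=input_ids.cuda(), lm_labels=labels.cuda())[0]
    # amp scales the loss so that fp16 gradients do not underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()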

Also pinging @mfuntowicz, @julien-c and @LysandreJik for verification

@patrickvonplaten Even with O1 I tried fine-tuning T5-base, and in fewer than 100 iterations it converges to nan values. It seems the stability of this model is poor; perhaps the first few iterations of fine-tuning need to run in FP32.
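
If the divergence is caused by a few bad updates, one alternative worth trying (my own suggestion, and it requires PyTorch >= 1.6 rather than the 1.4 listed in the environment above) is native torch.cuda.amp: the model weights stay in FP32 and GradScaler automatically skips any optimizer step whose gradients contain inf/nan. A minimal sketch, with a placeholder dataloader and a labels argument that assumes a recent transformers release:

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base").cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder choice
scaler = torch.cuda.amp.GradScaler()

for input_ids, labels in dataloader:  # placeholder dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(input_ids=input_ids.cuda(), labels=labels.cuda())[0]
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # the step is skipped if inf/nan gradients are found
    scaler.update()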

~~I am having issues even in fp32 with everything besides t5-small.~~
I am having issues in O1 with t5-large and t5-base.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Having the same issue with the loss going to nan when fine-tuning t5-base with fp16. t5-small works fine though.
