Hello, thank you for the recent PR with fp16 fixes. It seems to work well with short inputs, but once the model is fed more complex data it still yields nans.
Model I am using: T5
Language I am using the model on: English
To reproduce, run the following code:
from transformers import T5Model
import torch
model = T5Model.from_pretrained("t5-base").cuda().half().eval()
inputs = torch.tensor([[37,423,215,1504,13,8,1186,10670,11,10449,49,1152,11363,15465,1514,5,4433,399,7863,24766,15,17,965,594,5386,14286,28,8,6,5,755,5781,32099,993,3744,21,8,2367,18,458,53,16616,32098,16,32097,7660,16409,77,19,3,107,13164,1054,32096,993,1970,9368,948,147,8,15465,5861,87,25481,788,12,8,32095,1300,61,37,423,215,1504,13,3,24151,40,3,19668,594,5386,14286,28,8,3,115,13164]]).cuda()
decoder_input_ids = torch.tensor([[21820, 296, 55]]).cuda()
out = model(input_ids=inputs, decoder_input_ids=decoder_input_ids)
# encoder outputs
out[2][:,:2]
output:
tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0',
dtype=torch.float16, grad_fn=<SliceBackward>)
Expected behavior: output with non-nan values.
transformers version: 2.10.0

I got the same issue - seems to happen with the larger models (t5-small is fine)
I can reproduce the error - will investigate :-)
Okay, this took me quite some time to figure out...
So what happens is the following: when all modules are cast to half precision, as in the code snippet above, at some point in this line:
https://github.com/huggingface/transformers/blob/acaa2e6267ebfda9814795fa00b6ad86c35ea5d6/src/transformers/modeling_t5.py#L188
the tensor layer_output contains inf values and then later in:
https://github.com/huggingface/transformers/blob/acaa2e6267ebfda9814795fa00b6ad86c35ea5d6/src/transformers/modeling_t5.py#L156
nan values enter the game...
I don't really think this is a bug in T5; it's rather due to T5's numerically unstable architecture. model.half() essentially corresponds to apex opt level O3: https://nvidia.github.io/apex/amp.html#o3-fp16-training which in itself tends to become unstable...
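To illustrate the mechanism (a minimal sketch added for clarity, not taken from the original reports): fp16 can only represent values up to 65504, so an intermediate activation that is unremarkable in fp32 overflows to inf once the whole model runs in half precision, and a subsequent operation such as inf - inf then yields nan:

```python
import torch

# fp16 has a maximum representable value of 65504
print(torch.finfo(torch.float16).max)      # 65504.0

# a value that is harmless in fp32 overflows once cast to half precision
big = torch.tensor([1.0e5])                # fp32
print(big.half())                          # tensor([inf], dtype=torch.float16)

# and once inf appears, nan follows quickly, e.g. via inf - inf
print(big.half() - big.half())             # tensor([nan], dtype=torch.float16)
```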
So, using your code above but with the apex package instead of calling half() on the model, you can observe the following. This code snippet, which is essentially the same as yours:
from transformers import T5Model
from apex import amp
import torch
model = T5Model.from_pretrained("t5-base").cuda().eval()
model = amp.initialize(model, opt_level="O3")
inputs = torch.tensor([[37,423,215,1504,13,8,1186,10670,11,10449,49,1152,11363,15465,1514,5,4433,399,7863,24766,15,17,965,594,5386,14286,28,8,6,5,755,5781,32099,993,3744,21,8,2367,18,458,53,16616,32098,16,32097,7660,16409,77,19,3,107,13164,1054,32096,993,1970,9368,948,147,8,15465,5861,87,25481,788,12,8,32095,1300,61,37,423,215,1504,13,3,24151,40,3,19668,594,5386,14286,28,8,3,115,13164]]).cuda()
decoder_input_ids = torch.tensor([[21820, 296, 55]]).cuda()
out = model(input_ids=inputs, decoder_input_ids=decoder_input_ids)
# encoder outputs
out[2][:,:2] # nan output
yields the same output consisting of nan values. The same happens for opt_level O2.
Using the recommended O1 level of optimization:
from transformers import T5Model
from apex import amp
import torch
model = T5Model.from_pretrained("t5-base").cuda().eval()
model = amp.initialize(model, opt_level="O1")
inputs = torch.tensor([[37,423,215,1504,13,8,1186,10670,11,10449,49,1152,11363,15465,1514,5,4433,399,7863,24766,15,17,965,594,5386,14286,28,8,6,5,755,5781,32099,993,3744,21,8,2367,18,458,53,16616,32098,16,32097,7660,16409,77,19,3,107,13164,1054,32096,993,1970,9368,948,147,8,15465,5861,87,25481,788,12,8,32095,1300,61,37,423,215,1504,13,3,24151,40,3,19668,594,5386,14286,28,8,3,115,13164]]).cuda()
decoder_input_ids = torch.tensor([[21820, 296, 55]]).cuda()
out = model(input_ids=inputs, decoder_input_ids=decoder_input_ids)
# encoder outputs
out[2][:,:2] # valid output
however, does not produce any nan values. As far as I know, O1 is also the recommended setting: https://nvidia.github.io/apex/amp.html#o1-mixed-precision-recommended-for-typical-use .
O1 can already speed up your computations considerably and save quite a bit of memory, so I would recommend going with it.
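For reference, here is a hedged sketch of what O1 fine-tuning could look like with apex's documented amp.initialize / amp.scale_loss pattern. The token ids, optimizer, and hyperparameters below are placeholders, and it assumes a transformers version in which T5ForConditionalGeneration accepts a labels argument:

```python
from transformers import T5ForConditionalGeneration
from apex import amp
import torch

model = T5ForConditionalGeneration.from_pretrained("t5-base").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# O1 patches whitelisted ops (e.g. matmuls) to run in fp16, keeps blacklisted ops
# (e.g. softmax) in fp32, and handles dynamic loss scaling via amp.scale_loss
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model.train()

input_ids = torch.tensor([[37, 423, 215, 1504, 13, 8, 1186, 1]]).cuda()  # placeholder token ids
labels = torch.tensor([[21820, 296, 55, 1]]).cuda()                      # placeholder target ids

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs[0]  # the LM loss comes first when labels are passed

optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:  # scaled backward pass
    scaled_loss.backward()
optimizer.step()
```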
Also pinging @mfuntowicz, @julien-c and @LysandreJik for verification
@patrickvonplaten Even with O1, I tried fine-tuning T5-base and it converges to nan values in fewer than 100 iterations. The numerical stability of this model seems poor. Perhaps the first few iterations of fine-tuning need to run in FP32.
~I am having issues even in fp32 with everything besides t5-small.~
I am having issues in O1 with t5-large and t5-base.
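One way to narrow down reports like these (a diagnostic sketch added here, not something proposed in the thread) is to register forward hooks on every submodule and print the modules whose outputs contain inf or nan during a half-precision forward pass; the first name printed is the earliest offender:

```python
import torch
from transformers import T5Model

model = T5Model.from_pretrained("t5-base").cuda().half().eval()

def make_hook(name):
    def hook(module, inputs, output):
        # submodule outputs may be tensors or tuples that also contain None entries
        tensors = output if isinstance(output, (tuple, list)) else (output,)
        for t in tensors:
            if torch.is_tensor(t) and (torch.isinf(t).any() or torch.isnan(t).any()):
                print(f"inf/nan in output of: {name}")
                break
    return hook

for name, module in model.named_modules():
    module.register_forward_hook(make_hook(name))

# now run the forward pass from the reproduction snippet above;
# the first name printed is the earliest module whose output overflowed
```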
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Having the same issue with loss going to nan when fine-tuning t5-base with fp16. t5-small works fine though.