I keep getting the following error whenever I try to train with APEX/fp16 while finetuning BERT. It happens with my own scripts, and I also see it with the repository's standard finetune_on_pregenerated.py, which was recently updated. The error diagnostics seem to point at FusedLayerNorm. To confirm, I made a local mod replacing the definition of BertLayerNorm with
BertLayerNorm = torch.nn.LayerNorm
The change resolves the issue (and, in my case, does not noticeably change performance). The Apex docs are a bit raw, but the most recent version does not suggest manually manipulating optimizers or layer definitions, so perhaps we should just stick to the BertLayerNorm definition described above?
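For context, the same replacement can also be applied from user code without editing the installed package. A minimal sketch (my own illustration, not from this thread; it assumes the model is constructed after the override):

import torch
from pytorch_transformers import modeling_bert, BertForPreTraining

# Point the module-level alias at the plain PyTorch LayerNorm so every BERT
# submodule built afterwards skips apex's FusedLayerNorm (illustrative workaround,
# not an officially documented API).
modeling_bert.BertLayerNorm = torch.nn.LayerNorm

model = BertForPreTraining.from_pretrained("bert-base-uncased")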
Traceback (most recent call last):
File "ash3/tune_bert.py", line 101, in <module>
main(sys.argv[1:])
File "ash3/tune_bert.py", line 47, in main
pregenerate(init)
File "ash3/tune_bert.py", line 85, in pregenerate
finetune_on_pregenerated(tune_args)
File "/home/madvillain/gitlab/ai/ash3/ash3/finetuning/finetune_on_pregenerated.py", line 292, in main
outputs = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 785, in forward
prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 533, in forward
prediction_scores = self.predictions(sequence_output)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 501, in forward
hidden_states = self.transform(hidden_states)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 483, in forward
hidden_states = self.LayerNorm(hidden_states)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 159, in forward
input, self.weight, self.bias, self.normalized_shape,self.eps)
File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 25, in forward
input_, ctx.normalized_shape, weight_, bias_, ctx.eps)
RuntimeError: expected scalar type Half but found Float (data<c10::Half> at /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:1386)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f6af587edc5 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Half* at::Tensor::data<c10::Half>() const + 0x2c6 (0x7f6abeb8aa36 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #2: cuda_layer_norm(at::Tensor*, at::Tensor*, at::Tensor*, at::Tensor*, int, int, c10::ArrayRef<long>, at::Tensor*, at::Tensor*, double) + 0x3ed (0x7f6abeb87dcd in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: layer_norm_affine(at::Tensor, c10::ArrayRef<long>, at::Tensor, at::Tensor, double) + 0x27a (0x7f6abeb7985a in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x196c4 (0x7f6abeb866c4 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x16e0a (0x7f6abeb83e0a in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #12: THPFunction_apply(_object*, _object*) + 0x691 (0x7f6b24b0a081 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Model I am using (Bert, XLNet....): BERT
Language I am using the model on (English, Chinese....): English
The problem arises when using: the official example script (finetune_on_pregenerated.py) as well as my own scripts.
The task I am working on is: BERT LM finetuning with APEX/fp16 on my own data.
Expected behavior: training runs with no failures.
Yes, that's what we do now on master since #1089 (switching back to torch.nn.LayerNorm).
Thanks for reporting
@thomwolf Yes, thank you for your response! I wanted to clarify: if I run with fp16, I still see that master is doing
try:
    from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
except (ImportError, AttributeError) as e:
    logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
    BertLayerNorm = torch.nn.LayerNorm
https://github.com/huggingface/pytorch-transformers/commit/bdb4409ed8de4d199907c75832398f2c49a564e1
and in my case FusedLayerNorm seems to cause the issue, so maybe we are talking about different things. Or did you mean that this is a work in progress and has not been merged to master yet?
Oh indeed, maybe it's an issue with finetune_on_pregenerated.py. The scripts in the lm_finetuning folder are in the process of being deprecated. You can try the newly added run_lm_finetuning.py, which is actively maintained.
Setting --fp16_opt_level to O2 resolved that error for me.
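For anyone mapping that flag back to apex: it corresponds to the opt_level passed to amp.initialize in the example scripts. A minimal sketch of the difference (my own illustration; the model and optimizer here are just placeholders):

import torch
from apex import amp
from pytorch_transformers import BertConfig, BertForPreTraining

model = BertForPreTraining(BertConfig()).cuda()            # randomly initialized, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# "O2" casts the model's weights to half precision, so the fused LayerNorm kernel
# sees half inputs and half parameters; "O1" keeps weights in float32 and only
# patches selected functions, which is the combination that fails as described below.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")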
@mksenzov I have the same exact issue. Was wondering if you figured it out?
I'm getting the same issue using an optimization level of "O1" while running run_lm_finetuning. Is this expected? "O2" seems to work just fine.
The problem is that under O1 this model enters FusedLayerNorm.forward with the input in half precision while its parameters are still in single precision, and apparently the kernel doesn't support mixed types (neither does PyTorch's nn.LayerNorm). Under O2, by contrast, the parameters are cast to half, so the issue doesn't occur.
I believe there's no reason FusedLayerNorm should be used merely because apex is available: the user may want to use O1 (or not use amp at all), and FusedLayerNorm is incompatible with that. nn.LayerNorm, on the other hand, is blacklisted in the amp initialization, so its input is always cast to float32 under O1, while FusedLayerNorm is not blacklisted.
Plus, nn.LayerNorm is probably fused as well, and in my tests on a V100 it proved faster with both float32 and float16.
Could we also remove the FusedLayerNorm call in modeling_xlnet?
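A minimal repro sketch of the mismatch described above (my own illustration; assumes apex is installed and a CUDA device is available):

import torch
from apex.normalization.fused_layer_norm import FusedLayerNorm

x = torch.randn(2, 128, 768, device="cuda").half()   # half-precision activations, as cast under O1

fused = FusedLayerNorm(768).cuda()                    # parameters stay float32, as under O1
try:
    fused(x)
except RuntimeError as err:
    print(err)                                        # expected scalar type Half but found Float

# nn.LayerNorm is on amp's float32 blacklist, so under O1 amp casts its input back to
# float32 before the call and the dtypes always match.
plain = torch.nn.LayerNorm(768).cuda()
print(plain(x.float()).dtype)                         # torch.float32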