Transformers: apex fp16 FusedLayerNorm type issues

Created on 1 Sep 2019 · 8 comments · Source: huggingface/transformers

🐛 Bug

I seem to be getting the following error each time I try to train with apex/fp16 during BERT fine-tuning. It happens with my own scripts, and I also see it with the repository's standard finetune_on_pregenerated.py, which was recently updated. The error diagnostics point to an issue with FusedLayerNorm. To further confirm this, I made a local mod replacing the definition of BertLayerNorm with

BertLayerNorm = torch.nn.LayerNorm

This change resolves the issue (while, in my case, not noticeably changing performance). The apex docs are a bit raw, but the most recent set no longer suggests manually manipulating optimizers or layer definitions; perhaps we should just stick to the BertLayerNorm definition as described above?
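
For reference, one way to apply the same change without editing the installed package is a monkey-patch. A minimal, untested sketch (the checkpoint name is illustrative, and the patch must run before the model is constructed, since BertLayerNorm is resolved at construction time):

import torch
from pytorch_transformers import modeling_bert

# Override the apex-backed alias with the pure-PyTorch implementation.
modeling_bert.BertLayerNorm = torch.nn.LayerNorm

# Any model built after the patch will use torch.nn.LayerNorm.
from pytorch_transformers import BertForPreTraining
model = BertForPreTraining.from_pretrained("bert-base-uncased")

The full traceback with FusedLayerNorm in place: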

Traceback (most recent call last):
  File "ash3/tune_bert.py", line 101, in <module>
    main(sys.argv[1:])
  File "ash3/tune_bert.py", line 47, in main
    pregenerate(init)
  File "ash3/tune_bert.py", line 85, in pregenerate
    finetune_on_pregenerated(tune_args)
  File "/home/madvillain/gitlab/ai/ash3/ash3/finetuning/finetune_on_pregenerated.py", line 292, in main
    outputs = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 785, in forward
    prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 533, in forward
    prediction_scores = self.predictions(sequence_output)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 501, in forward
    hidden_states = self.transform(hidden_states)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/pytorch_transformers/modeling_bert.py", line 483, in forward
    hidden_states = self.LayerNorm(hidden_states)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 159, in forward
    input, self.weight, self.bias, self.normalized_shape,self.eps)
  File "/home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 25, in forward
    input_, ctx.normalized_shape, weight_, bias_, ctx.eps)
RuntimeError: expected scalar type Half but found Float (data<c10::Half> at /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:1386)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f6af587edc5 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Half* at::Tensor::data<c10::Half>() const + 0x2c6 (0x7f6abeb8aa36 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #2: cuda_layer_norm(at::Tensor*, at::Tensor*, at::Tensor*, at::Tensor*, int, int, c10::ArrayRef<long>, at::Tensor*, at::Tensor*, double) + 0x3ed (0x7f6abeb87dcd in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: layer_norm_affine(at::Tensor, c10::ArrayRef<long>, at::Tensor, at::Tensor, double) + 0x27a (0x7f6abeb7985a in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x196c4 (0x7f6abeb866c4 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x16e0a (0x7f6abeb83e0a in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #12: THPFunction_apply(_object*, _object*) + 0x691 (0x7f6b24b0a081 in /home/madvillain/miniconda3/envs/ash3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Model I am using (Bert, XLNet....): BERT

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • [x] the official example scripts: (give details)
  • [ ] my own modified scripts: (give details)

The task I am working on is:

  • [x] an official GLUE/SQuAD task: (give the name) finetune_on_pregenerated.py
  • [ ] my own task or dataset: (give details)

Expected behavior

no failures

Environment

  • OS: Ubuntu 18.04
  • Python version: 3.6
  • PyTorch version: 1.1.0, 1.2.0
  • PyTorch Transformers version (or branch): 1.1.0
  • Using GPU? yes
  • Distributed or parallel setup? no
  • Any other relevant information: cudatoolkit 10.0, APEX git hash code: 53eae1986320d016ee7b347d78839dd5e96e7e93

Most helpful comment

The problem is that in O1 this model enters FusedLayerNorm.forward with the input in half precision while its parameters are still in single precision, and apparently the kernel doesn't support mixed types (neither does PyTorch's nn.LayerNorm). In O2, by contrast, the parameters are cast to half as well, so the issue doesn't occur.

I believe FusedLayerNorm shouldn't be used just because apex is available: the user may want to use O1, and FusedLayerNorm is incompatible with it. nn.LayerNorm, on the contrary, is blacklisted in the amp initialization, so its input will always be float32 in O1, while FusedLayerNorm is not blacklisted.

Plus, nn.LayerNorm is probably fused as well, and in my tests it proved faster on a V100 with both float32 and float16.

All 8 comments

Yes, that's what we do now on master since #1089 (switching back to torch.nn.LayerNorm).

Thanks for reporting

@thomwolf yes, thank you for your response! I wanted to clarify: if I use fp16, I still see that master is doing

try:
    from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
except (ImportError, AttributeError) as e:
    logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
    BertLayerNorm = torch.nn.LayerNorm

https://github.com/huggingface/pytorch-transformers/commit/bdb4409ed8de4d199907c75832398f2c49a564e1

and in my case FusedLayerNorm seems to cause the issue... so maybe we are talking about different things. Or did you mean that this is a work in progress and has not been merged to master yet?
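
A quick, illustrative way to check which implementation an installed version actually picked up, given the fallback logic above:

from pytorch_transformers import modeling_bert

# Prints apex's FusedLayerNorm if apex imported successfully, torch.nn.LayerNorm otherwise.
print(modeling_bert.BertLayerNorm)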

Oh indeed, maybe it's an issue with finetune_on_pregenerated.py. The scripts in the lm_finetuning folder are in the process of being deprecated. You can try the newly added run_lm_finetuning.py, which is actively maintained.

Setting --fp16_opt_level to O2 resolved that error for me.
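
For context, that flag maps to apex's amp opt_level. A hedged, self-contained sketch of the equivalent direct call (the model and optimizer here are placeholders):

import torch
from apex import amp

# Illustrative model/optimizer. O2 casts the weights (including LayerNorm parameters)
# to half precision, so FusedLayerNorm sees matching dtypes; O1 leaves them in float32.
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")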

@mksenzov I have the same exact issue. Was wondering if you figured it out?

I'm getting the same issue using an optimization level of "O1" while running run_lm_finetuning. Is this expected? "O2" seems to work just fine.

The problem is that in O1 this model enters FusedLayerNorm.forward with the input in half precision while its parameters are still in single precision, and apparently the kernel doesn't support mixed types (neither does PyTorch's nn.LayerNorm). In O2, by contrast, the parameters are cast to half as well, so the issue doesn't occur.

I believe FusedLayerNorm shouldn't be used just because apex is available: the user may want to use O1, and FusedLayerNorm is incompatible with it. nn.LayerNorm, on the contrary, is blacklisted in the amp initialization, so its input will always be float32 in O1, while FusedLayerNorm is not blacklisted.

Plus, nn.LayerNorm is probably fused as well, and in my tests it proved faster on a V100 with both float32 and float16.
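
A minimal repro sketch of the mismatch described above (assumes a CUDA build of apex; the shapes are arbitrary):

import torch
from apex.normalization import FusedLayerNorm

ln = FusedLayerNorm(768).cuda()                     # parameters remain float32
x = torch.randn(2, 128, 768, device="cuda").half()  # activations arrive in half under O1
ln(x)  # RuntimeError: expected scalar type Half but found Float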

Could we also remove the FusedLayerNorm call in modeling_xlnet?
