What's the advantage of using the FusedLayerNorm over torch.nn.LayerNorm? I'm running into an issue with using TorchScript and I'm wondering if I can replace the former with the latter.
The deeper question is: Is the apex version of layer norm significantly optimized over the standard pytorch version or is it simply a legacy of when pytorch did not have a built in layer norm function?
FusedLayerNorm should give a speedup compared to torch.nn.LayerNorm.
Gist for profiling: https://gist.github.com/ptrblck/8b1c6a7efd97604a7dedbf2c3edd1019
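For readers who cannot open the gist, a timing loop of this kind can be sketched as below. This is not the gist's code: `benchmark` and its parameters are hypothetical names chosen for illustration. The one point it tries to show is that CUDA kernels launch asynchronously, so the clock should only be read after a synchronization call.

```python
import time


def benchmark(fn, iterations=1000, warmup=10, sync=None):
    """Return total wall-clock seconds for `iterations` calls of `fn`.

    `sync` is an optional callable that blocks until queued device work
    has finished (hypothetical hook; for CUDA one would pass
    torch.cuda.synchronize).
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    if sync is not None:
        sync()  # wait for queued work to finish before stopping the clock
    return time.perf_counter() - start


# Assumed usage with PyTorch on a CUDA device (not executed here):
#   import torch
#   x = torch.randn(512, 16, 1024, device="cuda")
#   norm = torch.nn.LayerNorm(1024).cuda()
#   print("upstream layernorm",
#         benchmark(lambda: norm(x), sync=torch.cuda.synchronize))
```

The same helper could time an apex `FusedLayerNorm` module by passing a different callable, so both implementations go through an identical measurement path.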
I'm trying that gist with 10,000 iterations on a V100 and torch.nn.LayerNorm is faster:
upstream layernorm 32.502
apex layernorm 33.823
The gap is even larger if I convert the input to half and call norm.half() and fused_norm.half():
upstream layernorm 23.555
apex layernorm 31.152
I put it in a gist: https://gist.github.com/bryant1410/d88a42a4b1a3c2989a1db6c79f07e045
@bryant1410 Thanks for reporting!
I'm not sure where the perf regression comes from.
However, @zasdfgbnm is porting our FusedLayerNorm approach to PyTorch in this PR, which should hopefully land soon.
This is what I get on the DGX station (Tesla V100 GPU) with the master-py3-devel docker:
upstream layernorm 31.495
apex layernorm 32.434
upstream half layernorm 22.503
apex half layernorm 29.342
With pytorch and apex master compiled from source:
upstream layernorm 26.903
apex layernorm 32.299
upstream half layernorm 20.610
apex half layernorm 29.691
Pull request https://github.com/pytorch/pytorch/pull/26201 ("upstream" means layernorm implementation ported from APEX):
upstream layernorm 39.049
apex layernorm 32.504
upstream half layernorm 35.924
apex half layernorm 29.523
Btw, I ran mine with the commit 880ab92, not with master.
Is the fused LN heavily optimized for transformer-like applications? I ask because I do get a big speedup for the standard NLP representation (T, B, C) when taking the norm across C.
upstream layernorm 2.036
apex layernorm 0.620
upstream half layernorm 1.470
apex half layernorm 0.473
@ngoyal2707 Which upstream version are you using? The layernorm in upstream has been improved a lot recently.
I observe the same issue as @ngoyal2707 on PyTorch 1.5 -- torch.nn.LayerNorm is slower than apex.FusedLayerNorm for shapes typical in NLP models. For example: (512, 16, 1024) with normalization over the last dimension is slower using torch.nn.LayerNorm.
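To make "normalization over the last dimension" concrete, here is a minimal pure-Python sketch of the operation being benchmarked. It is only a reference for the math, not how either library implements it: the function name `layer_norm_last_dim` is hypothetical, the learnable weight and bias are omitted, and `eps=1e-5` is assumed to match the `torch.nn.LayerNorm` default.

```python
import math


def layer_norm_last_dim(rows, eps=1e-5):
    """Normalize each row (a list of C floats) to zero mean, unit variance."""
    out = []
    for row in rows:
        c = len(row)
        mean = sum(row) / c
        # Biased variance (divide by C), as layer normalization uses.
        var = sum((v - mean) ** 2 for v in row) / c
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv_std for v in row])
    return out


# A (T, B, C) input is treated as T * B independent rows of length C.
T, B, C = 512, 16, 1024
print(T * B, "rows of length", C)  # 8192 rows of length 1024
```

For the (512, 16, 1024) shape this means 8192 independent reductions over 1024 elements each, which is the workload the timings in this thread compare.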
I also see a performance boost using the FusedLayerNorm for our NLP-based transformer.
I just replaced every LayerNorm with the apex version in a RoBERTa-based model from the Transformers library and ran it on a real dataset with an average sequence length of 200 tokens. In this real-life setup, I can't measure any difference. I also ran the benchmark on the same machine and got:
upstream layernorm 2.132
apex layernorm 2.745
@vgoklani is it a custom transformer or from an OSS library?
I ran the gist with shape (32, 128, 768) which is common in Transformers on V100/CUDA10. What I got:
upstream layernorm 0.136
apex layernorm 0.040
upstream layernorm(half) 0.106
apex layernorm(half) 0.047
After changing the sequence length to 256:
upstream layernorm 0.258
apex layernorm 0.070
upstream layernorm(half) 0.203
apex layernorm(half) 0.045
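To put the (32, 128, 768) and sequence-length-256 timings above in relative terms, the ratios can be worked out with a few lines of Python. The numbers are copied from the output above; the dictionary labels and the `speedup` helper are illustrative names only.

```python
# (upstream, apex) timings as printed above, in the gist's units.
timings = {
    "seq 128, default precision": (0.136, 0.040),
    "seq 128, half": (0.106, 0.047),
    "seq 256, default precision": (0.258, 0.070),
    "seq 256, half": (0.203, 0.045),
}


def speedup(upstream, apex):
    """How many times faster apex is than upstream for one measurement."""
    return upstream / apex


speedups = {name: speedup(u, a) for name, (u, a) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: apex is {ratio:.2f}x faster than upstream")
```

On these numbers apex comes out roughly 3.4x and 2.3x faster at sequence length 128, and roughly 3.7x and 4.5x faster at sequence length 256.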
@pommedeterresautee I suggest you provide your device and CUDA version and that'll be more helpful.
@hitvoice 2080 Ti and apex from the master branch at the time of my previous message, so June 16th.