Apex: FusedLayerNorm vs torch.nn.LayerNorm

Created on 23 Aug 2019 · 13 Comments · Source: NVIDIA/apex

What's the advantage of using the FusedLayerNorm over torch.nn.LayerNorm? I'm running into an issue with using TorchScript and I'm wondering if I can replace the former with the latter.

The deeper question is: Is the apex version of layer norm significantly faster than the standard PyTorch version, or is it simply a legacy from when PyTorch did not have a built-in layer norm?
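For the TorchScript case raised in the question, one possible way to try the replacement is to swap the modules in place and keep the learned parameters. The sketch below is a hedged illustration, not an apex or PyTorch utility: `swap_layer_norm` is a hypothetical helper, and it assumes the fused class exposes the same `normalized_shape`, `eps`, `elementwise_affine`, `weight` and `bias` attributes as torch.nn.LayerNorm.

```python
import torch
from torch import nn


def swap_layer_norm(module: nn.Module, fused_cls: type) -> nn.Module:
    """Recursively replace every `fused_cls` child with torch.nn.LayerNorm.

    Hypothetical helper: assumes `fused_cls` mirrors the attributes and
    state_dict keys (weight, bias) of torch.nn.LayerNorm.
    """
    # Snapshot the children first so the container is not mutated mid-iteration.
    for name, child in list(module.named_children()):
        if isinstance(child, fused_cls):
            affine = getattr(child, "elementwise_affine", True)
            new = nn.LayerNorm(
                tuple(child.normalized_shape),
                eps=child.eps,
                elementwise_affine=affine,
            )
            if affine:
                # Match device and dtype, then copy the learned scale and shift.
                new = new.to(device=child.weight.device, dtype=child.weight.dtype)
                new.load_state_dict(child.state_dict())
            new.train(child.training)
            setattr(module, name, new)
        else:
            swap_layer_norm(child, fused_cls)
    return module


# Usage sketch (requires apex to be installed):
# from apex.normalization import FusedLayerNorm
# model = swap_layer_norm(model, FusedLayerNorm)
# scripted = torch.jit.script(model)
```

Whether the swap costs any speed depends on the PyTorch version, GPU and input shape, so it is worth timing both variants on the actual model.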

All 13 comments

FusedLayerNorm should give a speedup compared to torch.nn.LayerNorm.
Gist for profiling: https://gist.github.com/ptrblck/8b1c6a7efd97604a7dedbf2c3edd1019
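The gist is not reproduced here. As a rough idea of what such a comparison involves, the following is a minimal sketch, assuming a forward and backward pass per iteration and wall-clock timing; `time_layer_norm` is a hypothetical helper and may differ from what the gist actually measures. CUDA kernels launch asynchronously, so the timer is only read after `torch.cuda.synchronize()`.

```python
import time

import torch
from torch import nn


def time_layer_norm(norm: nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    """Return seconds spent on `iters` forward+backward passes (hypothetical helper)."""
    x = x.detach().clone().requires_grad_(True)
    # Warm-up iterations so one-time setup costs are not counted.
    for _ in range(3):
        norm(x).sum().backward()
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        norm(x).sum().backward()
    if x.is_cuda:
        # Wait for all queued kernels to finish before stopping the clock.
        torch.cuda.synchronize()
    return time.perf_counter() - start


# Usage sketch, normalizing over the last dimension:
# x = torch.randn(512, 16, 1024, device="cuda")
# print("upstream layernorm", time_layer_norm(nn.LayerNorm(1024).cuda(), x))
# With apex installed:
# from apex.normalization import FusedLayerNorm
# print("apex layernorm", time_layer_norm(FusedLayerNorm(1024).cuda(), x))
```

Absolute numbers from a harness like this are only comparable on the same GPU, driver and library versions, and the ranking can change with input shape and dtype.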

I'm trying that gist with 10,000 iterations on a V100 and torch.nn.LayerNorm is faster:

upstream layernorm 32.502
apex layernorm 33.823

And the gap is even larger if I convert the input to half and call norm.half() and fused_norm.half():

upstream layernorm 23.555
apex layernorm 31.152

@bryant1410 Thanks for reporting!
I'm not sure where the perf regression comes from.
However, @zasdfgbnm is porting our FusedLayerNorm approach to PyTorch in this PR, which should hopefully land soon.

This is what I get on the DGX station (Tesla V100 GPU) with the master-py3-devel docker:

upstream layernorm 31.495
apex layernorm 32.434
upstream half layernorm 22.503
apex half layernorm 29.342

With pytorch and apex master compiled from source:

upstream layernorm 26.903
apex layernorm 32.299
upstream half layernorm 20.610
apex half layernorm 29.691

Pull request https://github.com/pytorch/pytorch/pull/26201 ("upstream" means layernorm implementation ported from APEX):

upstream layernorm 39.049
apex layernorm 32.504
upstream half layernorm 35.924
apex half layernorm 29.523

Btw, I ran mine with the commit 880ab92, not with master.

Is the fused LN heavily optimized for transformer-like applications? I do get a big speedup for the standard NLP representation (T, B, C) when taking the norm across C.

upstream layernorm 2.036
apex layernorm 0.620
upstream half layernorm 1.470
apex half layernorm 0.473

@ngoyal2707 Which upstream version are you using? The layernorm in upstream has been improved a lot recently.

I observe the same issue as @ngoyal2707 on PyTorch 1.5 -- torch.nn.LayerNorm is slower than apex.FusedLayerNorm for shapes typical in NLP models. For example: (512, 16, 1024) with normalization over the last dimension is slower using torch.nn.LayerNorm.

I also see a performance boost using the FusedLayerNorm for our NLP-based transformer.

I just replaced every LayerNorm with the apex version in a RoBERTa-based model from the Transformers library and ran it on a real dataset with an average sequence length of 200 tokens. In this real-life setup, I can't measure any difference. I also ran the benchmark on the same machine and got:

upstream layernorm 2.132
apex layernorm 2.745

@vgoklani is it a custom transformer or from an OSS library?

I ran the gist with shape (32, 128, 768) which is common in Transformers on V100/CUDA10. What I got:

upstream layernorm 0.136
apex layernorm 0.040
upstream layernorm(half) 0.106
apex layernorm(half) 0.047

After changing the sequence length to 256:

upstream layernorm 0.258
apex layernorm 0.070
upstream layernorm(half) 0.203
apex layernorm(half) 0.045

@pommedeterresautee I suggest you provide your device and CUDA version and that'll be more helpful.

@hitvoice 2080 Ti, with apex from the master branch as of my previous message, so June 16th.
