Apex: FusedLayerNorm vs torch.nn.LayerNorm

Created on 23 Aug 2019 · 13 Comments · Source: NVIDIA/apex

What's the advantage of using the FusedLayerNorm over torch.nn.LayerNorm? I'm running into an issue with using TorchScript and I'm wondering if I can replace the former with the latter.

The deeper question is: Is the apex version of layer norm significantly optimized over the standard PyTorch version, or is it simply a legacy of when PyTorch did not have a built-in layer norm function?

dhpollack
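For readers facing the same TorchScript question, here is a minimal sketch of swapping every FusedLayerNorm in a model for torch.nn.LayerNorm while keeping the learned weights. The helper names `swap_modules` and `fused_to_native_layernorm` are hypothetical. The sketch assumes that FusedLayerNorm exposes `normalized_shape`, `eps`, `elementwise_affine`, `weight` and `bias` the way torch.nn.LayerNorm does. Whether the swap resolves a particular TorchScript error is not confirmed anywhere in this thread.

```python
def swap_modules(model, should_swap, build):
    """Recursively replace child modules for which should_swap(child) is true.

    Relies only on the named_children() API of torch.nn.Module.
    Returns the number of modules that were replaced.
    """
    replaced = 0
    # Snapshot the children first, because setattr mutates the parent.
    for name, child in list(model.named_children()):
        if should_swap(child):
            setattr(model, name, build(child))
            replaced += 1
        else:
            replaced += swap_modules(child, should_swap, build)
    return replaced


def fused_to_native_layernorm(model):
    """Replace apex FusedLayerNorm modules with torch.nn.LayerNorm in place."""
    import torch
    from apex.normalization import FusedLayerNorm  # requires apex to be installed

    def build(old):
        # Assumption: these attributes mirror torch.nn.LayerNorm's.
        new = torch.nn.LayerNorm(
            old.normalized_shape,
            eps=old.eps,
            elementwise_affine=old.elementwise_affine,
        )
        if old.elementwise_affine:
            new = new.to(device=old.weight.device, dtype=old.weight.dtype)
            with torch.no_grad():
                new.weight.copy_(old.weight)
                new.bias.copy_(old.bias)
        return new

    return swap_modules(model, lambda m: isinstance(m, FusedLayerNorm), build)
```

Calling `fused_to_native_layernorm(model)` before `torch.jit.script(model)` would be the intended use. Comparing outputs of the model before and after the swap on a few inputs is a sensible sanity check.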

Most helpful comment

I observe the same issue as @ngoyal2707 on PyTorch 1.5 -- torch.nn.LayerNorm is slower than apex.FusedLayerNorm for shapes typical in NLP models. For example: (512, 16, 1024) with normalization over the last dimension is slower using torch.nn.LayerNorm.

myleott on 2 May 2020

👍4

All 13 comments

FusedLayerNorm should give a speedup compared to torch.nn.LayerNorm.
Gist for profiling: https://gist.github.com/ptrblck/8b1c6a7efd97604a7dedbf2c3edd1019

ptrblck on 2 Sep 2019
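The gist itself is not reproduced on this page. As a rough illustration of this kind of comparison (not the gist's code), the sketch below times a forward and backward pass through each module. The names `time_fn` and `compare_layernorms` are hypothetical, and timing forward plus backward is a choice made for this sketch. The default shape (512, 16, 1024) is the one quoted in the most helpful comment. CUDA kernels launch asynchronously, so the clock is only read after `torch.cuda.synchronize()`.

```python
import time


def time_fn(fn, iters=100, warmup=10, sync=None):
    """Return total wall-clock seconds for `iters` calls of fn().

    `sync` runs right before the clock starts and right before it stops;
    for GPU work pass torch.cuda.synchronize.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return time.perf_counter() - start


def compare_layernorms(shape=(512, 16, 1024), iters=100, half=False):
    """Time torch.nn.LayerNorm and, if it can be built, apex's FusedLayerNorm."""
    import torch  # imported lazily so time_fn itself has no dependencies

    device = "cuda" if torch.cuda.is_available() else "cpu"
    sync = torch.cuda.synchronize if device == "cuda" else None
    dtype = torch.half if half else torch.float  # half is meant for CUDA runs
    x = torch.randn(*shape, device=device, dtype=dtype, requires_grad=True)
    grad = torch.randn_like(x)

    norms = {"upstream layernorm": torch.nn.LayerNorm(shape[-1])}
    try:
        from apex.normalization import FusedLayerNorm
        norms["apex layernorm"] = FusedLayerNorm(shape[-1])
    except ImportError:
        pass  # apex (or its CUDA extension) is missing; time upstream only

    results = {}
    for name, norm in norms.items():
        norm = norm.to(device=device, dtype=dtype)

        def step():
            x.grad = None
            norm(x).backward(grad)

        results[name] = time_fn(step, iters=iters, sync=sync)
    return results
```

Running `compare_layernorms()` and `compare_layernorms(half=True)` on the same machine gives numbers that can be compared with each other, though not necessarily with the figures posted in this thread, since the measurement details may differ.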

I'm trying that gist with 10,000 iterations on a V100 and torch.nn.LayerNorm is faster:

upstream layernorm 32.502
apex layernorm 33.823

And the gap is even larger if I convert the input to half and call norm.half() and fused_norm.half():

upstream layernorm 23.555
apex layernorm 31.152

bryant1410 on 23 Sep 2019

I put it in a gist: https://gist.github.com/bryant1410/d88a42a4b1a3c2989a1db6c79f07e045

bryant1410 on 23 Sep 2019

@bryant1410 Thanks for reporting!
I'm not sure where the perf regression comes from.
However, @zasdfgbnm is porting our FusedLayerNorm approach to PyTorch in this PR, which should hopefully land soon.

ptrblck on 23 Sep 2019

This is what I get on the DGX station (Tesla V100 GPU) with the master-py3-devel docker:

upstream layernorm 31.495
apex layernorm 32.434
upstream half layernorm 22.503
apex half layernorm 29.342

With pytorch and apex master compiled from source:

upstream layernorm 26.903
apex layernorm 32.299
upstream half layernorm 20.610
apex half layernorm 29.691

With pull request https://github.com/pytorch/pytorch/pull/26201 (here "upstream" means the layernorm implementation ported from apex):

upstream layernorm 39.049
apex layernorm 32.504
upstream half layernorm 35.924
apex half layernorm 29.523

zasdfgbnm on 23 Sep 2019

Btw, I ran mine with the commit 880ab92, not with master.

bryant1410 on 24 Sep 2019

Is the fused LN heavily optimized for transformer-like applications? I do get a big speedup for the standard NLP representation (T, B, C) when taking the norm across C.

upstream layernorm 2.036
apex layernorm 0.620
upstream half layernorm 1.470
apex half layernorm 0.473

ngoyal2707 on 19 Dec 2019
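To make "taking the norm across C" concrete: each of the T x B feature vectors of length C is normalized on its own. The pure-Python sketch below is only a reference for the arithmetic, not either library's implementation. It uses the biased variance and the default eps of 1e-5 that the torch.nn.LayerNorm documentation describes, and it leaves out the learnable scale and shift. The function names are hypothetical.

```python
import math


def layer_norm_1d(values, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    n = len(values)
    mean = sum(values) / n
    # Biased variance: divide by n, not n - 1.
    var = sum((v - mean) ** 2 for v in values) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in values]


def layer_norm_tbc(x, eps=1e-5):
    """Apply layer_norm_1d over the last axis of a nested list shaped (T, B, C)."""
    return [[layer_norm_1d(row, eps) for row in batch] for batch in x]


# Tiny example with T=1, B=2, C=3.
example = layer_norm_tbc([[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]])
```

Both rows of `example` come out nearly identical, because layer norm removes each row's own mean and scale.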

@ngoyal2707 Which upstream version are you using? The layernorm in upstream has been improved a lot recently.

zasdfgbnm on 19 Dec 2019

I also see a performance boost using the FusedLayerNorm for our NLP-based transformer.

vgoklani on 30 May 2020

I just replaced every LayerNorm with the apex version in a RoBERTa-based model from the Transformers library and ran it on a real dataset with an average sequence length of 200 tokens. In this real-life setup, I can't measure any difference. I also ran the benchmark on the same machine and got:

upstream layernorm 2.132
apex layernorm 2.745

@vgoklani is it a custom transformer or from an OSS library?

pommedeterresautee on 16 Jun 2020

I ran the gist with shape (32, 128, 768) which is common in Transformers on V100/CUDA10. What I got:

upstream layernorm 0.136
apex layernorm 0.040
upstream layernorm(half) 0.106
apex layernorm(half) 0.047

After changing the sequence length to 256:

upstream layernorm 0.258
apex layernorm 0.070
upstream layernorm(half) 0.203
apex layernorm(half) 0.045

@pommedeterresautee It would be more helpful if you also provided your device and CUDA version.

hitvoice on 16 Jul 2020

@hitvoice 2080 Ti, with apex from the master branch as of my previous message, so June 16th.

pommedeterresautee on 16 Jul 2020
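Since the results in this thread differ by GPU and software version, it may help to post the environment next to any timing. The sketch below gathers the details asked for above; `describe_env` is a hypothetical helper name, and it only reports what PyTorch itself exposes (the apex version is not included).

```python
import platform


def describe_env():
    """Collect version and device details worth posting next to benchmark numbers."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "gpu": None,
    }
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_env().items():
        print(f"{key}: {value}")
```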
