The pytorch/fairseq team improved the memory efficiency of their FP16 optimizer by converting the FP16 parameters to FP32 on the fly instead of keeping a static copy; see https://github.com/pytorch/fairseq/pull/404.
Are there any plans to implement this optimization here?
Thanks!
Based on reading their code I believe it's actually an unsafe application of the mixed precision recipe. They carry out updates in fp32, but then cast back to fp16, which loses all the additional mantissa information of the master params. This additional mantissa information is (at least conceptually) essential to capture small updates that accumulate over time (in other words, updates that are not big enough to be reflected in fp16 for a single step, but eventually get there as they accumulate over multiple steps). Basically, I think what they've done is equivalent to not using master params at all (https://github.com/pytorch/fairseq/pull/404#issuecomment-458640002).
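To illustrate the point about small updates, here is a minimal sketch using only the Python standard library. It is not fairseq's or Apex's code: `to_fp16`, `train`, the starting weight of 1.0, and the update size of 1e-4 are all made up for illustration, and a Python float (double precision) stands in for the fp32 master copy.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def train(steps: int, update: float, keep_master: bool) -> float:
    """Apply the same small update `steps` times and return the fp16 weight.

    keep_master=True : accumulate into a persistent higher-precision copy
                       and refresh the fp16 weight from it each step.
    keep_master=False: upcast the fp16 weight on the fly, add the update,
                       and cast straight back to fp16 (no persistent copy).
    """
    weight_fp16 = to_fp16(1.0)
    master = 1.0  # stand-in for the fp32 master param (assumption: double here)
    for _ in range(steps):
        if keep_master:
            master += update
            weight_fp16 = to_fp16(master)
        else:
            weight_fp16 = to_fp16(weight_fp16 + update)
    return weight_fp16


if __name__ == "__main__":
    print("without master copy:", train(1000, 1e-4, keep_master=False))
    print("with master copy:   ", train(1000, 1e-4, keep_master=True))
```

The spacing between fp16 values just above 1.0 is 2^-10 (about 0.00098), so a single update of 1e-4 rounds away when the result is cast back to fp16. Without a persistent master copy the weight never moves, whereas the master copy accumulates the updates until they become visible in fp16.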
I'm working on a new API that will give the options to flatten params. I suppose I can also give the option to not use fp32 master params, so people can test whether their network actually needs them.
Hello @mcarilli
I am just curious to see if you have comments about some benchmark.
fairseq claims a plain 2.9x gain when using FP16 vs FP32 (from the paper, on 8 V100s).
At OpenNMT (both TF and PyTorch) we see a plain gain ranging from 30% to 50%.
The PyTorch one is based on Apex.
We get some extra gain when using gradient accumulation or other small things, but that still leaves a very big gap compared to the 2.9x from the fairseq paper. It might be that their 8xV100 FP32 baseline is too weak, by the way.
@vince62s, some ideas:
Not sure it's the right place but thanks for jumping in!
We see an increase of 2.8x on a single V100 GPU, going from 8k words-per-second (one V100 32-bit) -> 22k words-per-second (one V100 16-bit)
Is this for the "Transformer big" configuration?
embeddings=512 so yes.
Vocab, we just patched to pad at the next multiple of 8 so yes.
Batch= might be a point. We batch by tokens, but we still need to pad the number of sentences to the next multiple of 8.
No, on single GPU we don't have a 2.8x factor. But our speed is much higher (not sure it's comparable because we work with the base model, 32k vocab, 100 sentence max len)
see here: https://github.com/OpenNMT/OpenNMT-py/pull/1208
However, all these numbers are from before the multiple-of-8 patch for the vocab.
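For reference, the "pad to the next multiple of 8" step mentioned above can be sketched as follows. This is not the actual OpenNMT patch; `pad_to_multiple` is a hypothetical helper name, and the sizes in the usage lines are only examples.

```python
def pad_to_multiple(n: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # Example sizes only: a vocab size and a number of sentences in a batch.
    print(pad_to_multiple(32001))
    print(pad_to_multiple(13))
```

The same helper would apply to both the vocab size and the number of sentences per batch, since FP16 Tensor Core kernels generally want the relevant matrix dimensions to be multiples of 8.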
22k seems too low for the big transformer; the mlperf submission is in the ~35k vicinity. Granted, it has a few tricks that fairseq does not, but I'd eyeball them to be ~20% altogether, so fairseq should be 25-28k (and that's approximately the performance we started with).
@vince62s @guillaumekln biggest and easiest thing if you are not doing it already is flattening parameters. Flattening parameters + using fused optimizer was the biggest part of fp16 speed-up for huggingface bert https://github.com/huggingface/pytorch-pretrained-BERT/pull/116
Following this conversation we implemented FusedAdam in OpenNMT-py.
We trained Base (512) and Medium (768) models without problems on 4 Turing GPUs.
I am now training a big model and seeing "nan" at times, like this:
[2019-02-18 15:10:00,245 INFO] Step 65200/140000; acc: 74.22; ppl: nan; xent: nan; lr: 0.00012; 29798/31767 tok/s; 4006 sec
[2019-02-18 15:11:14,561 INFO] Step 65300/140000; acc: 74.19; ppl: 3.04; xent: 1.11; lr: 0.00012; 30531/32481 tok/s; 4081 sec
[2019-02-18 15:12:29,152 INFO] Step 65400/140000; acc: 74.03; ppl: 3.07; xent: 1.12; lr: 0.00012; 30141/32358 tok/s; 4155 sec
even though I decreased the learning rate by a factor of 2.
Have you experienced such a case?
Thanks.