Tensor2Tensor includes the Adafactor optimizer, which drastically reduces the GPU memory used for optimizer state. I believe it would be helpful for fairseq to support it by default.
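To give a rough sense of where the memory saving comes from, here is a minimal sketch that counts optimizer-state values for a single weight matrix. It assumes Adam keeps full first- and second-moment estimates, and that Adafactor runs without momentum and keeps only per-row and per-column second-moment statistics. The function names and the matrix shape are hypothetical, chosen only for illustration; this is not fairseq or Tensor2Tensor code.

```python
def adam_state_size(rows: int, cols: int) -> int:
    """Adam stores a first- and a second-moment estimate, each the size of the matrix."""
    return 2 * rows * cols


def factored_state_size(rows: int, cols: int) -> int:
    """Factored second moment: one running average per row plus one per column.

    Assumes no momentum (no first-moment buffer) is kept.
    """
    return rows + cols


# Hypothetical weight-matrix shape, for illustration only.
rows, cols = 1024, 4096

adam = adam_state_size(rows, cols)
factored = factored_state_size(rows, cols)

print(f"Adam state values:     {adam}")
print(f"Factored state values: {factored}")
```

The count covers optimizer state only. The weights, gradients, and activations still take the same amount of memory, so the overall saving depends on the model and batch size.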
Working on this
Good idea!
See the Adafactor implementation here: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py