A full amp example would be useful. It would help answer questions like:
Do we need to call ".half()" on the model?
Do we need to call init() before the model is built?
Enabling amp seems to slow training down. Why might this be?
Agreed that an amp example would be useful. I'll post one tomorrow.
When using amp, you do not need to call .half() on the model, but it's recommended to call init() before the model is built.
Enabling amp should be modestly slower than "pure fp16" training, because of the occasional casts to and from fp32 that it performs. Or do you mean that amp is slower than "pure fp32" training? If so, we might have to take a closer look at the model.
Was the example ever posted here?
I'm trying to use Amp with a GAN on a Titan V, and I'm seeing only trivial improvements: about 2-3% in both time and memory usage. I have multiple optimizers and multiple backward passes, but have (I think) followed the instructions for implementing those correctly.
My models are all convnets, and nearly all the layers have input, output, and batch sizes in multiples of 8.
This is on PyTorch 1 stable, CUDA 10, and Ubuntu 18.04.
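For anyone checking the "multiples of 8" condition mentioned above, here is a small sketch that flags layers whose sizes are not divisible by 8, which is the usual requirement for FP16 Tensor Core kernels. The helper name `misaligned_dims` and the layer table are hypothetical and for illustration only; they are not part of Amp or PyTorch.

```python
from typing import Dict, List, Tuple


def misaligned_dims(layers: Dict[str, Tuple[int, ...]], multiple: int = 8) -> List[str]:
    """Return the names of layers with any dimension not divisible by `multiple`."""
    return [name for name, dims in layers.items() if any(d % multiple for d in dims)]


# Hypothetical (in_channels, out_channels) pairs for a small convnet.
layers = {
    "conv1": (3, 64),    # RGB input: 3 is not a multiple of 8
    "conv2": (64, 128),
    "fc": (128, 10),     # 10 output classes: not a multiple of 8
}

print(misaligned_dims(layers))  # → ['conv1', 'fc']
```

The first and last layers are the usual offenders (image channels and class counts), which is consistent with "nearly all" layers qualifying.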
@mcarilli waiting for amp example
Sorry, I forgot about this in the heat of MLPerf. I've posted an example using Amp with ImageNet (examples/main_amp.py). The differences between main.py and main_amp.py should make sense if you vimdiff them, or otherwise compare them side by side. Note that in main_amp.py you do not need to call .half(), or use network_to_half(), on either the model or the input data. Conversions will happen on the fly within patched torch functions.
@jshanna100 In cases like this, the first thing I always suggest is to try "pure half" training without Amp: just call .half() on your model and data, and let it run. It might not be numerically safe, but it will serve as a performance reference point. If "pure half" training achieves minimal speedup over pure FP32 training, then we know the poor speedup isn't Amp's fault. It could be that your model is latency bound or CPU bound, or you could be hitting some calls in cuDNN or cuBLAS that are unfriendly to FP16.
On the other hand, if "pure half" performance is good relative to FP32, but turning on Amp kills the improvement, then we need to examine the operation of Amp itself.
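The comparison described above can be sketched as a small, framework-agnostic timing harness. This is only an illustration: `time_steps`, `diagnose`, and the 1.2x speedup threshold are assumptions made for this sketch, not part of Amp. Each `step` callable is expected to run one full training iteration in the mode being measured (FP32, "pure half", or Amp).

```python
import time
from typing import Callable, Optional


def time_steps(step: Callable[[], None], iters: int = 20, warmup: int = 5,
               sync: Optional[Callable[[], None]] = None) -> float:
    """Return mean seconds per call of `step`, after `warmup` untimed calls."""
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()  # wait for any queued asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()  # wait for the timed work to actually finish
    return (time.perf_counter() - start) / iters


def diagnose(fp32_s: float, half_s: float, amp_s: float,
             min_speedup: float = 1.2) -> str:
    """Classify per-iteration timings; `min_speedup` is an arbitrary cutoff."""
    if fp32_s / half_s < min_speedup:
        # Pure half barely beats FP32, so the poor speedup is not Amp's fault.
        return "model-bound"
    if fp32_s / amp_s < min_speedup:
        # Pure half is fast but Amp is not, so Amp itself needs a closer look.
        return "amp-overhead"
    return "ok"
```

When timing GPU work with PyTorch, pass `sync=torch.cuda.synchronize`, since CUDA kernels launch asynchronously and the wall clock would otherwise stop before the work completes.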
I don't have an Amp GAN example lying around, unfortunately. @carlc-nv might though...
We've added a DCGAN example here: https://github.com/NVIDIA/apex/tree/master/examples/dcgan