I ran the following two scripts and compared their logs.
fp32 training:
python main_fp16_optimizer.py /workspace/data/imagenet
and fp16 mixed precision training:
python main_fp16_optimizer.py /workspace/data/imagenet --fp16
Here are their logs.
fp32 training logs:
Epoch: [0][10/1563] Time 0.211 (0.507) Speed 151.834 (63.162) Data 0.001 (0.075) Loss 7.0819 (7.0585) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
and fp16 mixed precision training logs:
Epoch: [0][10/1563] Time 0.220 (0.530) Speed 145.334 (60.358) Data 0.001 (0.068) Loss 7.1602 (7.0614) Prec@1 0.000 (0.852) Prec@5 0.000 (1.136)
As the logs show, the mixed precision version is not noticeably faster than the fp32 one, so is there anything wrong?
btw, I used a single gpu.
Thanks
What gpu are you using? For those particular examples, I would only expect to see significant speedups on a device with Tensor Cores (Volta or Turing). Other architectures would benefit from the reduced bandwidth requirements of FP16, but the compute won't be faster than FP32 (and for some Pascal cards like the 1080Ti, the compute throughput is actually much slower in FP16).
Thank you. My GPU is a Tesla K40, which does not seem to have Tensor Cores.
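For anyone who wants to check this on their own machine, here is a minimal sketch. The helper names `has_tensor_cores` and `describe_current_gpu` are made up for illustration, and the threshold rests on the assumption that Tensor Cores first appeared with compute capability 7.0 (Volta). The K40 reports 3.5 and the 1080 Ti reports 6.1, so both fall below it.

```python
def has_tensor_cores(major: int, minor: int = 0) -> bool:
    """Return True if a GPU with this compute capability has Tensor Cores.

    Assumption: Tensor Cores exist from compute capability 7.0 (Volta) upward.
    """
    return (major, minor) >= (7, 0)


def describe_current_gpu() -> str:
    """Report whether GPU 0 is expected to speed up FP16 compute.

    PyTorch is imported lazily so that the helper above works without it.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    verdict = "has" if has_tensor_cores(major, minor) else "does not have"
    return f"{name} (compute capability {major}.{minor}) {verdict} Tensor Cores"
```

Calling `print(describe_current_gpu())` before training shows whether `--fp16` is likely to buy compute speed or only memory and bandwidth savings.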
Just leaving this question here as it is kind of related:
I am running training on V100 and Titan V GPUs. Using either amp or FP16_optimizer gives a nice speed boost for 2D convolutions, but does nothing at all for 3D convolutions (or even slows them down). Are 3D convolutions not supported by Tensor Cores? I could not find any information about this anywhere.
What are the ideal settings for 2D convolutions with tensor cores? From the documentation I read that input feature map size, input num channels and output num_channels all must be multiples of 8?
Both input and output channel dimensions must be a multiple of eight. Again as in cuBLAS, the Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the input data must be multiples of eight.
(https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/)
However, here (https://discuss.pytorch.org/t/volta-tensor-core-pytorch/18320), ngimel stated that this must be true for batch size as well?
I think repeating the relevant constraints for using tensor cores in the documentation would be very beneficial for new users.
Best,
Fabian
Unfortunately, your suspicion about 3D convolutions being unsupported by Tensor Cores is correct: https://github.com/NVIDIA/apex/issues/140#issuecomment-460409353. The cuDNN team is aware. I think 3D convolutions were viewed as an experimental/niche thing, but they're becoming important (e.g. for medical imaging), so hopefully they will be Tensor Core-enabled soon. In my brief experience with 3D convolutional networks, the memory consumption tends to be substantial. Mixed precision will certainly help with that aspect even if it (currently) doesn't provide much speedup.
Also, I believe @ngimel's version is correct. For cudnn 7.2 and earlier, you should set the batch size N, number of input channels C, and number of output channels K to multiples of 8 to allow Tensor Core use. This ensures that the dimensions of the resulting lowered GEMM are multiples of 8 (under the hood, the N,C,K%8 == 0 requirement for convolutions actually results from the same hardware constraint as the N,M,O%8 == 0 requirement for an [N, M] x [M, O] explicit GEMM).
For cudnn 7.3 and later, for NCHW convolutions (the default in pytorch), cudnn will pad the internal buffers it creates on the fly, so you no longer have to worry about the N,C,K%8 == 0 requirement. You can check the version Pytorch is using with torch.backends.cudnn.version().
To allow Tensor Cores for explicit [N, M] x [M, O] GEMMs, the requirement that N,M,O%8 == 0 is still present.
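The rules above can be sketched as a small pre-flight check. All three helper names are hypothetical, and `cudnn_pads_internally` assumes the integer returned by torch.backends.cudnn.version() encodes 7.3.1 as 7301, so treat it as an illustration rather than an official test.

```python
def round_up_to_multiple(value: int, multiple: int = 8) -> int:
    """Round a positive dimension up to the next multiple (8 by default)."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


def misaligned_dims(dims: dict, multiple: int = 8) -> dict:
    """Map each dimension that is not a multiple of `multiple` to its padded size."""
    return {
        name: round_up_to_multiple(size, multiple)
        for name, size in dims.items()
        if size % multiple != 0
    }


def cudnn_pads_internally(cudnn_version: int) -> bool:
    """True for cuDNN 7.3 and later, where NCHW convolutions are padded on the fly.

    Assumption: the version integer encodes 7.3.1 as 7301.
    """
    return cudnn_version >= 7300


# Batch size N, input channels C, output channels K of a convolution.
print(misaligned_dims({"N": 16, "C": 30, "K": 64}))  # {'C': 32}
```

For explicit GEMMs the same `misaligned_dims` check applies to N, M and O regardless of the cuDNN version.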
Medical imaging is exactly what I am doing, so getting the reduced VRAM consumption is great by itself already! Of course a little speed bump would be appreciated, too. I am glad to hear Nvidia is working on it!
For cudnn 7.3 and later, for NCHW convolutions (the default in pytorch), cudnn will pad the internal buffers it creates on the fly, so you no longer have to worry about the N,C,K%8 == 0 requirement. You can check the version Pytorch is using with torch.backends.cudnn.version().
I suspected something like this might be going on because I saw quite a substantial speed improvement for 2D convolutions even if this requirement was not met. I was using a U-Net like model that has 30 (30 % 8 != 0) feature maps at the very top and doubles them with every pooling operation. Changing that to 32 improved the speed even further. My guess is that this was faster because it took away the padding requirement?
Anyways, once again thanks!
(You could consider copy pasting all this useful information into the documentation for beginners :-) )
In general, GPUs like contiguous tensors in which the beginning of each fastest-dim row is aligned to at least 32 bytes. The change you made may have helped with that requirement for some ops in the network, so the speedup you observed may have had nothing to do with cuDNN. Then again, it might also have made cuDNN's padding job easier (cuDNN needs to transpose the data at certain points, and inserts padding while it transposes).
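To make the alignment point concrete, here is a rough sketch of the arithmetic. The helper `rows_stay_aligned` is hypothetical, and it assumes a contiguous tensor whose base pointer is already 32-byte aligned. Which dimension is the fastest one depends on the layout and on the op, so this shows only why 32 can behave better than 30 when that dimension happens to be the row length.

```python
FP16_BYTES = 2
FP32_BYTES = 4


def rows_stay_aligned(row_length: int, element_size: int, alignment: int = 32) -> bool:
    """Check that every row of a contiguous tensor starts on an aligned address.

    Assumes the base pointer is aligned; each row then starts at a multiple of
    row_length * element_size bytes, so that product must divide evenly.
    """
    return (row_length * element_size) % alignment == 0


# 30 FP16 values occupy 60 bytes per row, 32 FP16 values occupy 64 bytes.
print(rows_stay_aligned(30, FP16_BYTES))  # False
print(rows_stay_aligned(32, FP16_BYTES))  # True
```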
Alright, I'm going to be updating the documentation substantially anyway for the merge of my "Amp 1.0" release by the end of the month. I'm giving a webinar about that today if you're interested. https://info.nvidia.com/webinar-mixed-precision-with-pytorch-reg-page.html Sorry, I should have remembered to say that earlier. I will post the presentation afterwards.