Apex: Can not use tensor cores

Created on 26 Mar 2019 · 8 Comments · Source: NVIDIA/apex

Hi,
I am on an Ubuntu 16.04 machine with a 2080 Ti, using CUDA 10.0, cuDNN 7.4, Python 3.7, and PyTorch 1.0.1.
I converted the model to use Tensor Cores with the amp module, as described in this example:

https://nvidia.github.io/apex/amp.html

But when I run my Python program under the nvprof profiler, as described here:
https://devtalk.nvidia.com/default/topic/1047165/how-to-confirm-whether-tensor-core-is-working-or-not-/

I get:

No events/metrics were profiled.

which, as stated by the moderator in that thread, should not occur if my Tensor Cores were being used.
Can anyone help me understand why this is happening?
Any help is appreciated.
Thanks

vaibhav0195

Most helpful comment

Convolutions:
For cudnn versions 7.2 and earlier, @vaibhav0195 is correct: input channels, output channels, and batch size should be multiples of 8 to use Tensor Cores. However, this requirement is lifted for cudnn 7.3 and later, so with those versions you don't need to worry about making your channels or batch size multiples of 8 to enable Tensor Core use.

GEMMs (fully connected layers):
For matrix A x matrix B, where A has size [I, J] and B has size [J, K], I, J, and K must be multiples of 8 to use Tensor Cores. This requirement exists for all cublas and cudnn versions. For bare fully connected layers, this means the batch size, input features, and output features must be multiples of 8. For RNNs, you usually need the batch size, hidden size, embedding size, and dictionary size to be multiples of 8 (but not always; it depends on the architecture, for example on what you use for the encoder/decoder).
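
As a rough illustration of the two rules above, here is a minimal pure-Python sketch. The helper names (`pad_to_multiple`, `conv_needs_alignment`, `gemm_can_use_tensor_cores`) are hypothetical and not part of Apex, PyTorch, or cudnn; they only encode the multiple-of-8 conditions described in this comment, and the sizes in the usage lines are made-up examples.

```python
def pad_to_multiple(n, multiple=8):
    """Round n up to the nearest multiple (e.g. to pad a vocabulary size)."""
    return ((n + multiple - 1) // multiple) * multiple


def conv_needs_alignment(cudnn_version):
    """Return True if, per the comment above, convolutions need channels and
    batch size to be multiples of 8. cudnn_version is a (major, minor) tuple;
    the rule applies to 7.2 and earlier and is lifted from 7.3 onward."""
    return tuple(cudnn_version[:2]) <= (7, 2)


def gemm_can_use_tensor_cores(i, j, k):
    """For A of size [I, J] times B of size [J, K], all three dimensions
    must be multiples of 8, regardless of library version."""
    return all(dim % 8 == 0 for dim in (i, j, k))


if __name__ == "__main__":
    # Hypothetical fully connected layer: batch 64, 1024 inputs, 1001 outputs.
    print(gemm_can_use_tensor_cores(64, 1024, 1001))
    # Padding the output features to the next multiple of 8 satisfies the rule.
    padded = pad_to_multiple(1001)
    print(padded, gemm_can_use_tensor_cores(64, 1024, padded))
    # cudnn 7.4 (the version in the question) is past the 7.2 cutoff.
    print(conv_needs_alignment((7, 4)))
```

In practice this usually amounts to picking layer sizes that are already multiples of 8, or padding sizes such as the dictionary size up to the next multiple.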

mcarilli on 29 Mar 2019

👍9 ❤4

All 8 comments

What was the command line you used to run your script under nvprof?

mcarilli on 26 Mar 2019

/usr/local/cuda/bin/nvprof --kernels compute_gemm --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization python myscript.py

vaibhav0195 on 26 Mar 2019

Hi @vaibhav0195, @mcarilli, must every dimension (N, C, H, W) of a tensor be divisible by 8 before we can make use of Tensor Cores?

hellojialee on 29 Mar 2019

@mcarilli I think just the input and output channels of the conv and the batch size should do the trick.

vaibhav0195 on 29 Mar 2019

mcarilli on 29 Mar 2019

👍9 ❤4

@mcarilli Thank you for your clear explanation.

hellojialee on 30 Mar 2019

It may also help to set
torch.backends.cudnn.benchmark=True
at the top of your script, which enables PyTorch's autotuner. Each time PyTorch encounters a new set of convolution parameters, it will test all available cudnn algorithms to find the fastest one, then cache that choice to reuse whenever it encounters the same set of convolution parameters again. The first iteration of your network will be slower as PyTorch tests all the cudnn algorithms for each convolution, but the second and later iterations will likely be faster.
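
To make the "test every algorithm once, then cache the winner" behaviour concrete, here is a toy pure-Python model of it. This is only an illustration of the idea described above, not PyTorch's or cudnn's actual implementation; the class `ToyAutotuner` and the two stand-in "algorithms" are hypothetical.

```python
import time


class ToyAutotuner:
    """Toy model: benchmark all algorithms the first time a configuration
    is seen, then reuse the cached fastest one for that configuration."""

    def __init__(self, algorithms):
        self.algorithms = algorithms   # name -> callable
        self.cache = {}                # config key -> name of fastest algorithm
        self.benchmark_runs = 0        # how many timing runs were performed

    def run(self, config, data):
        key = tuple(sorted(config.items()))
        if key not in self.cache:
            # New parameters: time every candidate once (the slow first pass).
            timings = {}
            for name, fn in self.algorithms.items():
                start = time.perf_counter()
                fn(data)
                timings[name] = time.perf_counter() - start
                self.benchmark_runs += 1
            self.cache[key] = min(timings, key=timings.get)
        # Known parameters: go straight to the cached choice.
        return self.algorithms[self.cache[key]](data)


if __name__ == "__main__":
    # Two stand-in "algorithms" that compute the same result differently.
    algos = {
        "builtin_sum": lambda xs: sum(xs),
        "loop_sum": lambda xs: sum(x for x in xs),
    }
    tuner = ToyAutotuner(algos)
    cfg = {"channels": 64, "kernel": 3}
    tuner.run(cfg, list(range(1000)))   # first call benchmarks both
    tuner.run(cfg, list(range(1000)))   # second call reuses the cache
    print(tuner.benchmark_runs)
```

The model also suggests why the setting tends to help most when input shapes stay the same across iterations: every new set of parameters triggers another round of benchmarking.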

mcarilli on 30 Mar 2019

Hi, thanks for your detailed explanation. Is the setting that enables the autotuner,
torch.backends.cudnn.benchmark=True
specific to Apex? Can we use it in more general cases?
Thanks.