On an RTX 2080ti, there is no speedup from using FP16 for dot products in MXNet (MXNet is 4 times slower than PyTorch).
For ConvNets, there is similarly little or no gain from using FP16 in MXNet (unlike with PyTorch).
MXNet:
import mxnet as mx
import numpy as np
import time
n = 2**14
ctx = mx.gpu(0)
dtype = np.float16
with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)

    tic = time.time()
    for _ in range(100):
        mx.nd.dot(a, b, out=c)
        res = float(c[0, 0].asscalar())  # "use" the result
    print(time.time() - tic)
(Outputs approximately 60)
PyTorch:
import torch
import numpy as np
import time
n = 2**14
dtype = torch.float16
a = torch.zeros((n, n), dtype=dtype).cuda()
b = torch.zeros((n, n), dtype=dtype).cuda()
c = torch.zeros((n, n), dtype=dtype).cuda()
tic = time.time()
with torch.no_grad():
    for _ in range(100):
        torch.matmul(a, b, out=c)
        res = float(c[0, 0])  # "use" the result
print(time.time() - tic)
(Outputs approximately 14)
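A rough back-of-the-envelope check of the achieved throughput (my arithmetic, not something the profiler reports):

n = 2**14
total_flops = 2 * n**3 * 100     # ~8.8e14 FLOPs for 100 (n x n) matmuls

print(total_flops / 60 / 1e12)   # MXNet:   ~14.7 TFLOP/s
print(total_flops / 14 / 1e12)   # PyTorch: ~62.8 TFLOP/s

As far as I know, the 2080ti peaks at roughly 27 TFLOP/s for plain FP16 and well above 100 TFLOP/s with tensor cores, so the MXNet number looks like the non-tensor-core path.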
I suspect that tensor cores are not enabled for this GPU in MXNet.
I tried to figure out if perhaps there is some flag or environment variable that I'm missing, but found nothing.
Nvidia RTX 2080ti
Ubuntu 18.04
CUDA 10.1
PyTorch 1.3.1
MXNet installed with ~/anaconda3/bin/pip install mxnet-cu101mkl
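Regarding the flags mentioned above: the only tensor-core-related environment variables I am aware of are the two below. A minimal sketch of forcing them on before importing MXNet (assuming these are the relevant knobs; MXNET_CUDA_ALLOW_TENSOR_CORE should already default to on, so this mostly just rules it out):

# Sketch: set the tensor-core knobs before importing MXNet.
# The second variable only matters for FP32 inputs, as I understand it.
import os
os.environ['MXNET_CUDA_ALLOW_TENSOR_CORE'] = '1'
os.environ['MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION'] = '1'

import mxnet as mx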
I can replicate the performance gap. Also, I added mx.nd.waitall() to the first script:
import mxnet as mx
import numpy as np
import time
n = 2**14
ctx = mx.gpu(0)
dtype = np.float16
with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)
    mx.nd.waitall()

    tic = time.time()
    for _ in range(100):
        mx.nd.dot(a, b, out=c)
        res = float(c[0, 0].asscalar())  # "use" the result
    print(time.time() - tic)
On one GPU of a P3.16x instance, the MXNet time is 57.40008759498596 s.
The time spent by PyTorch is 8.085056066513062 s.
I tried nvprof and found that MXNet and PyTorch use different kernels:
For MXNet, it's volta_fp16_sgemm_fp16_64x64_nn.
ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16.py
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
==117922== NVPROF is profiling process 117922, command: python3 test_fp16.py
57.4354133605957
==117922== Profiling application: python3 test_fp16.py
==117922== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 57.3739s 100 573.74ms 572.92ms 594.30ms volta_fp16_sgemm_fp16_64x64_nn
0.00% 1.7993ms 3 599.78us 599.42us 600.26us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPN7mshadow4half6half_tEEEEviDpT0_
0.00% 190.78us 100 1.9070us 1.7600us 6.8160us [CUDA memcpy DtoH]
0.00% 19.200us 12 1.6000us 1.5360us 1.9520us [CUDA memcpy HtoD]
0.00% 11.424us 8 1.4280us 1.4080us 1.4720us [CUDA memset]
API calls: 76.78% 57.3844s 203 282.68ms 6.2690us 594.29ms cudaStreamSynchronize
For PyTorch, it's volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn.
ubuntu@ip-172-31-27-255:~$ vi test_fp16_pytorch.py
ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16_pytorch.py
==118113== NVPROF is profiling process 118113, command: python3 test_fp16_pytorch.py
8.097127437591553
==118113== Profiling application: python3 test_fp16_pytorch.py
==118113== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 97.29% 8.08549s 100 80.855ms 80.561ms 93.579ms volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn
2.71% 224.92ms 4 56.231ms 1.9200us 75.214ms [CUDA memcpy HtoD]
0.00% 186.40us 100 1.8640us 1.6640us 3.9680us [CUDA memcpy DtoH]
API calls: 50.26% 8.30841s 103 80.664ms 74.913ms 93.269ms cudaMemcpyAsync
49.40% 8.16635s 6 1.36106s 9.3230us 8.16199s cudaMalloc
0.11% 18.989ms 1528 12.427us 714ns 479.17us cuDeviceGetAttribute
0.11% 17.890ms 16 1.1181ms 1.0814ms 1.1642ms cudaGetDeviceProperties
@ptrendx Do you have any idea?
Hmmm, when I run this in the NVIDIA container, I get the s884 kernel and the time is 8.1. Looking at the dot implementation, it seems that in our version it goes through linalg_gemm, whereas upstream MXNet uses some dot function (which I have not found yet; I assume it is in mshadow?) and I guess it does not set the proper math mode there.
Git blame shows that @DickJC123 apparently changed our version to use linalg_gemm 3 years ago, and for some reason it never got upstreamed.
@DickJC123 Could you make a PR with that? On our side it is commit
commit 46d7fe1d3d482b2d43573ae483bd8403a843fedf
Author: Dick Carter <[email protected]>
Date: Fri Oct 6 13:57:38 2017 -0700
Switched mx.sym.{batched_dot,dot} to use {linalg_batched_gemm,linalg_gemm}.
I believe dot ultimately calls this: https://github.com/apache/incubator-mxnet/blob/master/3rdparty/mshadow/mshadow/dot_engine-inl.h#L516
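One way to test that hypothesis from Python (a sketch; I am assuming mx.nd.linalg.gemm2 is the user-facing op backed by linalg_gemm): run the same FP16 GEMM through both paths and check under nvprof which kernel each one launches (volta_fp16_sgemm_* vs the tensor-core volta_fp16_s884gemm_* kernels).

import time
import mxnet as mx
import numpy as np

n = 2**14
with mx.gpu(0):
    a = mx.nd.zeros((n, n), dtype=np.float16)
    b = mx.nd.zeros((n, n), dtype=np.float16)
    c = mx.nd.zeros((n, n), dtype=np.float16)

    # Time 10 GEMMs through each code path.
    for name, op in [('dot', mx.nd.dot), ('linalg.gemm2', mx.nd.linalg.gemm2)]:
        mx.nd.waitall()
        tic = time.time()
        for _ in range(10):
            op(a, b, out=c)
        mx.nd.waitall()
        print(name, time.time() - tic)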
@oleg-trott About this point "For ConvNets, there is similarly little or no gain when using FP16 in MXNet (Unlike with PyTorch)" - do you have an example? We tested pretty much all of the networks from GluonCV and we do not see that behavior.
@ptrendx
Without multi_precision, the mxnet.optimizer.SGD documentation says it just uses the same precision as the weights.
However, here I see the following per-iteration times (seconds):
FP16 + multi_precision : 0.14
FP16 : 0.21
FP32 : 0.24-0.34
So, not using multi_precision is actually slower with FP16.
import os
os.environ['MXNET_SAFE_ACCUMULATION']='1'
import mxnet as mx
from mxnet import gluon, nd, autograd
from mxnet.gluon.model_zoo import vision
import numpy as np
from time import time
ctx = mx.gpu(0)
m = vision.resnet50_v2(pretrained=True, ctx=ctx)
bs = 32*2
n = 224
with ctx:
    x = nd.random.randn(bs, 3, n, n)
    target = nd.zeros(bs, dtype=np.int32)

    if 1:  # change this: 1 -> FP16, 0 -> FP32
        x = x.astype('float16')
        m.cast('float16')

    loss = gluon.loss.SoftmaxCrossEntropyLoss()

    if 1:  # change this: 1 -> plain FP16 SGD, 0 -> multi_precision
        args = {'learning_rate': 1e-9}
    else:
        args = {'learning_rate': 1e-9, 'multi_precision': True}

    opt = gluon.Trainer(m.collect_params(), 'sgd', args)

    for i in range(100):
        tic = time()
        with autograd.record():
            y = m(x)
            out = loss(y, target)
        out.backward()
        opt.step(batch_size=bs)
        nd.waitall()
        print(time() - tic)
Hmm, I tried your code on both V100 and T4 and could not reproduce your problem:
on V100 I got:
BTW - please use m.hybridize(static_alloc=True, static_shape=True), that gives about 10% speed increase for me in this test (so e.g. V100 time is 0.74 after hybridization).
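For reference, a sketch of where that call would go in your script (everything else unchanged):

m = vision.resnet50_v2(pretrained=True, ctx=ctx)
m.cast('float16')
# Hybridize after casting; static_alloc/static_shape let the cached graph
# reuse memory and skip per-iteration shape inference, which is where the
# ~10% speedup in this test comes from.
m.hybridize(static_alloc=True, static_shape=True)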
@ptrendx
Profiler output for the resnet50_v2 code (as posted above, FP16, but no multi-precision):
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 36.53% 8.03789s 1400 5.7413ms 262.53us 18.796ms _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_14MultiSGDKernelIN7mshadow4half6half_tELb0ELb0EEEJNS0_19MultiSGDKernelParamIS6_S6_EENS_9OpReqTypeEEEEviDpT0_
9.06% 1.99393s 20197 98.723us 2.1440us 850.04us void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
7.33% 1.61244s 5022 321.08us 128.70us 1.1131ms turing_s1688cudnn_fp16_128x128_ldg8_wgrad_idx_exp_interior_nhwc_nt_v1
3.87% 850.75ms 5000 170.15us 17.888us 820.19us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=13, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
3.25% 714.28ms 2400 297.62us 93.055us 845.85us void cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, __half2 const *, cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, __half2 const , cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)
3.18% 700.83ms 2008 349.02us 253.15us 661.25us volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
3.03% 666.57ms 3200 208.30us 58.976us 700.03us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op8identityELi1EEEJPN7mshadow4half6half_tEPKS9_EEEviDpT0_
2.86% 629.54ms 5000 125.91us 11.968us 642.81us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
2.01% 443.40ms 2500 177.36us 57.855us 485.28us void cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>(cudnnTensorStruct, __half2 const *, cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>, cudnnTensorStruct*, float const *, float const , float, float, float*, float const *, float const *, float const *, float, float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnFwPersistentState*, int, float, float, float, int, float, float, cudnnStatus_t*, bool)
2.01% 441.25ms 1507 292.80us 150.34us 406.05us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
1.86% 409.88ms 1600 256.18us 69.536us 761.85us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPN7mshadow4half7half2_tENS_9OpReqTypeES7_S7_EEEviDpT0_
1.81% 397.44ms 1600 248.40us 72.864us 632.19us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPN7mshadow4half6half_tESA_SA_EEEviDpT0_
1.77% 388.79ms 1216 319.73us 274.40us 1.0458ms volta_sgemm_128x64_nt
1.65% 362.15ms 1105 327.74us 278.27us 390.75us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_small_nhwc_tn_v1
1.37% 300.68ms 4222 71.218us 10.208us 640.96us void nhwcToNchwKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
1.14% 250.31ms 1304 191.96us 163.36us 333.57us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_interior_nhwc_tn_v1
0.92% 202.14ms 100 2.0214ms 1.9949ms 2.0394ms void cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>(cudnnTensorStruct, __half const *, cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, cudnnTensorStruct*, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
0.91% 200.70ms 5022 39.963us 3.6480us 626.27us void nhwcToNchwKernel<float, __half, float, bool=1, bool=0>(int, int, int, int, float const *, __half*, float, float)
0.90% 199.08ms 802 248.23us 238.33us 957.47us volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
0.86% 189.25ms 303 624.60us 558.97us 907.48us dgrad_1x1_stride_2x2
FP16 + multi-precision looks very different (notably, the FP16 MultiSGDKernel entry that dominated the profile above at 36.53% of GPU time no longer shows up near the top):
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 11.37% 1.55330s 19997 77.676us 2.1120us 440.13us void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
10.49% 1.43320s 4822 297.22us 128.13us 1.1295ms turing_s1688cudnn_fp16_128x128_ldg8_wgrad_idx_exp_interior_nhwc_nt_v1
6.14% 838.84ms 5000 167.77us 17.887us 817.88us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=13, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
4.90% 670.07ms 2400 279.19us 92.895us 719.13us void cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, __half2 const *, cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, __half2 const , cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)
4.73% 646.86ms 430 1.5043ms 734.72us 4.1373ms void cudnn::detail::wgrad_alg0_engine<__half, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>(int, int, int, __half const *, int, cudnn::detail::wgrad_alg0_engine<__half, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>*, __half const , kernel_grad_params, int, float, int, int, int, int)
4.61% 629.83ms 5000 125.97us 12.063us 626.20us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
4.39% 600.18ms 1608 373.25us 261.12us 666.72us volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
4.38% 597.94ms 3200 186.86us 58.720us 412.00us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op8identityELi1EEEJPN7mshadow4half6half_tEPKS9_EEEviDpT0_
3.25% 443.86ms 2500 177.55us 57.760us 495.61us void cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>(cudnnTensorStruct, __half2 const *, cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>, cudnnTensorStruct*, float const *, float const , float, float, float*, float const *, float const *, float const *, float, float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnFwPersistentState*, int, float, float, float, int, float, float, cudnnStatus_t*, bool)
3.22% 439.58ms 1507 291.69us 149.92us 411.23us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
2.91% 397.03ms 1600 248.15us 72.736us 637.12us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPN7mshadow4half6half_tESA_SA_EEEviDpT0_
2.89% 394.91ms 1600 246.82us 69.727us 565.18us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPN7mshadow4half7half2_tENS_9OpReqTypeES7_S7_EEEviDpT0_
2.63% 358.72ms 1105 324.64us 276.51us 388.96us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_small_nhwc_tn_v1
2.20% 300.11ms 4222 71.081us 10.048us 432.00us void nhwcToNchwKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
1.82% 248.61ms 1304 190.65us 163.55us 329.79us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_interior_nhwc_tn_v1
1.74% 238.11ms 716 332.55us 273.31us 1.0432ms volta_sgemm_128x64_nt
1.48% 202.58ms 100 2.0258ms 1.9909ms 2.0511ms void cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>(cudnnTensorStruct, __half const *, cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, cudnnTensorStruct*, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
1.45% 198.56ms 802 247.58us 238.78us 256.09us volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
1.45% 198.10ms 412 480.82us 261.02us 1.2413ms volta_fp16_scudnn_fp16_128x128_stridedB_interior_nn_v1
1.39% 190.39ms 303 628.36us 556.54us 708.03us dgrad_1x1_stride_2x2
1.28% 175.54ms 301 583.18us 555.87us 597.08us volta_fp16_s884cudnn_fp16_256x64_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
1.16% 158.07ms 302 523.41us 497.89us 634.91us volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
1.15% 157.52ms 4822 32.667us 3.6480us 241.02us void nhwcToNchwKernel<float, __half, float, bool=1, bool=0>(int, int, int, int, float const *, __half*, float, float)
1.10% 150.19ms 502 299.18us 141.79us 431.45us volta_fp16_s884cudnn_fp16_256x64_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
1.09% 148.41ms 1800 82.448us 41.151us 160.13us void cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=7>(float, float, float, float, cudnnTensorStruct, __half2 const *, cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=7>, __half2 const , cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=7>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)
0.97% 132.01ms 332 397.62us 143.10us 7.1807ms volta_cgemm_32x32_tn
0.95% 129.19ms 704 183.51us 124.00us 390.24us turing_fp16_s1688cudnn_fp16_128x128_ldg8_relu_f2f_exp_interior_nhwc_tn_v1
0.83% 113.92ms 101 1.1279ms 1.0902ms 1.5125ms volta_fp16_scudnn_fp16_128x64_relu_medium_nn_v1
0.80% 109.48ms 301 363.71us 354.72us 369.98us turing_fp16_s1688cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc_tn_v1
0.80% 109.22ms 216 505.65us 48.575us 9.9029ms volta_gcgemm_32x32_nt
@DickJC123 @ptrendx Are there any updates on this issue? Would it also affect batched_dot, which is widely used in attention layers?
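A minimal way to check batched_dot directly (a sketch, with made-up attention-style shapes; under nvprof one can see whether a tensor-core s884/s1688 batched GEMM kernel is used):

import time
import mxnet as mx
import numpy as np

batch, seq, dim = 64 * 16, 128, 64   # e.g. (batch * heads, seq_len, head_dim)
with mx.gpu(0):
    q = mx.nd.zeros((batch, seq, dim), dtype=np.float16)
    k = mx.nd.zeros((batch, seq, dim), dtype=np.float16)
    mx.nd.waitall()

    tic = time.time()
    for _ in range(100):
        scores = mx.nd.batch_dot(q, k, transpose_b=True)  # (batch, seq, seq)
    mx.nd.waitall()
    print(time.time() - tic)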
Will take a look at it today.
FP16 training is too slow. Is there any update on this issue?
Updated results in the NGC 20.06 container on a GV100 32GB card:
MXNet: 11.48 s
PyTorch: 11.26 s