On an RTX 2080ti, there is no speedup from using FP16 for dot products in MXNet (MXNet is 4 times slower than PyTorch).
For ConvNets, there is similarly little or no gain from using FP16 in MXNet (unlike with PyTorch).
MXNet:
import mxnet as mx
import numpy as np
import time
n = 2**14
ctx = mx.gpu(0)
dtype = np.float16
with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)

    tic = time.time()
    for _ in range(100):
        mx.nd.dot(a, b, out=c)
        res = float(c[0, 0].asscalar())  # "use" the result
    print(time.time() - tic)
(Outputs approximately 60)
PyTorch:
import torch
import numpy as np
import time
n = 2**14
dtype = torch.float16
a = torch.zeros((n, n), dtype=dtype).cuda()
b = torch.zeros((n, n), dtype=dtype).cuda()
c = torch.zeros((n, n), dtype=dtype).cuda()
tic = time.time()
with torch.no_grad():
    for _ in range(100):
        torch.matmul(a, b, out=c)
        res = float(c[0, 0])  # "use" the result
print(time.time() - tic)
(Outputs approximately 14)
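A rough back-of-the-envelope check of the achieved throughput (my arithmetic, not something the profiler reports):

n = 2**14
total_flops = 2 * n**3 * 100     # ~8.8e14 FLOPs for 100 (n x n) matmuls

print(total_flops / 60 / 1e12)   # MXNet:   ~14.7 TFLOP/s
print(total_flops / 14 / 1e12)   # PyTorch: ~62.8 TFLOP/s

As far as I know, the 2080ti peaks at roughly 27 TFLOP/s for plain FP16 and well above 100 TFLOP/s with tensor cores, so the MXNet number looks like the non-tensor-core path.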
I suspect that tensor cores are not enabled for this GPU in MXNet.
I tried to figure out if perhaps there is some flag or environment variable that I'm missing, but found nothing.
Nvidia RTX 2080ti
Ubuntu 18.04
CUDA 10.1
PyTorch 1.3.1
MXNet installed with ~/anaconda3/bin/pip install mxnet-cu101mkl
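Regarding the flags mentioned above: the only tensor-core-related environment variables I am aware of are the two below. A minimal sketch of forcing them on before importing MXNet (assuming these are the relevant knobs; MXNET_CUDA_ALLOW_TENSOR_CORE should already default to on, so this mostly just rules it out):

# Sketch: set the tensor-core knobs before importing MXNet.
# The second variable only matters for FP32 inputs, as I understand it.
import os
os.environ['MXNET_CUDA_ALLOW_TENSOR_CORE'] = '1'
os.environ['MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION'] = '1'

import mxnet as mx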
I can replicate the performance gap. Also, I added mx.nd.waitall() to the first script:
import mxnet as mx
import numpy as np
import time
n = 2**14
ctx = mx.gpu(0)
dtype = np.float16
with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)
    mx.nd.waitall()

    tic = time.time()
    for _ in range(100):
        mx.nd.dot(a, b, out=c)
        res = float(c[0, 0].asscalar())  # "use" the result
    print(time.time() - tic)
On one GPU of a P3.16x instance, the MXNet time is 57.40008759498596 s.
The time spent by PyTorch is 8.085056066513062 s.
I tried nvprof and found that MXNet and PyTorch use different kernels:
For MXNet, it's volta_fp16_sgemm_fp16_64x64_nn.
ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16.py
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
==117922== NVPROF is profiling process 117922, command: python3 test_fp16.py
57.4354133605957
==117922== Profiling application: python3 test_fp16.py
==117922== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 57.3739s 100 573.74ms 572.92ms 594.30ms volta_fp16_sgemm_fp16_64x64_nn
0.00% 1.7993ms 3 599.78us 599.42us 600.26us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPN7mshadow4half6half_tEEEEviDpT0_
0.00% 190.78us 100 1.9070us 1.7600us 6.8160us [CUDA memcpy DtoH]
0.00% 19.200us 12 1.6000us 1.5360us 1.9520us [CUDA memcpy HtoD]
0.00% 11.424us 8 1.4280us 1.4080us 1.4720us [CUDA memset]
API calls: 76.78% 57.3844s 203 282.68ms 6.2690us 594.29ms cudaStreamSynchronize
For PyTorch, it's volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn.
ubuntu@ip-172-31-27-255:~$ vi test_fp16_pytorch.py
ubuntu@ip-172-31-27-255:~$ sudo /usr/local/cuda/bin/nvprof python3 test_fp16_pytorch.py
==118113== NVPROF is profiling process 118113, command: python3 test_fp16_pytorch.py
8.097127437591553
==118113== Profiling application: python3 test_fp16_pytorch.py
==118113== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 97.29% 8.08549s 100 80.855ms 80.561ms 93.579ms volta_fp16_s884gemm_fp16_256x128_ldg8_f2f_nn
2.71% 224.92ms 4 56.231ms 1.9200us 75.214ms [CUDA memcpy HtoD]
0.00% 186.40us 100 1.8640us 1.6640us 3.9680us [CUDA memcpy DtoH]
API calls: 50.26% 8.30841s 103 80.664ms 74.913ms 93.269ms cudaMemcpyAsync
49.40% 8.16635s 6 1.36106s 9.3230us 8.16199s cudaMalloc
0.11% 18.989ms 1528 12.427us 714ns 479.17us cuDeviceGetAttribute
0.11% 17.890ms 16 1.1181ms 1.0814ms 1.1642ms cudaGetDeviceProperties
@ptrendx Do you have any idea?
Hmmm, when I run this in the NVIDIA container, I get the s884 kernel and the time is 8.1. Looking at the dot implementation, it seems that in our version it goes through linalg_gemm, whereas upstream MXNet uses some dot function (which I have not found yet; I assume it is in mshadow?) and I guess it does not set the proper math mode there.
Git blame shows that @DickJC123 apparently changed our version to use linalg_gemm 3 years ago, and for some reason it never got upstreamed.
@DickJC123 Could you make a PR with that? On our side it is commit
commit 46d7fe1d3d482b2d43573ae483bd8403a843fedf
Author: Dick Carter <[email protected]>
Date: Fri Oct 6 13:57:38 2017 -0700
Switched mx.sym.{batched_dot,dot} to use {linalg_batched_gemm,linalg_gemm}.
I believe dot ultimately calls this: https://github.com/apache/incubator-mxnet/blob/master/3rdparty/mshadow/mshadow/dot_engine-inl.h#L516
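One way to test that hypothesis from Python (a sketch; I am assuming mx.nd.linalg.gemm2 is the user-facing op backed by linalg_gemm): run the same FP16 GEMM through both paths and check under nvprof which kernel each one launches (volta_fp16_sgemm_* vs the tensor-core volta_fp16_s884gemm_* kernels).

import time
import mxnet as mx
import numpy as np

n = 2**14
with mx.gpu(0):
    a = mx.nd.zeros((n, n), dtype=np.float16)
    b = mx.nd.zeros((n, n), dtype=np.float16)
    c = mx.nd.zeros((n, n), dtype=np.float16)

    # Time 10 GEMMs through each code path.
    for name, op in [('dot', mx.nd.dot), ('linalg.gemm2', mx.nd.linalg.gemm2)]:
        mx.nd.waitall()
        tic = time.time()
        for _ in range(10):
            op(a, b, out=c)
        mx.nd.waitall()
        print(name, time.time() - tic)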
@oleg-trott About this point "For ConvNets, there is similarly little or no gain when using FP16 in MXNet (Unlike with PyTorch)" - do you have an example? We tested pretty much all of the networks from GluonCV and we do not see that behavior.
@ptrendx
Without multi_precision, the mxnet.optimizer.SGD documentation says it just uses the same precision as the weights.
However, here I see the following per-iteration times (seconds):
FP16 + multi_precision : 0.14
FP16 : 0.21
FP32 : 0.24-0.34
So, not using multi_precision is actually slower with FP16.
import os
os.environ['MXNET_SAFE_ACCUMULATION']='1'
import mxnet as mx
from mxnet import gluon, nd, autograd
from mxnet.gluon.model_zoo import vision
import numpy as np
from time import time
ctx = mx.gpu(0)
m = vision.resnet50_v2(pretrained=True, ctx=ctx)
bs = 32*2
n = 224
with ctx:
    x = nd.random.randn(bs, 3, n, n)
    target = nd.zeros(bs, dtype=np.int32)

    if 1:  # change this: 1 -> FP16, 0 -> FP32
        x = x.astype('float16')
        m.cast('float16')

    loss = gluon.loss.SoftmaxCrossEntropyLoss()

    if 1:  # change this: 1 -> plain FP16 SGD, 0 -> multi_precision
        args = {'learning_rate': 1e-9}
    else:
        args = {'learning_rate': 1e-9, 'multi_precision': True}

    opt = gluon.Trainer(m.collect_params(), 'sgd', args)

    for i in range(100):
        tic = time()
        with autograd.record():
            y = m(x)
            out = loss(y, target)
        out.backward()
        opt.step(batch_size=bs)
        nd.waitall()
        print(time() - tic)
Hmm, I tried your code on both V100 and T4 and could not reproduce your problem:
on V100 I got:
BTW - please use m.hybridize(static_alloc=True, static_shape=True), that gives about 10% speed increase for me in this test (so e.g. V100 time is 0.74 after hybridization).
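For reference, a sketch of where that call would go in your script (everything else unchanged):

m = vision.resnet50_v2(pretrained=True, ctx=ctx)
m.cast('float16')
# Hybridize after casting; static_alloc/static_shape let the cached graph
# reuse memory and skip per-iteration shape inference, which is where the
# ~10% speedup in this test comes from.
m.hybridize(static_alloc=True, static_shape=True)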
@ptrendx
Profiler output for the resnet50_v2 code (as posted above, FP16, but no multi-precision):
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 36.53% 8.03789s 1400 5.7413ms 262.53us 18.796ms _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_14MultiSGDKernelIN7mshadow4half6half_tELb0ELb0EEEJNS0_19MultiSGDKernelParamIS6_S6_EENS_9OpReqTypeEEEEviDpT0_
9.06% 1.99393s 20197 98.723us 2.1440us 850.04us void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
7.33% 1.61244s 5022 321.08us 128.70us 1.1131ms turing_s1688cudnn_fp16_128x128_ldg8_wgrad_idx_exp_interior_nhwc_nt_v1
3.87% 850.75ms 5000 170.15us 17.888us 820.19us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=13, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
3.25% 714.28ms 2400 297.62us 93.055us 845.85us void cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, __half2 const *, cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, __half2 const , cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)
3.18% 700.83ms 2008 349.02us 253.15us 661.25us volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
3.03% 666.57ms 3200 208.30us 58.976us 700.03us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op8identityELi1EEEJPN7mshadow4half6half_tEPKS9_EEEviDpT0_
2.86% 629.54ms 5000 125.91us 11.968us 642.81us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
2.01% 443.40ms 2500 177.36us 57.855us 485.28us void cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>(cudnnTensorStruct, __half2 const *, cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>, cudnnTensorStruct*, float const *, float const , float, float, float*, float const *, float const *, float const *, float, float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnFwPersistentState*, int, float, float, float, int, float, float, cudnnStatus_t*, bool)
2.01% 441.25ms 1507 292.80us 150.34us 406.05us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
1.86% 409.88ms 1600 256.18us 69.536us 761.85us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPN7mshadow4half7half2_tENS_9OpReqTypeES7_S7_EEEviDpT0_
1.81% 397.44ms 1600 248.40us 72.864us 632.19us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPN7mshadow4half6half_tESA_SA_EEEviDpT0_
1.77% 388.79ms 1216 319.73us 274.40us 1.0458ms volta_sgemm_128x64_nt
1.65% 362.15ms 1105 327.74us 278.27us 390.75us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_small_nhwc_tn_v1
1.37% 300.68ms 4222 71.218us 10.208us 640.96us void nhwcToNchwKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
1.14% 250.31ms 1304 191.96us 163.36us 333.57us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_interior_nhwc_tn_v1
0.92% 202.14ms 100 2.0214ms 1.9949ms 2.0394ms void cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>(cudnnTensorStruct, __half const *, cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, cudnnTensorStruct*, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
0.91% 200.70ms 5022 39.963us 3.6480us 626.27us void nhwcToNchwKernel<float, __half, float, bool=1, bool=0>(int, int, int, int, float const *, __half*, float, float)
0.90% 199.08ms 802 248.23us 238.33us 957.47us volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
0.86% 189.25ms 303 624.60us 558.97us 907.48us dgrad_1x1_stride_2x2
FP16 + multi-precision looks very different (notably, the FP16 MultiSGDKernel entry that dominated the profile above at 36.53% of GPU time no longer shows up near the top):
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 11.37% 1.55330s 19997 77.676us 2.1120us 440.13us void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
10.49% 1.43320s 4822 297.22us 128.13us 1.1295ms turing_s1688cudnn_fp16_128x128_ldg8_wgrad_idx_exp_interior_nhwc_nt_v1
6.14% 838.84ms 5000 167.77us 17.887us 817.88us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=13, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
4.90% 670.07ms 2400 279.19us 92.895us 719.13us void cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, __half2 const *, cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, __half2 const , cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)
4.73% 646.86ms 430 1.5043ms 734.72us 4.1373ms void cudnn::detail::wgrad_alg0_engine<__half, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>(int, int, int, __half const *, int, cudnn::detail::wgrad_alg0_engine<__half, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>*, __half const , kernel_grad_params, int, float, int, int, int, int)
4.61% 629.83ms 5000 125.97us 12.063us 626.20us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
4.39% 600.18ms 1608 373.25us 261.12us 666.72us volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
4.38% 597.94ms 3200 186.86us 58.720us 412.00us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op8identityELi1EEEJPN7mshadow4half6half_tEPKS9_EEEviDpT0_
3.25% 443.86ms 2500 177.55us 57.760us 495.61us void cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>(cudnnTensorStruct, __half2 const *, cudnn::detail::bn_fw_tr_1C11_singleread_fp16<int=512, int=1, int=2, int=20>, cudnnTensorStruct*, float const *, float const , float, float, float*, float const *, float const *, float const *, float, float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnFwPersistentState*, int, float, float, float, int, float, float, cudnnStatus_t*, bool)
3.22% 439.58ms 1507 291.69us 149.92us 411.23us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
2.91% 397.03ms 1600 248.15us 72.736us 637.12us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPN7mshadow4half6half_tESA_SA_EEEviDpT0_
2.89% 394.91ms 1600 246.82us 69.727us 565.18us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPN7mshadow4half7half2_tENS_9OpReqTypeES7_S7_EEEviDpT0_
2.63% 358.72ms 1105 324.64us 276.51us 388.96us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_small_nhwc_tn_v1
2.20% 300.11ms 4222 71.081us 10.048us 432.00us void nhwcToNchwKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
1.82% 248.61ms 1304 190.65us 163.55us 329.79us turing_fp16_s1688cudnn_fp16_256x128_ldg8_relu_f2f_exp_interior_nhwc_tn_v1
1.74% 238.11ms 716 332.55us 273.31us 1.0432ms volta_sgemm_128x64_nt
1.48% 202.58ms 100 2.0258ms 1.9909ms 2.0511ms void cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>(cudnnTensorStruct, __half const *, cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, __half const , cudnn::detail::pooling_bw_kernel_max<__half, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, bool=0>, cudnnTensorStruct*, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
1.45% 198.56ms 802 247.58us 238.78us 256.09us volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
1.45% 198.10ms 412 480.82us 261.02us 1.2413ms volta_fp16_scudnn_fp16_128x128_stridedB_interior_nn_v1
1.39% 190.39ms 303 628.36us 556.54us 708.03us dgrad_1x1_stride_2x2
1.28% 175.54ms 301 583.18us 555.87us 597.08us volta_fp16_s884cudnn_fp16_256x64_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
1.16% 158.07ms 302 523.41us 497.89us 634.91us volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
1.15% 157.52ms 4822 32.667us 3.6480us 241.02us void nhwcToNchwKernel<float, __half, float, bool=1, bool=0>(int, int, int, int, float const *, __half*, float, float)
1.10% 150.19ms 502 299.18us 141.79us 431.45us volta_fp16_s884cudnn_fp16_256x64_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
1.09% 148.41ms 1800 82.448us 41.151us 160.13us void cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=7>(float, float, float, float, cudnnTensorStruct, __half2 const *, cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=7>, __half2 const , cudnn::detail::bn_bw_1C11_singleread_fp16<int=512, int=1, int=2, int=7>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)
0.97% 132.01ms 332 397.62us 143.10us 7.1807ms volta_cgemm_32x32_tn
0.95% 129.19ms 704 183.51us 124.00us 390.24us turing_fp16_s1688cudnn_fp16_128x128_ldg8_relu_f2f_exp_interior_nhwc_tn_v1
0.83% 113.92ms 101 1.1279ms 1.0902ms 1.5125ms volta_fp16_scudnn_fp16_128x64_relu_medium_nn_v1
0.80% 109.48ms 301 363.71us 354.72us 369.98us turing_fp16_s1688cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc_tn_v1
0.80% 109.22ms 216 505.65us 48.575us 9.9029ms volta_gcgemm_32x32_nt
@DickJC123 @ptrendx Are there any updates on this issue? Would it also affect batched_dot, which is widely used in attention layers?
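A minimal way to check batched_dot directly (a sketch, with made-up attention-style shapes; under nvprof one can see whether a tensor-core s884/s1688 batched GEMM kernel is used):

import time
import mxnet as mx
import numpy as np

batch, seq, dim = 64 * 16, 128, 64   # e.g. (batch * heads, seq_len, head_dim)
with mx.gpu(0):
    q = mx.nd.zeros((batch, seq, dim), dtype=np.float16)
    k = mx.nd.zeros((batch, seq, dim), dtype=np.float16)
    mx.nd.waitall()

    tic = time.time()
    for _ in range(100):
        scores = mx.nd.batch_dot(q, k, transpose_b=True)  # (batch, seq, seq)
    mx.nd.waitall()
    print(time.time() - tic)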
Will take a look at it today.
FP16 training is too slow. Is there any update on this issue?
Updated results in the NGC 20.06 container on a GV100 32GB card:
MXNet: 11.48 s
PyTorch: 11.26 s