import mxnet as mx
from mxnet import autograd, np, npx, gluon, init
from mxnet.gluon import nn
import time
npx.set_np()

# Benchmark BatchNorm over the last axis on GPU.
data = mx.np.random.uniform(size=(32, 100, 100), ctx=mx.gpu())
label = mx.np.ones((32, 100, 100), ctx=mx.gpu())

net = nn.Sequential()
net.add(nn.BatchNorm(axis=-1))
net.initialize(init.Xavier(), ctx=mx.gpu())
loss = gluon.loss.L2Loss()

t = time.time()
for _ in range(5000):
    with autograd.record():
        l = loss(net(data), label)
    l.backward()
mx.nd.waitall()  # synchronize before reading the timer
print('spent: {}s'.format(time.time() - t))
MXNet version: static build from branch v1.7.x, commit 75ab15569bd0f20a90806ce2fc38df08be208ed7
I got around 5 seconds with axis=1 and 30 seconds with axis=-1 on a p3.8xlarge (V100).
Both cases normalize the same amount of data (32 * 100 elements per normalized axis), similar to https://github.com/apache/incubator-mxnet/issues/10095
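For reference, the axis=1 run only changes the BatchNorm axis; a minimal sketch of that variant, reusing data, label, and loss from the script above (net_axis1 is just a name picked for the comparison):

# Same benchmark, but normalizing over axis=1,
# which takes the cuDNN/MKL-DNN fast path discussed below.
net_axis1 = nn.Sequential()
net_axis1.add(nn.BatchNorm(axis=1))
net_axis1.initialize(init.Xavier(), ctx=mx.gpu())

t = time.time()
for _ in range(5000):
    with autograd.record():
        l = loss(net_axis1(data), label)
    l.backward()
mx.nd.waitall()
print('axis=1 spent: {}s'.format(time.time() - t))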
Thanks @ptrendx for pointing out that cuDNN 7.4 (https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_7xx.html#rel_741) added a new cudnnBatchNormalization*Ex API that gives much better speed for axis = -1.
The reason is that the MKL-DNN and cuDNN implementations are only used when axis = 1.
The open PR https://github.com/apache/incubator-mxnet/pull/18504 fixes it.
However, we will replace mkldnn_off and cudnn_off attributes with environment variables, so the PR is blocked.
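Until that lands, one possible workaround (my own suggestion, not part of the PR) is to move the normalized axis to position 1 so BatchNorm can use the existing cuDNN/MKL-DNN path, at the cost of two transposes. A sketch, assuming the (32, 100, 100) tensors from the script above:

# Workaround sketch: swap the last axis to position 1 so BatchNorm(axis=1)
# hits the cuDNN/MKL-DNN kernels, then swap back. The extra transposes add
# overhead, so this may or may not be a net win for a given workload.
net_cf = nn.Sequential()
net_cf.add(nn.BatchNorm(axis=1))
net_cf.initialize(init.Xavier(), ctx=mx.gpu())

def batchnorm_last_axis(x):
    x = mx.np.swapaxes(x, 1, 2)   # (N, L, C) -> (N, C, L)
    x = net_cf(x)
    return mx.np.swapaxes(x, 1, 2)  # back to (N, L, C)

with autograd.record():
    out = batchnorm_last_axis(data)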
@wkcn Thanks for your detailed explanation.
So I think there are two phases.
Use cudnnBatchNormalizationForwardTrainingEx for the NHWC case (I checked the source code; we are currently using cudnnBatchNormalizationForwardTraining everywhere). I think the NHWC layout is very important in point cloud algorithms.
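To illustrate why the channel-last layout comes up in point cloud models, here is a hypothetical PointNet-style shared MLP block (the layer sizes and names are only illustrative): per-point features have shape (batch, num_points, channels), so BatchNorm naturally runs on axis=-1.

# Hypothetical point cloud block: features are (batch, num_points, channels),
# so the channel axis is the last one and BatchNorm uses axis=-1.
pc_block = nn.Sequential()
pc_block.add(nn.Dense(64, flatten=False),   # per-point linear layer
             nn.BatchNorm(axis=-1),         # normalize the channel axis
             nn.Activation('relu'))
pc_block.initialize(init.Xavier(), ctx=mx.gpu())

points = mx.np.random.uniform(size=(32, 1024, 3), ctx=mx.gpu())  # xyz coordinates
features = pc_block(points)                 # shape (32, 1024, 64)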
I have verified that the performance is almost the same after the fix https://github.com/apache/incubator-mxnet/pull/18504. Closing the issue.