import time
import mxnet as mx
import numpy as np
import os

tmp = mx.nd.random.uniform(-1, 1, shape=(64, 300000), ctx=mx.gpu())
# tmp = np.random.RandomState().uniform(-1, 1, (64, 300000))

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    max = mx.nd.argmax(tmp, axis=1)
    # max = np.argmax(tmp, axis=1)
    print(max[0])  # block until argmax has completed
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)
We used the code above to test the speed of argmax.
NDArray on CPU takes 0.112895 s.
NDArray on GPU takes 0.280577 s.
numpy.array takes only 0.018021 s.
Why is mx.nd.argmax so slow?
@szha
If you remove the line `print(max[0])`, mx.nd will be faster.
@z01nl1o02
"print max[0]" is used to make sure argmax completed.
We can change it to 'max.wait_to_read()', the result is same.
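For context, MXNet NDArray operations are asynchronous: the Python call only enqueues the work on the engine, so a timing loop without `print(max[0])` or `wait_to_read()` mostly measures enqueue time. A minimal illustration (shapes chosen arbitrarily):

import time
import mxnet as mx

x = mx.nd.random.uniform(-1, 1, shape=(64, 300000), ctx=mx.gpu())
mx.nd.waitall()  # make sure the random initialization itself is not timed

tic = time.time()
idx = mx.nd.argmax(x, axis=1)   # returns immediately; work is only enqueued
print("enqueue only: %.6f s" % (time.time() - tic))

tic = time.time()
idx = mx.nd.argmax(x, axis=1)
idx.wait_to_read()              # blocks until the result is actually computed
print("with sync:    %.6f s" % (time.time() - tic))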
In my experiment, nd.topk is much faster than nd.argmax, but I just want to know why nd.argmax is so slow.
Same issue: nd.argmax() is really slow, close to unusable in real-world scenarios.
Edit: as @puppet101 mentioned, using nd.topk(axis=..., ret_typ='indices') is a lot faster and achieves the same thing. Maybe we should just alias it.
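For reference, a minimal sketch of this workaround (assuming the default float indices from both ops):

import mxnet as mx
from mxnet import nd

x = nd.random.uniform(-1, 1, shape=(64, 300000), ctx=mx.gpu())

idx_argmax = nd.argmax(x, axis=1)                                  # shape (64,)
idx_topk = nd.topk(x, k=1, ret_typ='indices', axis=1).reshape(-1)  # shape (64, 1) -> (64,)

# Both return the index of the per-row maximum; topk just adds a k dimension.
assert (idx_argmax == idx_topk).sum().asscalar() == idx_argmax.size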
It's a bit embarrassing...IIRC, our argmax is done through a sequential reduce with no parallelism...
@solin319 how much faster is topk with k=1?
@solin319 sorry I just realized how late I was on this. I will take a look.
@szha for my use-case topk is approx 100x faster.
The GPU version is also 10 times slower than the CPU one:
import mxnet as mx
from mxnet import nd

data = mx.nd.random.uniform(shape=(1, 384*384), ctx=mx.gpu())

%%time
for i in range(100):
    nd.argmax(data, axis=1).wait_to_read()
CPU times: user 192 ms, sys: 112 ms, total: 304 ms
Wall time: 2.72s
%%time
for i in range(100):
    nd.topk(data, ret_typ='indices', axis=1).wait_to_read()
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 31.9 ms
In this case it's probably easier to just switch the kernel to TopkImpl. @ThomasDelteil would you be interested in taking a stab at it?
We'd be interested in having this change as well. Sockeye also has code that could use argmin but intentionally does not: https://github.com/awslabs/sockeye/blob/master/sockeye/inference.py#L2213
@szha I don't think I have the time in the coming weeks, unfortunately; I'm working against some tight deadlines.
Is argsort affected as well?
I am interested in this, too. It is taking a noticeable portion of time when training BERT: in my case, each invocation takes ~8 ms, while other operators in the network take ~0.1 ms each.
Also, topk with fp16 gives me:
mxnet.base.MXNetError: [06:49:36] src/operator/tensor/./ordering_op-inl.h:551: This operation does not support float16
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f9bc531df1c]
[bt] (1) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f9bc531f298]
[bt] (2) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::TopK<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x270) [0x7f9bc8d22ea0]
[bt] (3) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x10a2) [0x7f9bc7f34e52]
[bt] (4) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x14) [0x7f9bc7f35224]
[bt] (5) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(+0x41de1e5) [0x7f9bc86741e5]
[bt] (6) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8f2) [0x7f9bc866e6b2]
[bt] (7) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x10d) [0x7f9bc868564d]
[bt] (8) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f9bc86858be]
[bt] (9) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f9bc866dcaa]
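A possible workaround for the float16 limitation above, assuming the extra cast is acceptable for your use case, is to upcast to float32 before calling topk:

import mxnet as mx
from mxnet import nd

x = nd.random.uniform(-1, 1, shape=(64, 30000), ctx=mx.gpu(), dtype='float16')

# topk rejects float16 here, so cast to float32 first; the extra copy is usually
# still much cheaper than falling back to the slow argmax kernel.
idx = nd.topk(x.astype('float32'), k=1, ret_typ='indices', axis=1)
idx.wait_to_read()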
Also, both CPU and GPU argmax become slow when the number of categories is large (e.g., vocab size = 30K).
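A quick sketch for checking that scaling claim, timing argmax over increasing category counts (sizes here are illustrative):

import time
import mxnet as mx

ctx = mx.gpu()  # or mx.cpu()
for num_categories in (1000, 10000, 30000, 300000):
    x = mx.nd.random.uniform(-1, 1, shape=(64, num_categories), ctx=ctx)
    mx.nd.waitall()  # make sure data creation is not timed
    tic = time.time()
    for _ in range(10):
        mx.nd.argmax(x, axis=1).wait_to_read()
    print("categories=%d: %.4f s/call" % (num_categories, (time.time() - tic) / 10))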
Even in the latest MXNet 2.0, this performance issue has not been solved:
import time
import mxnet as mx
import numpy as np
import os

tmp = mx.np.random.normal(-1, 1, (64, 300000), ctx=mx.gpu())

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    out = mx.npx.topk(tmp, k=1, ret_typ='indices', axis=1, dtype=np.int32)
    out.wait_to_read()
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)
Output:
used time 0.001522:
used time 0.001141:
used time 0.001141:
used time 0.001094:
used time 0.001081:
used time 0.001080:
used time 0.001086:
used time 0.001089:
used time 0.001075:
used time 0.001085:
used time 0.00108
import time
import mxnet as mx
import numpy as np
import os

tmp = mx.np.random.normal(-1, 1, (64, 300000), ctx=mx.gpu())

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    out = mx.np.argmax(tmp, axis=1)
    out.wait_to_read()
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)
Output:
used time 0.063610:
used time 0.060543:
used time 0.059911:
used time 0.059865:
used time 0.055048:
used time 0.054780:
used time 0.054773:
used time 0.054817:
used time 0.054653:
used time 0.054758:
used time 0.054723:
used time 0.054765:
used time 0.054680:
used time 0.054769:
For comparison, the same benchmark with PyTorch:

import time
import torch
import numpy as np
import os

tmp = torch.normal(-1, 1, (64, 300000)).cuda()

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    out = tmp.argmax().cpu().numpy()
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)

Output:
used time 0.000696:
used time 0.000167:
used time 0.000153:
used time 0.000149:
used time 0.000148:
used time 0.000151:
used time 0.000148:
used time 0.000148:
used time 0.000148:
used time 0.000148:
used time 0.000155:
used time 0.000149:
used time 0.000148:
used time 0.000148:
used time 0.000148:
used time 0.000156:
used time 0.000149:
used time 0.000149:
used time 0.000147:
Actually, the latest profiling numbers show that argmax in MXNet is 362 times slower than in PyTorch...
I think we should try to remove the usage of https://github.com/apache/incubator-mxnet/blob/9de2a48fabf1b8a60eb539640dea4c0637b5522b/src/operator/numpy/np_broadcast_reduce_op.h#L472 and reuse ReduceAxesComputeImpl: https://github.com/apache/incubator-mxnet/blob/9268f89080d84d37aadb295f918515fd6d43436e/src/operator/tensor/broadcast_reduce_op.h#L621-L650
For argmax, we can write a customized Reducer (https://github.com/apache/incubator-mxnet/blob/29d6f273f3f5e893e317b2d17bc2a108eb41b32f/src/operator/mshadow_op.h#L1475-L1533) that returns both the max value and its location.
To provide more context, we can revise our argmax kernel to be similar to https://github.com/apache/incubator-mxnet/blob/9268f89080d84d37aadb295f918515fd6d43436e/src/operator/tensor/broadcast_reduce_op.h#L685-L701
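The real implementation would be an mshadow reducer in C++, but as a rough Python sketch of the idea: the reduction carries (value, index) pairs, so partial results can be merged in any tree order, which is what makes a parallel reduction possible (names below are illustrative, not actual MXNet code):

import numpy as np

def argmax_merge(a, b):
    # Combine two (value, index) partials, keeping the larger value and its
    # original index; ">=" keeps the earlier index on ties, like np.argmax.
    (va, ia), (vb, ib) = a, b
    return (va, ia) if va >= vb else (vb, ib)

def tree_argmax(row):
    # Seed each element with its own index, then merge pairwise until one remains.
    partials = [(v, i) for i, v in enumerate(row)]
    while len(partials) > 1:
        partials = [
            argmax_merge(partials[j], partials[j + 1]) if j + 1 < len(partials) else partials[j]
            for j in range(0, len(partials), 2)
        ]
    return partials[0][1]

row = np.random.uniform(-1, 1, 1000)
assert tree_argmax(row) == int(np.argmax(row))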
@Zha0q1 - Can you please keep this in your TODO list? It looks like an interesting and important optimization.
FYI @mseth10 @josephevans
@Zha0q1 Ping me any time if you are blocked.
Yeah looks like an important fix. Will find time to do it