import time
import mxnet as mx
import numpy as np
import os

tmp = mx.nd.random.uniform(-1, 1, shape=(64, 300000), ctx=mx.gpu())
# tmp = np.random.RandomState().uniform(-1, 1, (64, 300000))

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    max = mx.nd.argmax(tmp, axis=1)
    # max = np.argmax(tmp, axis=1)
    print(max[0])  # block until argmax has completed
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)
We used the code above to test the speed of argmax.
NDArray on CPU takes 0.112895 s.
NDArray on GPU takes 0.280577 s.
numpy.array takes only 0.018021 s.
Why is mx.nd.argmax so slow?
@szha
If you remove the line `print(max[0])`, mx.nd will be faster.
@z01nl1o02
"print max[0]" is used to make sure argmax completed.
We can change it to 'max.wait_to_read()', the result is same.
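For context, MXNet NDArray operations are asynchronous: the Python call only enqueues the work on the engine, so a timing loop without `print(max[0])` or `wait_to_read()` mostly measures enqueue time. A minimal illustration (shapes chosen arbitrarily):

import time
import mxnet as mx

x = mx.nd.random.uniform(-1, 1, shape=(64, 300000), ctx=mx.gpu())
mx.nd.waitall()  # make sure the random initialization itself is not timed

tic = time.time()
idx = mx.nd.argmax(x, axis=1)   # returns immediately; work is only enqueued
print("enqueue only: %.6f s" % (time.time() - tic))

tic = time.time()
idx = mx.nd.argmax(x, axis=1)
idx.wait_to_read()              # blocks until the result is actually computed
print("with sync:    %.6f s" % (time.time() - tic))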
In my experiment, nd.topk is much faster than nd.argmax, but I just want to know why nd.argmax is so slow.
Same issue: nd.argmax() is really slow, close to unusable in real-world scenarios.
Edit: as @puppet101 mentioned, using nd.topk(axis=..., ret_typ='indices') is a lot faster and achieves the same thing. Maybe we should just alias it.
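For reference, a minimal sketch of this workaround (assuming the default float indices from both ops):

import mxnet as mx
from mxnet import nd

x = nd.random.uniform(-1, 1, shape=(64, 300000), ctx=mx.gpu())

idx_argmax = nd.argmax(x, axis=1)                                  # shape (64,)
idx_topk = nd.topk(x, k=1, ret_typ='indices', axis=1).reshape(-1)  # shape (64, 1) -> (64,)

# Both return the index of the per-row maximum; topk just adds a k dimension.
assert (idx_argmax == idx_topk).sum().asscalar() == idx_argmax.size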
It's a bit embarrassing...IIRC, our argmax is done through a sequential reduce with no parallelism...
@solin319 how much faster is topk with k=1?
@solin319 sorry I just realized how late I was on this. I will take a look.
@szha for my use-case topk is approx 100x faster.
The GPU version is also 10 times slower than the CPU one:
import mxnet as mx
from mxnet import nd

data = mx.nd.random.uniform(shape=(1, 384*384), ctx=mx.gpu())

%%time
for i in range(100):
    nd.argmax(data, axis=1).wait_to_read()
CPU times: user 192 ms, sys: 112 ms, total: 304 ms
Wall time: 2.72s
%%time
for i in range(100):
    nd.topk(data, ret_typ='indices', axis=1).wait_to_read()
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 31.9 ms
In this case it's probably easier to just switch the kernel to TopkImpl. @ThomasDelteil would you be interested in taking a stab at it?
We'd be interested in having this change as well. Sockeye also has code that could use argmin but intentionally does not: https://github.com/awslabs/sockeye/blob/master/sockeye/inference.py#L2213
@szha I don't think I have the time in the coming weeks, unfortunately; I'm working against some tight deadlines.
Is argsort affected as well?
I am interested in this, too. It is taking a noticeable portion of time when training BERT: in my case, each invocation takes ~8 ms, while other operators in the network take ~0.1 ms each.
Also, topk with fp16 gives me:
mxnet.base.MXNetError: [06:49:36] src/operator/tensor/./ordering_op-inl.h:551: This operation does not support float16
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f9bc531df1c]
[bt] (1) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f9bc531f298]
[bt] (2) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::TopK<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x270) [0x7f9bc8d22ea0]
[bt] (3) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x10a2) [0x7f9bc7f34e52]
[bt] (4) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x14) [0x7f9bc7f35224]
[bt] (5) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(+0x41de1e5) [0x7f9bc86741e5]
[bt] (6) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8f2) [0x7f9bc866e6b2]
[bt] (7) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x10d) [0x7f9bc868564d]
[bt] (8) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f9bc86858be]
[bt] (9) /home/ubuntu/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f9bc866dcaa]
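A possible workaround for the float16 limitation above, assuming the extra cast is acceptable for your use case, is to upcast to float32 before calling topk:

import mxnet as mx
from mxnet import nd

x = nd.random.uniform(-1, 1, shape=(64, 30000), ctx=mx.gpu(), dtype='float16')

# topk rejects float16 here, so cast to float32 first; the extra copy is usually
# still much cheaper than falling back to the slow argmax kernel.
idx = nd.topk(x.astype('float32'), k=1, ret_typ='indices', axis=1)
idx.wait_to_read()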
Also, both CPU and GPU argmax become slow when the number of categories is large (e.g., vocab size = 30K).
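A quick sketch for checking that scaling claim, timing argmax over increasing category counts (sizes here are illustrative):

import time
import mxnet as mx

ctx = mx.gpu()  # or mx.cpu()
for num_categories in (1000, 10000, 30000, 300000):
    x = mx.nd.random.uniform(-1, 1, shape=(64, num_categories), ctx=ctx)
    mx.nd.waitall()  # make sure data creation is not timed
    tic = time.time()
    for _ in range(10):
        mx.nd.argmax(x, axis=1).wait_to_read()
    print("categories=%d: %.4f s/call" % (num_categories, (time.time() - tic) / 10))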
Even in the latest MXNet 2.0, this performance issue has not been solved:
import time
import mxnet as mx
import numpy as np
import os

tmp = mx.np.random.normal(-1, 1, (64, 300000), ctx=mx.gpu())

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    out = mx.npx.topk(tmp, k=1, ret_typ='indices', axis=1, dtype=np.int32)
    out.wait_to_read()
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)
Output:
used time 0.001522:
used time 0.001141:
used time 0.001141:
used time 0.001094:
used time 0.001081:
used time 0.001080:
used time 0.001086:
used time 0.001089:
used time 0.001075:
used time 0.001085:
used time 0.00108
import time
import mxnet as mx
import numpy as np
import os

tmp = mx.np.random.normal(-1, 1, (64, 300000), ctx=mx.gpu())

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    out = mx.np.argmax(tmp, axis=1)
    out.wait_to_read()
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)
Output:
used time 0.063610:
used time 0.060543:
used time 0.059911:
used time 0.059865:
used time 0.055048:
used time 0.054780:
used time 0.054773:
used time 0.054817:
used time 0.054653:
used time 0.054758:
used time 0.054723:
used time 0.054765:
used time 0.054680:
used time 0.054769:
For comparison, the same benchmark with PyTorch:

import time
import torch
import numpy as np
import os

tmp = torch.normal(-1, 1, (64, 300000)).cuda()

tic = time.time()
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    out = tmp.argmax().cpu().numpy()
    toc = time.time() - tic
    print("used time %f:" % toc)

avg = (end - begin) / 10
print("avg time %f" % avg)

Output:
used time 0.000696:
used time 0.000167:
used time 0.000153:
used time 0.000149:
used time 0.000148:
used time 0.000151:
used time 0.000148:
used time 0.000148:
used time 0.000148:
used time 0.000148:
used time 0.000155:
used time 0.000149:
used time 0.000148:
used time 0.000148:
used time 0.000148:
used time 0.000156:
used time 0.000149:
used time 0.000149:
used time 0.000147:
Actually, the latest profiling numbers show that argmax in MXNet is 362 times slower than in PyTorch...
I think we should try to remove the usage of https://github.com/apache/incubator-mxnet/blob/9de2a48fabf1b8a60eb539640dea4c0637b5522b/src/operator/numpy/np_broadcast_reduce_op.h#L472 and reuse ReduceAxesComputeImpl: https://github.com/apache/incubator-mxnet/blob/9268f89080d84d37aadb295f918515fd6d43436e/src/operator/tensor/broadcast_reduce_op.h#L621-L650
For argmax, we can write a customized Reducer (https://github.com/apache/incubator-mxnet/blob/29d6f273f3f5e893e317b2d17bc2a108eb41b32f/src/operator/mshadow_op.h#L1475-L1533) that returns both the max value and its location.
To provide more context, we can revise our argmax kernel to be similar to https://github.com/apache/incubator-mxnet/blob/9268f89080d84d37aadb295f918515fd6d43436e/src/operator/tensor/broadcast_reduce_op.h#L685-L701
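The real implementation would be an mshadow reducer in C++, but as a rough Python sketch of the idea: the reduction carries (value, index) pairs, so partial results can be merged in any tree order, which is what makes a parallel reduction possible (names below are illustrative, not actual MXNet code):

import numpy as np

def argmax_merge(a, b):
    # Combine two (value, index) partials, keeping the larger value and its
    # original index; ">=" keeps the earlier index on ties, like np.argmax.
    (va, ia), (vb, ib) = a, b
    return (va, ia) if va >= vb else (vb, ib)

def tree_argmax(row):
    # Seed each element with its own index, then merge pairwise until one remains.
    partials = [(v, i) for i, v in enumerate(row)]
    while len(partials) > 1:
        partials = [
            argmax_merge(partials[j], partials[j + 1]) if j + 1 < len(partials) else partials[j]
            for j in range(0, len(partials), 2)
        ]
    return partials[0][1]

row = np.random.uniform(-1, 1, 1000)
assert tree_argmax(row) == int(np.argmax(row))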
@Zha0q1 - Can you please keep this in your TODO list? It looks like an interesting and important optimization.
FYI @mseth10 @josephevans
@Zha0q1 Ping me any time if you are blocked.
Yeah looks like an important fix. Will find time to do it