I want to get the network output as a numpy array, which can be done with the NDArray.asnumpy() method. But I found this very slow, since it copies all the elements: the computation takes only a few ms, while the NDArray.asnumpy() call takes 100+ ms. How can I obtain the numpy array more efficiently?
I also tried to manipulate the mxnet NDArray directly, using slice() or slice_axis(), since "multi-dimension indexing is not supported". But both methods only support slicing a contiguous region, which is a big problem for me.
BTW, I run my program on CPU and the arrays are of size ~2000x6.
NDArray computations are async: when you run a = b + c etc., the op is issued to the engine and returns immediately. .asnumpy() is waiting on that computation to be done. You can see this by calling NDArray.wait_to_read() before .asnumpy().
@piiswrong Thanks! My bad. I didn't realize that the computation was actually still running. I understand it now.