I am running example/rcnn/demo.py. It succeeded on GPU. Then I tried to run the whole program on CPU, but it halted at this line in rcnn/detector.py:
```
scores = executor.output_dict['cls_prob_reshape_output'].asnumpy()[0]
```
Then I modified the line to:
```
scores_raw = executor.output_dict['cls_prob_reshape_output']
scores = scores_raw.asnumpy()[0]
```
It still halted at `scores_raw.asnumpy()`.
I made a sample test as follows:
```
import mxnet as mx

mx_x = mx.nd.ones((1, 300, 21))
np_x = mx_x.asnumpy()[0]
```
But this succeeds. So what is happening? Does asnumpy() have a bug?
By debugging into the code, I found that it halted at the following call in ndarray.py:
```
check_call(_LIB.MXNDArraySyncCopyToCPU(
    self.handle,
    data.ctypes.data_as(ctypes.c_void_p),
    ctypes.c_size_t(data.size)))
```
I hit the same issue as https://github.com/dmlc/mxnet/issues/3684, also at _LIB.MXNDArraySyncCopyToCPU.
rcnn is a big model; you need to wait longer.
The copy size is only about 300 × 21 × 4 ≈ 24 KB. The GPU version is quick; for the CPU version, I waited more than half an hour...
@precedenceguo
asnumpy() waits on the engine for the actual computation to finish.
Is your machine conducting heavy computation while stuck at this line? If so, please use mxnet-notebook/predict_with_pretrained_model to test an image and see how long that takes.
Thank you. I checked the CPU status and its utilization is almost 0. I will try predict_with_pretrained_model.
@javelinjs I guess the toArray method in Scala is the same as asnumpy in Python.
I moved the context from mx.cpu(0) to mx.cpu(1) and the problem was solved. It seems the bug is related to thread synchronization.
@nopattern Have you tried calling predict with multiple threads?
@zihaolucky How do I do that?
@nopattern Use Python multi-threading and call model.predict and asnumpy there. Or you could use waitToRead together with asnumpy in a single thread.
I also met this problem.
I tried to bind two executors with partially shared parameters. If I use exe1.outputs[0].asnumpy() in the training iteration (to inspect whether there is any problem with the training setup), the training process freezes. If I comment out all the asnumpy() calls, training goes smoothly. If I bind without sharing any parameters, then there is no problem with the asnumpy() call.
I thought this issue was fixed in collection #4713 and PR #4528.
@zihaolucky Hi, how did you resolve this problem? I ran into the same situation. Each time I call NDArray.toArray in Scala, it blocks for at least 10 seconds on a CPU-only CentOS server; however, on my MacBook (CPU only), the time cost is barely noticeable!
@maxenceliu Unfortunately, I haven't resolved this.
@zihaolucky So, did you give up on deploying MXNet on the server in the end and use another platform?
@maxenceliu I used the Scala package in production for a while. With version 0.7 it works okay. Or maybe you should try the naive engine.
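For reference, a minimal sketch of selecting the naive engine via MXNET_ENGINE_TYPE (a documented MXNet environment variable; it must be set before mxnet is imported, and whether it avoids this particular hang is only an assumption):

```python
import os

# The naive engine runs every operation synchronously on the calling
# thread, which rules out dependency-engine deadlocks when debugging.
# The variable must be set before `import mxnet`.
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

# import mxnet as mx  # import only after the variable is set
```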
@zihaolucky Now I use version 1.10, and the problem appears again! What do you mean by the naive engine?
@zihaolucky Actually, pre-0.9 versions work okay. It seems there has been a deadlock since the NNVM refactor.
@cloudhan Have you tried version 0.10.1? It still blocks for several seconds. It calls WaitToRead() when copying data to the CPU. I don't understand whether all the dependencies need to be waited on for this copy.
@zihaolucky Has the naive engine not been implemented yet?
@maxenceliu nah, I lost the script and forgot how to reproduce...
I found that I am using a Docker environment on the server. I'm not sure whether Docker could cause this problem.
I think I have resolved this problem, but I still don't know why: #7417
@maxenceliu
Could you list your system environment, including the MXNet version you're using (you mentioned 1.10 and 0.10.1, but neither is a valid version number) or the git commit hash?
Also, please provide the shortest snippet that can reproduce the problem. I'll look into it.
I have met the same problem. It hangs in asnumpy, at the same `ctypes.c_size_t(data.size)))` call into MXNDArraySyncCopyToCPU.
How about using as_in_context(mx.cpu()), like this:
```
executor.output_dict['cls_prob_reshape_output'].as_in_context(mx.cpu()).asnumpy()
```