ray 🚀 - async training with apache MXNet

Hi @ehsanmok, do you mind sharing some source code? It would help us understand what object isn't serializing correctly.

richardliaw on 13 Dec 2017

@richardliaw here's the relevant summarized code:

EDITED! will update soon.

ehsanmok on 13 Dec 2017

👍1

I just tried

import mxnet as mx
import ray

ray.init()

x = mx.nd.array([[1, 2, 3], [4, 5, 6]])

Then

>>> x_id = ray.put(x)
WARNING: Falling back to serializing objects of type <class 'mxnet.ndarray.ndarray.NDArray'> by using pickle. This may be inefficient.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/Users/rkn/anaconda3/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 51, in __del__
    check_call(_LIB.MXNDArrayFree(self.handle))
AttributeError: handle

>>> ray.get(x_id)
[[ 1.  2.  3.]
 [ 4.  5.  6.]]
<NDArray 2x3 @cpu(0)>

>>> x
[[ 1.  2.  3.]
 [ 4.  5.  6.]]
<NDArray 2x3 @cpu(0)>

so serialization of mxnet tensors gives a warning, but seems to behave correctly.

The application seems to be crashing because of the error

TypeError: can't pickle _thread.RLock objects

can you see which line is generating that? Do you have a remote function or actor which closes over an mxnet neural net?

robertnishihara on 13 Dec 2017

@robertnishihara thanks for your reply! sorry, I got caught up fixing it! I think the problem is about gradients updates (which are of type mx.nd.NDArray) in parameter server (actor) and since I'm using an older version of MXNet, getting the right params is a little headache! I'll update on this.

ehsanmok on 13 Dec 2017

@robertnishihara just out of curiosity, do you think you can leverage ONNX in ray? https://github.com/onnx/onnx

ehsanmok on 13 Dec 2017

Yes, Ray and ONNX are completely compatible. ONNX should make it easier to support more neural net frameworks.

robertnishihara on 13 Dec 2017

👍2 🎉1

@richardliaw @robertnishihara I found the bug was in the way MXNet returns the intermediate gradients in python via gradients = module._exec_group.grad_arrays which is a List[List[mx.nd.NDArray]] (long live static typed languages!) so the problem was in push method of ParameterServer and I changed the add to (inplace element-wise add with) self.weights[key] += value[0] and it works now but as you also showed above there're serialization warnings

WARNING: Serializing objects of type <class 'ray.signature.FunctionSignature'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type <class 'mxnet.io.NDArrayIter'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases

Do you have any suggestions for better serializations of this sort?

ehsanmok on 13 Dec 2017

I think those warnings are harmless in this case, but it would make sense to provide native support for mx.NDArrays later on.

richardliaw on 13 Dec 2017

👍3

@richardliaw @robertnishihara actually I ran my async training test locally with 5 workers on CPU built version of MXNet (USE_CUDA=0 USE_CUDNN=0) (no suspect of my current reported this issue) to train lenet on mnist and after a while it crashes with no particular reason. Here's the complete logs:

Waiting for redis server at 127.0.0.1:56356 to respond...
Waiting for redis server at 127.0.0.1:38384 to respond...
Allowing the Plasma store to use up to 63.3832GB of memory.
Starting object store with directory /dev/shm and huge page support disabled
Starting local scheduler with 20 CPUs, 0 GPUs

======================================================================
View the web UI at http://localhost:8907/notebooks/ray_ui23072.ipynb?token=a0443352db2b82077d0d1aca166628e767e9d287a11e986b
======================================================================

WARNING: Falling back to serializing objects of type <class 'mxnet.ndarray.NDArray'> by using pickle. This may be inefficient.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/ilab/anaconda3/envs/bmxnet-cpu/lib/python3.6/site-packages/mxnet-0.10.1-py3.6-linux-x86_64.egg/mxnet/_ctypes/ndarray.py", line 36, in __del__
    check_call(_LIB.MXNDArrayFree(self.handle))
AttributeError: handle
WARNING: Serializing objects of type <class 'ray.signature.FunctionSignature'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type <class 'mxnet.io.NDArrayIter'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
INFO:root:Epoch[0] Batch [10]   Speed: 200.88 samples/sec       accuracy=0.110909
INFO:root:Epoch[0] Batch [20]   Speed: 188.80 samples/sec       accuracy=0.104000
INFO:root:Epoch[0] Batch [30]   Speed: 177.89 samples/sec       accuracy=0.123000
INFO:root:Epoch[0] Batch [40]   Speed: 172.48 samples/sec       accuracy=0.111000
INFO:root:Epoch[0] Batch [50]   Speed: 182.57 samples/sec       accuracy=0.096000
INFO:root:Epoch[0] Batch [60]   Speed: 166.09 samples/sec       accuracy=0.104000
INFO:root:Epoch[0] Batch [70]   Speed: 168.58 samples/sec       accuracy=0.110000
INFO:root:Epoch[0] Batch [80]   Speed: 173.67 samples/sec       accuracy=0.105000
INFO:root:Epoch[0] Batch [90]   Speed: 186.25 samples/sec       accuracy=0.107000
INFO:root:Epoch[0] Batch [100]  Speed: 182.93 samples/sec       accuracy=0.130000
INFO:root:Epoch[0] Batch [110]  Speed: 177.66 samples/sec       accuracy=0.118000
INFO:root:Epoch[0] Batch [120]  Speed: 170.34 samples/sec       accuracy=0.116000
INFO:root:Epoch[0] Batch [130]  Speed: 181.30 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [140]  Speed: 176.14 samples/sec       accuracy=0.119000
INFO:root:Epoch[0] Batch [150]  Speed: 181.17 samples/sec       accuracy=0.103000
INFO:root:Epoch[0] Batch [160]  Speed: 181.85 samples/sec       accuracy=0.129000
INFO:root:Epoch[0] Batch [170]  Speed: 175.28 samples/sec       accuracy=0.110000
INFO:root:Epoch[0] Batch [180]  Speed: 175.60 samples/sec       accuracy=0.113000
INFO:root:Epoch[0] Batch [190]  Speed: 177.59 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [200]  Speed: 182.92 samples/sec       accuracy=0.129000
INFO:root:Epoch[0] Batch [210]  Speed: 186.09 samples/sec       accuracy=0.130000
INFO:root:Epoch[0] Batch [220]  Speed: 170.58 samples/sec       accuracy=0.125000
INFO:root:Epoch[0] Batch [230]  Speed: 182.92 samples/sec       accuracy=0.123000
INFO:root:Epoch[0] Batch [240]  Speed: 187.25 samples/sec       accuracy=0.112000
INFO:root:Epoch[0] Batch [250]  Speed: 180.10 samples/sec       accuracy=0.122000
INFO:root:Epoch[0] Batch [260]  Speed: 180.23 samples/sec       accuracy=0.095000
INFO:root:Epoch[0] Batch [270]  Speed: 176.00 samples/sec       accuracy=0.109000
INFO:root:Epoch[0] Batch [280]  Speed: 186.35 samples/sec       accuracy=0.125000
INFO:root:Epoch[0] Batch [290]  Speed: 189.34 samples/sec       accuracy=0.104000
INFO:root:Epoch[0] Batch [300]  Speed: 186.79 samples/sec       accuracy=0.116000
INFO:root:Epoch[0] Batch [310]  Speed: 176.49 samples/sec       accuracy=0.099000
INFO:root:Epoch[0] Batch [320]  Speed: 162.21 samples/sec       accuracy=0.106000
INFO:root:Epoch[0] Batch [330]  Speed: 160.42 samples/sec       accuracy=0.106000
INFO:root:Epoch[0] Batch [340]  Speed: 174.17 samples/sec       accuracy=0.105000
INFO:root:Epoch[0] Batch [350]  Speed: 168.97 samples/sec       accuracy=0.098000
INFO:root:Epoch[0] Batch [360]  Speed: 190.36 samples/sec       accuracy=0.121000
INFO:root:Epoch[0] Batch [370]  Speed: 184.14 samples/sec       accuracy=0.101000
INFO:root:Epoch[0] Batch [380]  Speed: 186.22 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [390]  Speed: 208.58 samples/sec       accuracy=0.120000
INFO:root:Epoch[0] Batch [400]  Speed: 182.48 samples/sec       accuracy=0.108000
INFO:root:Epoch[0] Batch [410]  Speed: 180.31 samples/sec       accuracy=0.108000
INFO:root:Epoch[0] Batch [420]  Speed: 175.86 samples/sec       accuracy=0.117000
INFO:root:Epoch[0] Batch [430]  Speed: 167.76 samples/sec       accuracy=0.123000
INFO:root:Epoch[0] Batch [440]  Speed: 183.33 samples/sec       accuracy=0.112000
INFO:root:Epoch[0] Batch [450]  Speed: 173.45 samples/sec       accuracy=0.120000
INFO:root:Epoch[0] Batch [460]  Speed: 200.67 samples/sec       accuracy=0.095000
INFO:root:Epoch[0] Batch [470]  Speed: 180.69 samples/sec       accuracy=0.126000
INFO:root:Epoch[0] Batch [480]  Speed: 180.14 samples/sec       accuracy=0.105000
INFO:root:Epoch[0] Batch [490]  Speed: 187.03 samples/sec       accuracy=0.100000
INFO:root:Epoch[0] Batch [500]  Speed: 179.88 samples/sec       accuracy=0.111000
INFO:root:Epoch[0] Batch [510]  Speed: 172.18 samples/sec       accuracy=0.103000
INFO:root:Epoch[0] Batch [520]  Speed: 172.94 samples/sec       accuracy=0.119000
INFO:root:Epoch[0] Batch [530]  Speed: 181.15 samples/sec       accuracy=0.116000
INFO:root:Epoch[0] Batch [540]  Speed: 185.95 samples/sec       accuracy=0.109000
INFO:root:Epoch[0] Batch [550]  Speed: 181.31 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [560]  Speed: 194.81 samples/sec       accuracy=0.108000
INFO:root:Epoch[0] Batch [570]  Speed: 176.30 samples/sec       accuracy=0.119000
INFO:root:Epoch[0] Batch [580]  Speed: 199.54 samples/sec       accuracy=0.103000
INFO:root:Epoch[0] Batch [590]  Speed: 194.18 samples/sec       accuracy=0.115000
Disconnecting client on fd 9
Disconnecting client on fd 28
Disconnecting client on fd 24
Disconnecting client on fd 23
Disconnecting client on fd 29
Disconnecting client on fd 22
Disconnecting client on fd 21
Disconnecting client on fd 20
Disconnecting client on fd 19
Disconnecting client on fd 18
Disconnecting client on fd 17
Disconnecting client on fd 16
Disconnecting client on fd 15
Disconnecting client on fd 14
Disconnecting client on fd 13
Disconnecting client on fd 12
Disconnecting client on fd 25
[WARN] (/ray/src/local_scheduler/local_scheduler_client.cc:112) Exiting because local scheduler closed connection.
[WARN] (/ray/src/global_scheduler/global_scheduler.cc:404) Missed too many heartbeats from local scheduler, marking as dead.

ehsanmok on 13 Dec 2017

@richardliaw @robertnishihara sorry guys, this turns our not to be an issue. It's just at the end of epoch it does that. It'd be better to end things more gracefully though (perhaps with contextmanager!) I tested this on mxnet-cpu and mxnet-gpu and it works fine.

ehsanmok on 13 Dec 2017

You're referring to all of the logging statements

Disconnecting client on fd 9
Disconnecting client on fd 28
Disconnecting client on fd 24
Disconnecting client on fd 23
Disconnecting client on fd 29
Disconnecting client on fd 22
Disconnecting client on fd 21
Disconnecting client on fd 20
Disconnecting client on fd 19
Disconnecting client on fd 18
Disconnecting client on fd 17
Disconnecting client on fd 16
Disconnecting client on fd 15
Disconnecting client on fd 14
Disconnecting client on fd 13
Disconnecting client on fd 12
Disconnecting client on fd 25
[WARN] (/ray/src/local_scheduler/local_scheduler_client.cc:112) Exiting because local scheduler closed connection.
[WARN] (/ray/src/global_scheduler/global_scheduler.cc:404) Missed too many heartbeats from local scheduler, marking as dead.

that happen when the driver exits? You're right we should get rid of that.

robertnishihara on 14 Dec 2017

@ehsanmok I created #1324 to keep track of that.

robertnishihara on 14 Dec 2017

👍1

@robertnishihara yes, exactly! More graceful exist would prevent confusion. I confused the above with driver failure.

ehsanmok on 15 Dec 2017

👍1

Closing because I believe this has been resolved (aside from the cleaner warning messages). Please reopen if there are still issues!

robertnishihara on 21 Dec 2017

Ray: async training with apache MXNet

System information

Describe the problem

Source code / logs

View the web UI at http://localhost:8889/notebooks/ray_ui53455.ipynb?token=06b00 e96c38fa8c78dbf8818f24c75badd834d46e0b7d7f4

Most helpful comment

All 14 comments

Related issues