Ray: async training with Apache MXNet

Created on 13 Dec 2017 · 14 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): pip install ray
  • Ray version: '0.3.0'
  • Python version: Python 3.6.3 | Anaconda, Inc

Describe the problem

Following ray/examples/parameter_server/async_parameter_server.py, I am trying to use Ray for distributed training with Apache MXNet, without relying on MXNet's kvstore functionality. I have essentially translated your ray/examples/parameter_server/model.py to the MXNet API.

When I run my code (even with a single worker) I get the serialization error shown under "Source code / logs" below. Do you know of any workaround for this?
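For context, the structure I am following looks roughly like this. It is a simplified sketch of the async parameter server pattern from the Ray example, not my actual script; the keys, shapes, and fake gradients are placeholders standing in for my MXNet module code.

import numpy as np
import ray

ray.init()

@ray.remote
class ParameterServer(object):
    def __init__(self, keys, values):
        # Keep the weights as plain numpy arrays inside the actor.
        self.weights = dict(zip(keys, [v.copy() for v in values]))

    def push(self, keys, gradients):
        for key, grad in zip(keys, gradients):
            self.weights[key] += grad

    def pull(self, keys):
        return [self.weights[key] for key in keys]

@ray.remote
def worker_task(ps, keys, num_iters):
    for _ in range(num_iters):
        weights = ray.get(ps.pull.remote(keys))
        # In my real script the MXNet module computes gradients here;
        # fake gradients keep this sketch self-contained.
        gradients = [-0.01 * w for w in weights]
        ps.push.remote(keys, gradients)
    return True

keys = ["conv1_weight", "fc1_weight"]
values = [np.zeros((32, 1, 3, 3)), np.zeros((10, 128))]
ps = ParameterServer.remote(keys, values)
ray.get([worker_task.remote(ps, keys, 10) for _ in range(4)])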

Source code / logs

/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Waiting for redis server at 127.0.0.1:55905 to respond...
Waiting for redis server at 127.0.0.1:13781 to respond...
Allowing the Plasma store to use up to 63.3832GB of memory.
Starting object store with directory /dev/shm and huge page support disabled
Starting local scheduler with 20 CPUs, 0 GPUs

======================================================================

View the web UI at http://localhost:8889/notebooks/ray_ui53455.ipynb?token=06b00e96c38fa8c78dbf8818f24c75badd834d46e0b7d7f4

INFO:root:train-labels-idx1-ubyte.gz exists, skipping download
INFO:root:train-images-idx3-ubyte.gz exists, skipping download
INFO:root:t10k-labels-idx1-ubyte.gz exists, skipping download
INFO:root:t10k-images-idx3-ubyte.gz exists, skipping download
WARNING: Falling back to serializing objects of type <class 'mxnet.ndarray.NDArray'> by using pickle. This may be inefficient.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/mxnet-0.10. 1-py3.6-linux-x86_64.egg/mxnet/_ctypes/ndarray.py", line 36, in __del__
check_call(_LIB.MXNDArrayFree(self.handle))
AttributeError: handle
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Falling back to serializing objects of type <class 'mxnet.symbol.Symbol'> by using pickle. This may be inefficient.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/mxnet-0.10.1-py3.6-linux-x86_64.egg/mxnet/_ctypes/symbol.py", line 29, in __del__
check_call(_LIB.NNSymbolFree(self.handle))
AttributeError: handle
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Falling back to serializing objects of type by using pickle. This may be inefficient.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Falling back to serializing objects of type by using pickle. This may be inefficient.
Traceback (most recent call last):
File "cnn_mnist.py", line 158, in
main(args)
File "cnn_mnist.py", line 112, in main
for _ in range(args.num_workers)]
File "cnn_mnist.py", line 112, in
for _ in range(args.num_workers)]
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/ray/worker.py", line 2451, in func_call
objectids = _submit_task(function_id, args)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/ray/worker.py", line 2300, in _submit_task
return worker.submit_task(function_id, args)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/ray/worker.py", line 528, in submit_task
args_for_local_scheduler.append(put(arg))
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/ray/worker.py", line 2198, in put
worker.put_object(object_id, value)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/ray/worker.py", line 357, in put_object
self.store_and_register(object_id, value)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/ray/worker.py", line 287, in store_and_register
object_id.id()), self.serialization_context)
File "pyarrow/plasma.pyx", line 394, in pyarrow.plasma.PlasmaClient.put (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:5157)
File "pyarrow/serialization.pxi", line 235, in pyarrow.lib.serialize (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:73959)
File "pyarrow/serialization.pxi", line 106, in pyarrow.lib.SerializationContext._serialize_callback (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:71924)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 881, in dumps
cp.dump(obj)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 268, in dump
return Pickler.dump(self, obj)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/pickle.py", line 409, in dump
self.save(obj)
File "/home/ilab/anaconda3/envs/bmxnet/lib/python3.6/pickle.py", line 496, in save
rv = reduce(self.proto)
TypeError: can't pickle _thread.RLock objects
Disconnecting client on fd 9
Disconnecting client on fd 29
[WARN] (/ray/src/global_scheduler/global_scheduler.cc:404) Missed too many heartbeats from local scheduler, marking as dead.
Disconnecting client on fd 28
Disconnecting client on fd 27
Disconnecting client on fd 26
Disconnecting client on fd 24
Disconnecting client on fd 25
Disconnecting client on fd 23
Disconnecting client on fd 22
Disconnecting client on fd 21
Disconnecting client on fd 20
Disconnecting client on fd 19
Disconnecting client on fd 18
Disconnecting client on fd 17
Disconnecting client on fd 16
Disconnecting client on fd 15
Disconnecting client on fd 14
Disconnecting client on fd 13
Disconnecting client on fd 12
Disconnecting client on fd 11
Disconnecting client on fd 10
[WARN] (/ray/src/local_scheduler/local_scheduler_client.cc:112) Exiting because local scheduler closed connection.
Disconnecting client on fd 5
Disconnecting client on fd 31
Disconnecting client on fd 7

All 14 comments

Hi @ehsanmok, do you mind sharing some source code? It would help us understand what object isn't serializing correctly.

@richardliaw here's the relevant summarized code:

EDITED! will update soon.

I just tried

import mxnet as mx
import ray

ray.init()

x = mx.nd.array([[1, 2, 3], [4, 5, 6]])

Then

>>> x_id = ray.put(x)
WARNING: Falling back to serializing objects of type <class 'mxnet.ndarray.ndarray.NDArray'> by using pickle. This may be inefficient.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/Users/rkn/anaconda3/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 51, in __del__
    check_call(_LIB.MXNDArrayFree(self.handle))
AttributeError: handle
>>> ray.get(x_id)
[[ 1.  2.  3.]
 [ 4.  5.  6.]]
<NDArray 2x3 @cpu(0)>

>>> x
[[ 1.  2.  3.]
 [ 4.  5.  6.]]
<NDArray 2x3 @cpu(0)>

So serialization of MXNet tensors gives a warning, but it seems to behave correctly.

The application seems to be crashing because of the error

TypeError: can't pickle _thread.RLock objects

Can you see which line is generating that? Do you have a remote function or actor that closes over an MXNet neural net?
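To illustrate the kind of thing that usually causes it (a made-up example, not your code): any object that holds a thread lock will hit this error when Ray has to serialize it, whether it is captured in a closure or passed as a task argument.

import threading
import ray

ray.init()

class TrainerState(object):
    # Stands in for framework state (handles, executors, locks) that
    # cannot be pickled.
    def __init__(self):
        self.lock = threading.RLock()

state = TrainerState()

@ray.remote
def train_step(s):
    return 0

# Ray serializes the argument to ship it to a worker; pickling the RLock
# inside it fails with "TypeError: can't pickle _thread.RLock objects".
train_step.remote(state)

# Closing over `state` inside the remote function fails the same way when
# the function is exported. The usual fix is to construct such objects
# inside the task or actor and pass only plain data (numpy arrays, lists,
# dicts) between tasks.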

@robertnishihara thanks for your reply! Sorry, I got caught up fixing it. I think the problem is with the gradient updates (which are of type mx.nd.NDArray) in the parameter server (actor), and since I'm using an older version of MXNet, getting the right params is a bit of a headache. I'll post an update on this.

@robertnishihara just out of curiosity, do you think you can leverage ONNX in ray? https://github.com/onnx/onnx

Yes, Ray and ONNX are completely compatible. ONNX should make it easier to support more neural net frameworks.

@richardliaw @robertnishihara I found the bug: it was in the way MXNet returns the intermediate gradients in Python via gradients = module._exec_group.grad_arrays, which is a List[List[mx.nd.NDArray]] (long live statically typed languages!). The problem was in the push method of ParameterServer, and I changed the add to an in-place element-wise add with self.weights[key] += value[0]. It works now (a rough sketch of the fix is at the end of this comment), but as you also showed above, there are serialization warnings:

WARNING: Serializing objects of type <class 'ray.signature.FunctionSignature'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type <class 'mxnet.io.NDArrayIter'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases

Do you have any suggestions for serializing objects like these more efficiently?
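For reference, here is roughly what my corrected push looks like now (simplified; the actor layout is just how I happen to organize my script):

import ray

@ray.remote
class ParameterServer(object):
    def __init__(self, keys, values):
        self.weights = dict(zip(keys, values))

    def push(self, keys, grad_arrays):
        # module._exec_group.grad_arrays is a List[List[mx.nd.NDArray]]
        # (one inner list per device), so take the first device's gradients
        # and add them element-wise in place.
        for key, value in zip(keys, grad_arrays):
            self.weights[key] += value[0]

    def pull(self, keys):
        return [self.weights[key] for key in keys]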

I think those warnings are harmless in this case, but it would make sense to provide native support for mx.NDArrays later on.
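In the meantime, one workaround is to convert to numpy arrays at task and actor boundaries, since Ray/Arrow serializes numpy arrays efficiently, and wrap them back into NDArrays on the other side. A small sketch:

import mxnet as mx
import ray

ray.init()

@ray.remote
def scale(array, factor):
    # Receive a plain numpy array, do the MXNet work locally, and return
    # a numpy array so the result also serializes efficiently.
    nd = mx.nd.array(array)
    return (nd * factor).asnumpy()

x = mx.nd.array([[1, 2, 3], [4, 5, 6]])
result = ray.get(scale.remote(x.asnumpy(), 2.0))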

@richardliaw @robertnishihara Actually, I ran my async training test locally with 5 workers on a CPU-only build of MXNet (USE_CUDA=0 USE_CUDNN=0), so the issue I reported above should not be a factor, training LeNet on MNIST, and after a while it crashes for no apparent reason. Here are the complete logs:

Waiting for redis server at 127.0.0.1:56356 to respond...
Waiting for redis server at 127.0.0.1:38384 to respond...
Allowing the Plasma store to use up to 63.3832GB of memory.
Starting object store with directory /dev/shm and huge page support disabled
Starting local scheduler with 20 CPUs, 0 GPUs

======================================================================
View the web UI at http://localhost:8907/notebooks/ray_ui23072.ipynb?token=a0443352db2b82077d0d1aca166628e767e9d287a11e986b
======================================================================

WARNING: Falling back to serializing objects of type <class 'mxnet.ndarray.NDArray'> by using pickle. This may be inefficient.
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/ilab/anaconda3/envs/bmxnet-cpu/lib/python3.6/site-packages/mxnet-0.10.1-py3.6-linux-x86_64.egg/mxnet/_ctypes/ndarray.py", line 36, in __del__
    check_call(_LIB.MXNDArrayFree(self.handle))
AttributeError: handle
WARNING: Serializing objects of type <class 'ray.signature.FunctionSignature'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type <class 'mxnet.io.NDArrayIter'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
INFO:root:Epoch[0] Batch [10]   Speed: 200.88 samples/sec       accuracy=0.110909
INFO:root:Epoch[0] Batch [20]   Speed: 188.80 samples/sec       accuracy=0.104000
INFO:root:Epoch[0] Batch [30]   Speed: 177.89 samples/sec       accuracy=0.123000
INFO:root:Epoch[0] Batch [40]   Speed: 172.48 samples/sec       accuracy=0.111000
INFO:root:Epoch[0] Batch [50]   Speed: 182.57 samples/sec       accuracy=0.096000
INFO:root:Epoch[0] Batch [60]   Speed: 166.09 samples/sec       accuracy=0.104000
INFO:root:Epoch[0] Batch [70]   Speed: 168.58 samples/sec       accuracy=0.110000
INFO:root:Epoch[0] Batch [80]   Speed: 173.67 samples/sec       accuracy=0.105000
INFO:root:Epoch[0] Batch [90]   Speed: 186.25 samples/sec       accuracy=0.107000
INFO:root:Epoch[0] Batch [100]  Speed: 182.93 samples/sec       accuracy=0.130000
INFO:root:Epoch[0] Batch [110]  Speed: 177.66 samples/sec       accuracy=0.118000
INFO:root:Epoch[0] Batch [120]  Speed: 170.34 samples/sec       accuracy=0.116000
INFO:root:Epoch[0] Batch [130]  Speed: 181.30 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [140]  Speed: 176.14 samples/sec       accuracy=0.119000
INFO:root:Epoch[0] Batch [150]  Speed: 181.17 samples/sec       accuracy=0.103000
INFO:root:Epoch[0] Batch [160]  Speed: 181.85 samples/sec       accuracy=0.129000
INFO:root:Epoch[0] Batch [170]  Speed: 175.28 samples/sec       accuracy=0.110000
INFO:root:Epoch[0] Batch [180]  Speed: 175.60 samples/sec       accuracy=0.113000
INFO:root:Epoch[0] Batch [190]  Speed: 177.59 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [200]  Speed: 182.92 samples/sec       accuracy=0.129000
INFO:root:Epoch[0] Batch [210]  Speed: 186.09 samples/sec       accuracy=0.130000
INFO:root:Epoch[0] Batch [220]  Speed: 170.58 samples/sec       accuracy=0.125000
INFO:root:Epoch[0] Batch [230]  Speed: 182.92 samples/sec       accuracy=0.123000
INFO:root:Epoch[0] Batch [240]  Speed: 187.25 samples/sec       accuracy=0.112000
INFO:root:Epoch[0] Batch [250]  Speed: 180.10 samples/sec       accuracy=0.122000
INFO:root:Epoch[0] Batch [260]  Speed: 180.23 samples/sec       accuracy=0.095000
INFO:root:Epoch[0] Batch [270]  Speed: 176.00 samples/sec       accuracy=0.109000
INFO:root:Epoch[0] Batch [280]  Speed: 186.35 samples/sec       accuracy=0.125000
INFO:root:Epoch[0] Batch [290]  Speed: 189.34 samples/sec       accuracy=0.104000
INFO:root:Epoch[0] Batch [300]  Speed: 186.79 samples/sec       accuracy=0.116000
INFO:root:Epoch[0] Batch [310]  Speed: 176.49 samples/sec       accuracy=0.099000
INFO:root:Epoch[0] Batch [320]  Speed: 162.21 samples/sec       accuracy=0.106000
INFO:root:Epoch[0] Batch [330]  Speed: 160.42 samples/sec       accuracy=0.106000
INFO:root:Epoch[0] Batch [340]  Speed: 174.17 samples/sec       accuracy=0.105000
INFO:root:Epoch[0] Batch [350]  Speed: 168.97 samples/sec       accuracy=0.098000
INFO:root:Epoch[0] Batch [360]  Speed: 190.36 samples/sec       accuracy=0.121000
INFO:root:Epoch[0] Batch [370]  Speed: 184.14 samples/sec       accuracy=0.101000
INFO:root:Epoch[0] Batch [380]  Speed: 186.22 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [390]  Speed: 208.58 samples/sec       accuracy=0.120000
INFO:root:Epoch[0] Batch [400]  Speed: 182.48 samples/sec       accuracy=0.108000
INFO:root:Epoch[0] Batch [410]  Speed: 180.31 samples/sec       accuracy=0.108000
INFO:root:Epoch[0] Batch [420]  Speed: 175.86 samples/sec       accuracy=0.117000
INFO:root:Epoch[0] Batch [430]  Speed: 167.76 samples/sec       accuracy=0.123000
INFO:root:Epoch[0] Batch [440]  Speed: 183.33 samples/sec       accuracy=0.112000
INFO:root:Epoch[0] Batch [450]  Speed: 173.45 samples/sec       accuracy=0.120000
INFO:root:Epoch[0] Batch [460]  Speed: 200.67 samples/sec       accuracy=0.095000
INFO:root:Epoch[0] Batch [470]  Speed: 180.69 samples/sec       accuracy=0.126000
INFO:root:Epoch[0] Batch [480]  Speed: 180.14 samples/sec       accuracy=0.105000
INFO:root:Epoch[0] Batch [490]  Speed: 187.03 samples/sec       accuracy=0.100000
INFO:root:Epoch[0] Batch [500]  Speed: 179.88 samples/sec       accuracy=0.111000
INFO:root:Epoch[0] Batch [510]  Speed: 172.18 samples/sec       accuracy=0.103000
INFO:root:Epoch[0] Batch [520]  Speed: 172.94 samples/sec       accuracy=0.119000
INFO:root:Epoch[0] Batch [530]  Speed: 181.15 samples/sec       accuracy=0.116000
INFO:root:Epoch[0] Batch [540]  Speed: 185.95 samples/sec       accuracy=0.109000
INFO:root:Epoch[0] Batch [550]  Speed: 181.31 samples/sec       accuracy=0.114000
INFO:root:Epoch[0] Batch [560]  Speed: 194.81 samples/sec       accuracy=0.108000
INFO:root:Epoch[0] Batch [570]  Speed: 176.30 samples/sec       accuracy=0.119000
INFO:root:Epoch[0] Batch [580]  Speed: 199.54 samples/sec       accuracy=0.103000
INFO:root:Epoch[0] Batch [590]  Speed: 194.18 samples/sec       accuracy=0.115000
Disconnecting client on fd 9
Disconnecting client on fd 28
Disconnecting client on fd 24
Disconnecting client on fd 23
Disconnecting client on fd 29
Disconnecting client on fd 22
Disconnecting client on fd 21
Disconnecting client on fd 20
Disconnecting client on fd 19
Disconnecting client on fd 18
Disconnecting client on fd 17
Disconnecting client on fd 16
Disconnecting client on fd 15
Disconnecting client on fd 14
Disconnecting client on fd 13
Disconnecting client on fd 12
Disconnecting client on fd 25
[WARN] (/ray/src/local_scheduler/local_scheduler_client.cc:112) Exiting because local scheduler closed connection.
[WARN] (/ray/src/global_scheduler/global_scheduler.cc:404) Missed too many heartbeats from local scheduler, marking as dead.

@richardliaw @robertnishihara Sorry guys, this turns out not to be an issue; it just does that at the end of the epoch. It would be better to end things more gracefully, though (perhaps with a context manager!). I tested this on mxnet-cpu and mxnet-gpu and it works fine.

You're referring to all of the logging statements

Disconnecting client on fd 9
Disconnecting client on fd 28
Disconnecting client on fd 24
Disconnecting client on fd 23
Disconnecting client on fd 29
Disconnecting client on fd 22
Disconnecting client on fd 21
Disconnecting client on fd 20
Disconnecting client on fd 19
Disconnecting client on fd 18
Disconnecting client on fd 17
Disconnecting client on fd 16
Disconnecting client on fd 15
Disconnecting client on fd 14
Disconnecting client on fd 13
Disconnecting client on fd 12
Disconnecting client on fd 25
[WARN] (/ray/src/local_scheduler/local_scheduler_client.cc:112) Exiting because local scheduler closed connection.
[WARN] (/ray/src/global_scheduler/global_scheduler.cc:404) Missed too many heartbeats from local scheduler, marking as dead.

that happen when the driver exits? You're right, we should get rid of those.

@ehsanmok I created #1324 to keep track of that.

@robertnishihara Yes, exactly! A more graceful exit would prevent confusion; I mistook the output above for a driver failure.

Closing because I believe this has been resolved (aside from the cleaner warning messages). Please reopen if there are still issues!
