Ray: Ray serialization problems

Created on 17 May 2017 · 21 comments · Source: ray-project/ray

We should try to understand better what we can serialize and what we cannot, so I'm making this issue to collect problems with our serialization.

There are five categories of problems:

  1. Things that not even Python pickle can serialize. These are mostly out of scope for us.
  2. Things that Python pickle can serialize but cloudpickle cannot. We'd like to be aware of these and potentially fix them or report them upstream.
  3. Things that cloudpickle can serialize but we cannot. We'd like to be aware of these and fix them.
  4. Things that we serialize with Arrow but that deserialize incorrectly. We'd like to fix these.
  5. Things where serialization is slower than expected. We'd like to know about these problems and potentially fix them.

If you run into problems like the above, please post a minimal example here or create a new issue for it.


All 21 comments

Here is one that fits into category 3:

import re
b = re.compile(r"\d+\.\d*")

We actually can handle this case. If you try ray.get(ray.put(b)), it works. The thing that is failing when you pass b into a remote function is something else. I'm looking into it.

Ok, the issue here is that pickle.dumps(b) succeeds, but pickle.dumps(type(b)) fails, so we can't do the usual register_class approach because that involves pickling the type.
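A minimal sketch of the asymmetry described above. Note that the behavior of pickling the type is version-dependent: at the time of this issue the compiled-pattern type lived in _sre and could not be looked up by name, while on recent Pythons it is the importable re.Pattern; pickling the instance itself works either way.

```python
import pickle
import re

b = re.compile(r"\d+\.\d*")

# The compiled pattern itself pickles and round-trips fine (pickle
# rebuilds it via re.compile(pattern, flags))...
b2 = pickle.loads(pickle.dumps(b))

# ...the historical problem was pickle.dumps(type(b)): classes pickle
# by reference, so an unimportable type name breaks the register_class
# approach even though the instance serializes without trouble.
```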

That said, it seems like we shouldn't actually need to do register_class if we're pickling things.

Is category (2) a real thing? Cloudpickle is supposed to be more general. It is true that pickle "succeeds" at serializing some things that cloudpickle fails at, e.g.,

import pickle
import cloudpickle

class Foo(object):
    def __init__(self):
        super

cloudpickle.dumps(Foo)  # PicklingError: Could not pickle object as excessively deep recursion required.
pickle.dumps(Foo)  # b'\x80\x03c__main__\nFoo\nq\x00.'

However, this is misleading. Pickle only succeeds because it doesn't capture the class definition (you couldn't unpickle it in another process).

After #550, I think category (3) should more or less be nonexistent. I'm probably wrong about this, but we'll see.

Category (4) is a big problem. E.g., #319.

I just remembered an important class of types that fall in category (3), which is subtypes of standard types like lists/dicts/tuples (see #512 for an example).

For example, if we serialize a collections.defaultdict, it will be deserialized as a plain dictionary. This presumably happens with many subclassed types.
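A small illustration of what is lost when a dict subclass comes back as a plain dict. Ordinary pickle, by contrast, preserves both the subtype and its default_factory:

```python
import collections
import pickle

d = collections.defaultdict(list)
d["a"].append(1)

# pickle keeps the subtype and the default_factory; a serializer that
# maps dict-likes onto plain dicts would silently drop the defaulting
# behavior (category 4: deserialization is incorrect).
d2 = pickle.loads(pickle.dumps(d))
```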

I opened https://issues.apache.org/jira/browse/ARROW-1059 with the goal of being able to more easily expand the kinds of objects that can be serialized using the Arrow IPC machinery. If you have objects that don't fit the set of "fast storage" types (tensors and the other supported primitive types), these could be stored as pickles, and pyarrow could register a pickle deserializer to understand this user "message" type.

Ray currently fails to serialize xarray objects #748 (this is category (3) above).

Ray also fails to serialize tensorflow-related objects, like Tensor.
Do you have a plan to fix it?
I'm working on reinforcement learning algorithms, and sometimes I need to broadcast a tf model or Tensor, and it fails.

This is already possible with a helper function, see docs.

@programmerlwj the underlying problem is that TensorFlow objects can't easily be serialized (e.g., with pickle). The preferable solution is to only ship the neural net weights or the gradient (e.g., as a list or dictionary of numpy arrays).
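A sketch of the suggested pattern (the layer names here are made up): instead of shipping a live TensorFlow object, extract the weights into a dictionary of numpy arrays, which serializes cheaply and unambiguously.

```python
import pickle
import numpy as np

# Hypothetical weights for a tiny model; in TensorFlow these would be
# read out of the variables/session rather than built by hand.
weights = {
    "dense/kernel": np.random.randn(4, 2),
    "dense/bias": np.zeros(2),
}

# A dict of numpy arrays round-trips through ordinary pickle (and is
# also handled efficiently by Ray's object store, unlike graph or
# session objects).
restored = pickle.loads(pickle.dumps(weights))
```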

This is a bit of a doozy, but I would really like to be able to use ray with the excellent sacred library - they belong together, like peanut butter and jelly. (I'm specifically referring to ray tune here.)

The sacred.Experiment object is a nightmare to serialize though. pickle fails to serialize it in its simplest incarnation and real-world experiments are usually significantly more complicated.

from sacred.experiment import Experiment
test_exp = Experiment('test')

import pickle
pickle.dumps(test_exp) # nope

import cloudpickle
cloudpickle.dumps(test_exp) # nope

import dill 
dill.dumps(test_exp) # also nope

Is there any hope for ever serializing experiment?

If not, there is an early stage discussion about an OO-based interface to sacred that might provide a path forward.

Update: I have managed to get tune running without having to serialize experiment and am now no longer sure that is required

We encountered many difficulties with serialization. #3917
In all cases, we wrote our own __reduce__ method to pickle our objects for other purposes. Can we configure Ray to use custom serialization?
Thank you!
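For reference, a minimal sketch of the __reduce__ approach mentioned above (the Connection class here is made up for illustration): __reduce__ tells pickle how to rebuild the object from cheap, picklable state, and since Ray falls back to cloudpickle for unknown types, a class defining __reduce__ will typically serialize through Ray as well.

```python
import pickle

class Connection:
    """Toy stand-in for an object holding an unpicklable resource."""

    def __init__(self, dsn):
        self.dsn = dsn
        self._handle = object()  # pretend this is a live socket/connection

    def __reduce__(self):
        # Rebuild from the DSN alone; the live handle is reopened on
        # load rather than serialized.
        return (Connection, (self.dsn,))

conn = pickle.loads(pickle.dumps(Connection("db://host/1")))
```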

@JessicaSchrouff yes definitely, can you see if ray.register_custom_serializer does what you want? https://ray.readthedocs.io/en/latest/api.html#ray.register_custom_serializer

@pcmoritz
I'm having some trouble serializing spacy objects (nlp models):
pickle cannot serialize them, but cloudpickle and dill do.
Would you have a recommendation, or should I go with a custom serialization function?

Great work!

@alaameloh Can you share an example that you would like to get working?

custom_model = spacy.load('path_to_model')
for doc in documents:
    result_ids.append(processing_function.remote(doc, custom_model))
Here ray falls back to using pickle to serialize the spacy model. However, with the spacy version I'm currently working with (2.0.16), pickle doesn't work (but cloudpickle does). It gets stuck afterwards and crashes after some time (due to memory, I believe).

Depending on the model, the loading time for spacy would be an inconvenience if I simply loaded the model inside processing_function and executed that with every call.

@alaameloh hm, when we say "falling back to pickle" I think we actually mean "cloudpickle". Can you provide a runnable code snippet that we can use to reproduce the issue?

@robertnishihara

import spacy
import ray

@ray.remote
def processing(nlp, text):
    processed = nlp(text)
    return processed

ray.init()
texts_list = ['this is just a random text to illustrate the serializing use case for ray',
              'Ray is a flexible, high-performance distributed execution framework.To launch a Ray cluster, \
              either privately, on AWS, or on GCP, follow these instructions.',
              'This document describes what Python objects Ray can and cannot serialize into the object store. \
              Once an object is placed in the object store, it is immutable.']

processed_id = []
nlp = spacy.load('en')
for text in texts_list:
    processed_id.append(processing.remote(nlp, text))

processed_list = ray.get(processed_id)
WARNING: Falling back to serializing objects of type <class 'pathlib.PosixPath'> by using pickle. This may be inefficient.
WARNING: Falling back to serializing objects of type <class 'spacy.vocab.Vocab'> by using pickle. This may be inefficient.
WARNING: Falling back to serializing objects of type <class 'spacy.tokenizer.Tokenizer'> by using pickle. This may be inefficient.
WARNING: Falling back to serializing objects of type <class 'spacy.pipeline.DependencyParser'> by using pickle. This may be inefficient.
WARNING: Falling back to serializing objects of type <class 'spacy.pipeline.EntityRecognizer'> by using pickle. This may be inefficient.
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0408 14:25:37.543133 10242 node_manager.cc:245] Last heartbeat was sent 961 ms ago 
Traceback (most recent call last):

  File "/home/alaameloh/anaconda3/envs/poc/lib/python3.7/site-packages/ray/remote_function.py", line 71, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/home/alaameloh/anaconda3/envs/poc/lib/python3.7/site-packages/ray/remote_function.py", line 121, in _remote
    resources=resources)
  File "/home/alaameloh/anaconda3/envs/poc/lib/python3.7/site-packages/ray/worker.py", line 591, in submit_task
    args_for_local_scheduler.append(put(arg))
  File "/home/alaameloh/anaconda3/envs/poc/lib/python3.7/site-packages/ray/worker.py", line 2233, in put
    worker.put_object(object_id, value)
  File "/home/alaameloh/anaconda3/envs/poc/lib/python3.7/site-packages/ray/worker.py", line 359, in put_object
    self.store_and_register(object_id, value)
  File "/home/alaameloh/anaconda3/envs/poc/lib/python3.7/site-packages/ray/worker.py", line 293, in store_and_register
    self.task_driver_id))
  File "/home/alaameloh/anaconda3/envs/poc/lib/python3.7/site-packages/ray/utils.py", line 437, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 493, in pyarrow._plasma.PlasmaClient.put
  File "pyarrow/serialization.pxi", line 345, in pyarrow.lib.serialize
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 134217728 failed

Wanted to add a couple of notes here for problems that I am experiencing related to serialization.

  1. Database connections or pooled objects cannot be passed; connections must be opened/closed inside the remote function.
  2. Unit testing ray code is very challenging, and I suspect it's due to serialization. Example:
import ray
ray.init()
def manager_fun_list(x_list, fun):
    res = ray.get([manager_fun_one.remote(x, fun) for x in x_list])
    return res

@ray.remote
def manager_fun_one(x, fun):
    return(fun(x))

Now let's test this code:

import unittest
from .dummy import manager_fun_list

class TestDummy(unittest.TestCase):
    x_list = [1,2,3,4,5]
    def myfun(self, x):
        return x*2

    def test_managed_myfun(self):
        manager_fun_list(self.x_list, self.myfun)

Running the test results in TypeError: Cannot serialize socket object, likely because of the problem with self/super discussed above.

Now, here's something interesting that I can't seem to get past when trying to create unit tests for your own recommended solution (https://ray.readthedocs.io/en/latest/serialization.html#last-resort-workaround):

import ray
import pickle

ray.init()
def manager_fun_list(x_list, fun):
    fun = pickle.dumps(fun)
    res = ray.get([manager_fun_one.remote(x, fun) for x in x_list])
    return res

@ray.remote
def manager_fun_one(x, fun):
    fun = pickle.loads(fun)
    return(fun(x))
import unittest
from .dummy import manager_fun_list

def myfun(x):
    return x*2

class TestDummy(unittest.TestCase):
    x_list = [1,2,3,4,5]

    def test_managed_myfun(self):
        res = manager_fun_list(self.x_list, myfun)
        self.assertEqual(res, [2,4,6,8,10])

And the result... is.... ??? huh?

ERROR: test_managed_myfun (tests.test_manager_fun_list.TestDummy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspaces/.../.../tests/test_manager_fun_list.py", line 12, in test_managed_myfun
    res = manager_fun_list(self.x_list, myfun)
  File "/workspaces/.../.../dummy.py", line 7, in manager_fun_list
    res = ray.get([manager_fun_one.remote(x, fun) for x in x_list])
  File "/usr/local/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
    raise value
ray.exceptions.RayTaskError: ray_worker (pid=11502, host=662e97277f24)
  File "/workspaces/.../.../dummy.py", line 12, in manager_fun_one
    fun = pickle.loads(fun)
ModuleNotFoundError: No module named 'tests'
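The ModuleNotFoundError is consistent with how plain pickle handles functions: it serializes them by reference (module plus qualified name), so the unpickling side must be able to import the defining module ('tests' here, which does not exist on the worker). A quick illustration of the by-reference behavior:

```python
import pickle

def double(x):
    return x * 2

# A module-level function pickles by reference; loading works as long
# as the defining module is importable on the receiving side.
f = pickle.loads(pickle.dumps(double))

# A lambda has no importable qualified name, so plain pickle fails on
# it (cloudpickle exists precisely to serialize such functions by value).
try:
    pickle.dumps(lambda x: x)
    lambda_pickled = True
except Exception:
    lambda_pickled = False
```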

These are all fixed now, I think.
