Ray: Java serialization should use Apache Arrow serialization format.

Created on 30 May 2018 · 7 Comments · Source: ray-project/ray

java

All 7 comments

One day this will hopefully allow interop between Java and Python, although there is going to be some more work involved to actually make that happen (in particular, writing something like pyarrow.serialize for Java).

Note that this is currently blocked on https://issues.apache.org/jira/browse/ARROW-1692. Any help with that would be appreciated @eric-jj @songqing @imzhenyu @salah-man

Someone in my team will take the work item.

Hi @robertnishihara, I have some questions about enabling Arrow serialization for the Java worker and hope you can weigh in here. Thanks in advance!
Based on my rough understanding of Arrow, ideally we should keep the data only as a byte array in memory, which is managed by and accessed through the Arrow Java interface. That means that in the application layer we may not be able to write Java code in a natural way (for example, calling classInstance.fieldA), and the code will also become complicated. So my proposal is:

  1. Introduce a helper class that implements serialization/deserialization for simple types (int, string, bool, etc.) and for collections (list, map, etc.) of simple types.
  2. Introduce an interface that all custom classes need to implement. A custom class must either implement serialize/deserialize itself or convert its data into collections of simple types. With this approach, though, we may need to create new Java class instances during deserialization.

Does that sound good to you?
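
As a rough illustration of the two points above, the helper and the interface might look something like the sketch below. All names here (`SimpleTypeWriter`, `SimpleTypeReader`, `ArrowSerializable`, `Point`) are hypothetical, and a plain `java.nio.ByteBuffer` stands in for the Arrow-managed byte array, so this shows only the possible shape of the API and makes no actual Arrow calls.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Point 1, write side: hypothetical helper for simple types and collections.
// A fixed-size ByteBuffer stands in for the Arrow-managed byte array.
final class SimpleTypeWriter {
    private final ByteBuffer buf = ByteBuffer.allocate(1024); // small, for brevity

    void writeInt(int v) { buf.putInt(v); }

    void writeString(String v) {
        byte[] utf8 = v.getBytes(StandardCharsets.UTF_8);
        buf.putInt(utf8.length); // length prefix, then the bytes
        buf.put(utf8);
    }

    void writeIntList(List<Integer> v) {
        buf.putInt(v.size()); // element count, then the elements
        for (int x : v) buf.putInt(x);
    }

    byte[] toBytes() { return Arrays.copyOf(buf.array(), buf.position()); }
}

// Point 1, read side: reads values back in the order they were written.
final class SimpleTypeReader {
    private final ByteBuffer buf;

    SimpleTypeReader(byte[] data) { buf = ByteBuffer.wrap(data); }

    int readInt() { return buf.getInt(); }

    String readString() {
        byte[] utf8 = new byte[buf.getInt()];
        buf.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    List<Integer> readIntList() {
        int n = buf.getInt();
        List<Integer> v = new ArrayList<>(n);
        for (int i = 0; i < n; i++) v.add(buf.getInt());
        return v;
    }
}

// Point 2: hypothetical interface that custom classes implement.
interface ArrowSerializable {
    void writeTo(SimpleTypeWriter out);
    void readFrom(SimpleTypeReader in);
}

// Example custom class that reduces itself to simple types.
final class Point implements ArrowSerializable {
    int x;
    String label = "";
    List<Integer> tags = new ArrayList<>();

    @Override public void writeTo(SimpleTypeWriter out) {
        out.writeInt(x);
        out.writeString(label);
        out.writeIntList(tags);
    }

    @Override public void readFrom(SimpleTypeReader in) {
        x = in.readInt();
        label = in.readString();
        tags = in.readIntList();
    }
}
```

Note that `readFrom` has to be called on a freshly created `Point`, which is the extra instance creation mentioned in point 2.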

Point 1 is similar to what we did to implement pyarrow.serialize and pyarrow.deserialize. See https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_python.cc and https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/python_to_arrow.cc.

With Python, we don't have an interface for custom classes to extend and instead try to do it automatically (though this doesn't always work).

Basically, for a custom class in Python like

class Foo:
    def __init__(self):
        self.a = 1
        self.b = 2

we first convert it to a dictionary like {'_pytype_': 'Foo', 'data': {'a': 1, 'b': 2}} and then serialize that using Arrow. To deserialize it, we use Arrow to get the dictionary back, use the string 'Foo' to look up the class definition (which has already been broadcast everywhere by a different mechanism), and then instantiate a Foo object from the data field (see https://github.com/apache/arrow/blob/a82a0273a0b9f8583de005d96475ab8685963ed8/python/pyarrow/serialization.pxi#L128-L186; those methods are called from https://github.com/apache/arrow/blob/a82a0273a0b9f8583de005d96475ab8685963ed8/cpp/src/arrow/python/python_to_arrow.cc#L383).

Do you think an approach like this could work in Java? It's possible that the natural approach in Java differs here, but something like this might work.
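
For what it's worth, a Java analogue of this might look roughly like the sketch below. It uses reflection to turn an object into a map with the same two keys, and a hypothetical in-process registry (`ObjectToMap.REGISTRY`) stands in for the class definitions that are broadcast separately. The Arrow step is left out entirely, and all class and method names are made up for illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

// Java counterpart of the Python Foo class above.
final class Foo {
    int a = 1;
    int b = 2;
}

// Hypothetical helper: object -> map -> object, with no Arrow involved.
final class ObjectToMap {
    // Stand-in for "class definitions already broadcast everywhere".
    static final Map<String, Class<?>> REGISTRY = new HashMap<>();

    static void register(Class<?> cls) {
        REGISTRY.put(cls.getSimpleName(), cls);
    }

    // Builds {"_pytype_": <class name>, "data": {<field>: <value>, ...}}.
    static Map<String, Object> toMap(Object obj) {
        try {
            Map<String, Object> data = new HashMap<>();
            for (Field f : obj.getClass().getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                f.setAccessible(true);
                data.put(f.getName(), f.get(obj));
            }
            Map<String, Object> out = new HashMap<>();
            out.put("_pytype_", obj.getClass().getSimpleName());
            out.put("data", data);
            return out;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Looks up the class by name, creates an instance, and fills in its fields.
    @SuppressWarnings("unchecked")
    static Object fromMap(Map<String, Object> map) {
        String name = (String) map.get("_pytype_");
        Class<?> cls = REGISTRY.get(name);
        if (cls == null) throw new IllegalArgumentException("unknown type: " + name);
        try {
            Object obj = cls.getDeclaredConstructor().newInstance();
            Map<String, Object> data = (Map<String, Object>) map.get("data");
            for (Map.Entry<String, Object> entry : data.entrySet()) {
                Field f = cls.getDeclaredField(entry.getKey());
                f.setAccessible(true);
                f.set(obj, entry.getValue());
            }
            return obj;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

This only handles the fields declared directly on the class and assumes a no-argument constructor. A real version would presumably also need fully qualified class names rather than `getSimpleName()`, and nested custom objects would have to be converted recursively.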

For primitive types and maybe some simple objects like arrays/lists/tuples, it may make sense to use the same format that we use in the Python to Arrow code (so that an object serialized from Python could potentially be deserialized from Java).

cc @pcmoritz

cc @wesm @jacques-n

Hi @robertnishihara, nice to hear your response!
Yes, it makes sense to call Arrow in a similar way from both Java and Python, which will make the caller code in Ray (and potentially elsewhere) more elegant. But to me it sounds like an interface refinement that would be better placed in Arrow itself. On our side, we may want to prioritize Arrow adoption in Ray at this point. Actually, the current Java serialization interfaces (Serializer.encode/Serializer.decode) are similar to pyarrow.serialize and pyarrow.deserialize, and I will follow that style here. The interface I mentioned in point 2 will serve as the callback functions, as in Python. We need a way for each class to implement its own logic, and an interface should be a reasonable option. As for the primitive types, yes, they should have the same format as in Python, if Arrow does so. :)

Automatically closing stale issue. Please re-open if still relevant.
