One day this will hopefully allow interop between Java and Python, although there is going to be some more work involved to actually make that happen (in particular, writing something like pyarrow.serialize for Java).
Note that this is currently blocked on https://issues.apache.org/jira/browse/ARROW-1692. Any help with that would be appreciated @eric-jj @songqing @imzhenyu @salah-man
Someone in my team will take the work item.
Hi @robertnishihara, I have some questions about enabling Arrow serialization for the Java worker and hope you can weigh in here. Thanks in advance!
Based on my rough understanding of Arrow, ideally we should only have a byte array in memory to hold the data, which is managed by and accessed through the Arrow Java interface. That means that in the application layer we may not be able to write Java code in a natural way (for example, calling classInstance.fieldA), and the code will also become more complicated. So my opinion is:
Point 1 is similar to what we did to implement pyarrow.serialize and pyarrow.deserialize. See https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_python.cc and https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/python_to_arrow.cc.
With Python, we don't have an interface for custom classes to extend and instead try to do it automatically (though this doesn't always work).
Basically, for a custom class in Python like
class Foo:
    def __init__(self):
        self.a = 1
        self.b = 2
we first convert it to a dictionary like {'_pytype_': 'Foo', 'data': {'a': 1, 'b': 2}} and then serialize that using Arrow. To deserialize it, we deserialize it back to a dictionary using Arrow, use the string 'Foo' to look up the class definition (which has already been broadcast everywhere by a different mechanism), and then instantiate a Foo object from the data field. See https://github.com/apache/arrow/blob/a82a0273a0b9f8583de005d96475ab8685963ed8/python/pyarrow/serialization.pxi#L128-L186; those methods are called from https://github.com/apache/arrow/blob/a82a0273a0b9f8583de005d96475ab8685963ed8/cpp/src/arrow/python/python_to_arrow.cc#L383.
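To make the round trip above concrete, here is a minimal pure-Python sketch of the dictionary conversion. It leaves out the Arrow step entirely (the dictionary is what would be handed to Arrow), and `CLASS_REGISTRY`, `register`, `to_dict`, and `from_dict` are hypothetical names standing in for the real class-broadcast and lookup machinery, not pyarrow APIs.

```python
# Hypothetical stand-in for the mechanism that broadcasts class definitions.
CLASS_REGISTRY = {}


def register(cls):
    """Record a class under its name so it can be looked up on deserialization."""
    CLASS_REGISTRY[cls.__name__] = cls
    return cls


@register
class Foo:
    def __init__(self):
        self.a = 1
        self.b = 2


def to_dict(obj):
    """Convert a custom object to the {'_pytype_': ..., 'data': ...} form."""
    return {"_pytype_": type(obj).__name__, "data": dict(vars(obj))}


def from_dict(d):
    """Rebuild an object from the dictionary form using the class registry."""
    cls = CLASS_REGISTRY[d["_pytype_"]]
    obj = cls.__new__(cls)          # create the instance without running __init__
    obj.__dict__.update(d["data"])  # restore the saved fields
    return obj


foo = Foo()
foo.a = 10
restored = from_dict(to_dict(foo))
print(type(restored).__name__, restored.a, restored.b)
```

Bypassing `__init__` in `from_dict` is one possible design choice here: the fields come from the serialized data rather than from the constructor defaults. This only works for objects whose state lives in `__dict__`, which is part of why the automatic approach does not always succeed.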
Do you think an approach like this could work in Java? It's possible that the natural approach in Java differs here, but something like this might work.
For primitive types and maybe some simple objects like arrays/lists/tuples, it may make sense to use the same format that we use in the Python to Arrow code (so that some object serialized from Python may be possible to deserialize from Java).
cc @pcmoritz
cc @wesm @jacques-n
Hi @robertnishihara, thanks for your response!
Yes, it makes sense to call Arrow in a similar way from both Java and Python, which will make the caller code in Ray (and potentially others) more elegant. But to me it sounds like an interface refinement that would be better placed in Arrow. On our side, we may want to prioritize Arrow adoption in Ray at this point. Actually, the current Java serialization interfaces (Serializer.encode/Serializer.decode) are similar to pyarrow.serialize and pyarrow.deserialize, and I will follow that style here. The interface I mentioned in #2 will be used for the callback functions, as in Python. We need a way for each class to implement its own custom logic, and an interface class should be a reasonable option. For primitive types, yes, they should have the same format as Python, if Arrow does so. :)
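As a rough illustration of the interface-class idea, here is a sketch written in Python to match the earlier example in this thread; in Java the same shape would be an interface that each custom class implements. All names (`ArrowSerializable`, `to_fields`, `from_fields`, `encode`, `decode`, `Point`) are hypothetical, and the Arrow call itself is omitted: `encode` only produces the plain structure that would then be passed to Arrow.

```python
from abc import ABC, abstractmethod


class ArrowSerializable(ABC):
    """Hypothetical interface: each custom class supplies its own callbacks."""

    @abstractmethod
    def to_fields(self) -> dict:
        """Return the object's state as a dict of simple values."""

    @classmethod
    @abstractmethod
    def from_fields(cls, fields: dict):
        """Rebuild an instance from the dict produced by to_fields."""


class Point(ArrowSerializable):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def to_fields(self):
        return {"x": self.x, "y": self.y}

    @classmethod
    def from_fields(cls, fields):
        return cls(fields["x"], fields["y"])


def encode(obj):
    """Pass primitives through unchanged; tag custom objects with their type name."""
    if isinstance(obj, (int, float, str, bool, type(None))):
        return obj
    if isinstance(obj, ArrowSerializable):
        return {"_type_": type(obj).__name__, "data": obj.to_fields()}
    raise TypeError(f"cannot encode {type(obj).__name__}")


def decode(value, classes):
    """Inverse of encode; `classes` maps type names to known classes."""
    if isinstance(value, dict) and "_type_" in value:
        return classes[value["_type_"]].from_fields(value["data"])
    return value


encoded = encode(Point(1, 2))
decoded = decode(encoded, {"Point": Point})
print(encoded, decoded.x, decoded.y)
```

Compared with the automatic dictionary conversion used in Python, this puts the burden on each class author but makes the serialized fields explicit, which may be a better fit for Java.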
Automatically closing stale issue. Please re-open if still relevant.