Ray: [RFC] Improve Java API

Created on 31 May 2018 · 41Comments · Source: ray-project/ray

Hi there. This is Hao Chen from Ant Financial.

Recently we had some internal discussions on improving the Java API. Before making any change, we'd like to open a wider discussion here on the possible options.

If you're interested in using Ray on Java, your feedbacks are very much valuable. Thanks in advance!

Option 1: `Ray.call` (Current implementation)

Let's say we have the following method

class Echo {
    public static int echo(int x) { return x; }
}

To invoke this method remotely on ray, we can do

RayObject<Integer> res = Ray.call(Echo::echo, 1);
System.out.println(res.get());

Problems

Compared with Python API (echo.remote(1)), this way is less natural, and is somewhat against Java convention.
Its internal implementation is a little bit hacky. (The TLDR explanation is that the implementation uses byte-code manipulation to change function's behavior at runtime. I'm not gonna to go too deeply because I'd like to make this discussion more focused on users' perspective.)

A Potential solution

Java comes with a Annotation Processor API that allows you to generate new source files at compile time.
Note:

it can only generate new files, cannot modify existing files (*).
This is done by Java compiler, doesn't need any support from editors or IDEs.

Option 2: `EchoRemote`

If you annotate the original class with @RayRemote, the annotation processor will generate a remote version:

class EchoRemote {
    public static RayObject<Integer> echo(int x) { ... }
    public static RayObject<Integer> echo(RayObject<Integer> x) { ... }
}

Then you'll use it like this:

RayObject<Integer> res = EchoRemote.echo(1);
System.out.println(res);

Here's a prototype of this idea: https://github.com/raulchen/ray_annotation_test
Original file: https://github.com/raulchen/ray_annotation_test/blob/master/annotation_test/src/main/java/test/Adder.java
Call-site: https://github.com/raulchen/ray_annotation_test/blob/master/annotation_test/src/main/java/Main.java

Pros:

API looks more natural and elegant.
Implementation is cleaner, safer and easier to debug.

Con

Some of us don't like the fact that the class name is different.

Option 3: `x.y.z.remote.Echo`

One way to avoid changing class name would be moving the new class to a different package. Let's say the original package name is x.y.z, the new package will be x.y.z.remote. However,

If you want to use both classes in the same file, you'll need to write full names, because Java doesn't support name alias.
Although the names are the same, but the classes are different. It could cause a bug if one wants to compare the classes.

Option N

We'd like to hear your voice if you have a better solution :)

(*): Technically, modifying existing code is possible, but it needs to use internal API, requires considerable amount of work, and has some effects (related discussion can be found at https://stackoverflow.com/questions/3852091/is-it-safe-to-use-project-lombok). Comparing the benefits and drawbacks, we don't want to list it as an option here.

java question

Source

raulchen

Most helpful comment

@raulchen I've added mine. It sounds like there is no perfect choice right now, but I assume it will be possible to iterate on the decision if we come up with better ideas.

robertnishihara on 8 Jun 2018

👍2

All 41 comments

cc @robertnishihara @imzhenyu @eric-jj

raulchen on 31 May 2018

Both option 2 and option 3 are clearly better than what is currently there, I prefer option 2 because it seems more convenient for the user (not having to deal with the naming clashes). If the generation of new names is documented, it should be fine, but I'm curious what others think about this.

I'm super glad that we can get rid of the bytecode rewrite :)

pcmoritz on 31 May 2018

Option 2 looks pretty good. If I understand correctly, it's generating a thin wrapper class that delegates to the original right, so there shouldn't be any class name issues.

I assume actors would work the same way?

ericl on 31 May 2018

Does option 2 generate 2^N overloaded functions for each method, where N is the number of arguments?

robertnishihara on 31 May 2018

I prefer option 2 as well. Even if the class name is different, the rule change is natural: ClassName -> RemoteClassName, which I don't think is bad.

One important aspect to consider in all these cases is the easy of debugging. I guess option 2 scores pretty well here as well, as it requires no support from IDE and the user will very quickly figure out what the annotation is doing over the hood.

Also, it would be great to give an example for actors as well.

istoica on 1 Jun 2018

Thanks for the feedbacks :)

@ericl @istoica:
The generated class will be a very thin wrapper that simply calls another submitTask function.

Actors should work very similarly. My current thought is to have another @RayActor annotation, which generates only non-static methods, whereas @RayRemote only generates static methods.

@robertnishihara:
Unfortunately, yes. I didn't think of a better solution for this issue in Java.
(Actually, in terms of this point, option 2 is worse than option 1, because it generates 2^N methods for each class.)

To mitigate this issue, my plan is to leave the choice to the users. The annotation will come with an option that let the user choose 1) generate 2^N versions, or 2) only generate RayObject parameters. If the users choose 2), they'll need to wrap their parameters before calling the remote functions. E.g., Echo.echo(RayObject.of(1)). This is not ideal, but looks acceptable to me.

raulchen on 1 Jun 2018

@raulchen Regarding the option to let user choose whether to generate all 2^N call versions maybe not do it initially? Let's first see how bad it is, what is the typical number of arguments people are passing. Also, if you add the option, I think the default should be generating the 2^N versions.

Using @RayActor instead of @RayRemote would make the API a bit inconsistent with the Python one. Anyway, please provide a simple example for classes/actors as well. You could use the counter example we are using in Python to illustrate actors, e.g.,

class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()
c.inc()

becoming

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id1 = c.inc.remote ()
id2 = c.inc.remote ()
id3 = c.inc.remote ()
ray.get([id1, id2, id3])

istoica on 1 Jun 2018

@istoica Sure. Let's say the original class is

class Counter {
    private int value;
    public int incr(int delta) {
        value += delta;
        return value;
    }
}

If user annotate it with @RayActor, the following class will be generated:

class CounterActor {
    public static CounterActor create() {
        // create a remote actor and return a local actor handler
    }
    public RayObject<Integer> incr(int delta) {
        // put parameters and submit task
    }
    public RayObject<Integer> incr(RayObject<Integer> delta) { // same here }
}

Then user can use this like this:

CounterActor c = CounterActor.create();
RayObject<Integer> res = c.incr(1);
res.get();

PS1: the reason why there're 2 different annotations is because both annotations decorate classes (Java doesn't have standalone functions).
PS2: it makes sense to use 2^N as default. It's also fine to not provide the option first.

raulchen on 1 Jun 2018

Thanks @raulchen. I think that give all constraints this is a fine solution.

We should also come up with the API for C++ just to make sure we minimize the inconsistencies across Python, Java, and C++ APIs.

istoica on 1 Jun 2018

Thanks @raulchen. Question, why RayObject.of(1) instead of Ray.put(1)?

robertnishihara on 1 Jun 2018

@robertnishihara
Ray.put is also fine here, but may not be the best API for parameters.
Because Pay.put looks more like putting the object into the global object store. But here, this 1 doesn't have to be in the global object store, because it can be passed by value.

Here's my proposal about the RayObject API:

RayObject is just an abstract interface that has a get method:

public interface RayObject <T> {
    public T get();
}

It has 2 sub-classes:
1) RayGlobalObject represents an object that is in the global object store. It has an UniqueID, and may or may not be locally available.
2) RayLocalObject is for objects that will only be needed locally, for example, parameters.

RayGlobalObject can be attained via Ray.put, RayLocalObject can be attained via Ray.param. (Ray.param is the same as the above RayObject.of, but this name should be better because of consistency with Ray.put)

Thus, users can use either Ray.put or Ray.param, depending on whether they want to put the object in global object store or not.

How does this sound to you?

Update:
1) The return value of a remote function is also a RayGlobalObject, but we can declare it as RayObject for now in case we'll have return-by-value in the future;
2) Ray.param can also return a RayGlobalObject if the object is too large to be passed by value;
2) We can also name them RayRefObject/RayValueObject, (I have slight preference for RayGlobalObject/RayLocalObject);

raulchen on 2 Jun 2018

Nice discussion, would like to provide more context for making the best decision here.

Option 1:

currently, we have two ways to run the java remote functions: remote lambda and binary rewrite. The former is safer but more costly in terms of runtime overhead, and it is used during development. The latter is more complicated but much more performant, and it is used in cluster when people submit jobs onto the cluster.
debugging is now supported in Java in one single process, without C/C++ components as well as binary rewrite.

Option 2 is good in terms of removing binary rewrite and being consistent with what we have now in Python. However, it has a major drawback that it introduces some 'from-no-where' types that does not preserve the type completeness during development (i.e., an IDE plugin may be confused and fire errors as the remote types are not found anywhere).

Option 3 is better but as discussed, it may cause potential bugs at runtime (esp. when we have third party libraries that depend on the same set of classes).

imzhenyu on 2 Jun 2018

@imzhenyu good points. A few questions:

Regarding the code rewrite vs remote lambda (Option 1), do we have any numbers about the overhead incurred by remote lambda? BTW, I assume that the lambda function is JITed only the first time is submitted, right? If yes, only the first submission will be really slow. Also, in the case of actors, I believe the class will be JITed at the creation time and all subsequent method submissions will incur no over overhead, no?
When you are saying that the debugging is supported in one single process, do you mean that you can run the entire application in a single process where all remote tasks and classes are running in the same process? Also, in this case, does each remote task/class run in a different thread?
Regarding the IDE getting confused and firing errors as remote types. Since Annotation Processor API is now part of Java, doesn't any modern Java IDE handle the code generated by it?

Thanks.

istoica on 2 Jun 2018

Thanks @imzhenyu I agree that it's undesirable to have a type come out of nowhere.

@raulchen, if we just generate all 2^N function signatures for now then we don't have to introduce the concept of RayLocalObjects, right? And we can just use RayGlobalObjects.

The distinction makes some sense, but I think it'd be better to not introduce the concept unless it is definitely necessary.

robertnishihara on 3 Jun 2018

@istoica @imzhenyu
Regarding IDE support, Idea needs only a couple of clicks when importing the project. Afterwards, everything is automatic. I didn't test Eclipse, but I'd assume it should have no issue. Because the work is mostly done by Java compiler, not IDE.

@robertnishihara
That's right. If we decide to always generate all 2^N signatures, RayLocalObjects aren't necessary. We can start without it. Refactor should be easy in the future.

raulchen on 3 Jun 2018

@istoica @raulchen

performance between remote lambda vs binary rewrite. We don't have a performance comparison between remote lambda and binary rewrite. But you may look here. A one time cost of remote lambda may not be enough, as we don't find a way to get a key of the lambda function so that we can cache and fetch the lambda.
in SINGLE_PRORCESS mode (provided to our application developers so they don't bother non-java libraries or processes or envs), the worker runs purely in Java. We do have a version in multiple threads, but right now we have a simpler version in one single thread due to simpler implementation.
IDE compatability about annotation processor. I don't really test it myself, and according to @raulchen, there won't be much problem, as you expected. However, I may still not feel very comfortable about RemoteFoo because this is against my (as a programmer) type completeness intuition during programming (esp. when the whole thing is transparent). Instead, I would prefer to have an explicit process to generate the initial version of RemoteFoo and then the compiler helps maintain the consistence between Foo and RemoteFoo. That's said, I'm not a typical Java programmer, so this really depends on the community's taste.

imzhenyu on 4 Jun 2018

@imzhenyu about naming, you're saying you would prefer something like RemoteFoo = Ray.remote(Foo)? I agree that would be more intuitive than generating classes out of no where. I also am not sure what is intuitive for Java programmers.

robertnishihara on 4 Jun 2018

Btw, another consideration here is how proposals would work with the broader JVM ecosystem. E.g., which proposals would work with Scala out of the box and which wouldn't.

ericl on 4 Jun 2018

👍2

@ericl that's a really good point!

robertnishihara on 4 Jun 2018

All, can we capture this discussion in a Google doc? The thread is becoming a bit too long to follow, e.g., the pros and cons of various solutions as spread across multiple posts. Also, what is the process to reach a decision? Consensus? Or maybe @raulchen should take all the feedback in and make the decision since he is going to make the changes? Thanks

istoica on 5 Jun 2018

@robertnishihara Yes, my preference is class.remote(...) | class.method.remote(...) > ray.call > remote lambda(adopted in other projects such as Apache Ignite) > Remote##class.method(...) > class##Remote.method(...), with the rational discussed above.

@istoica Yes, Google doc makes it easier to list all the pros and cons. Since @raulchen initiates this discussion, maybe @raulchen you can do that? :) Regarding to the consensus process, I believe @raulchen is doing great work, by posting the options with pros/cons out for broader feedback, and see whether we can get better ideas of how to address the problems better. I suggest that maybe we can come up a POC, and do some experiments against real code (e.g., changing the code in RayaG project in Ant) for comparison before we can really reach a consensus.

imzhenyu on 5 Jun 2018

@imzhenyu having a POC is a great idea.
@raulchen can you put together a short PRD in google docs? Maybe let's plan to make a decision by the end of the week? And, as @imzhenyu mentioned, if we cannot narrow to one choice maybe we can use the POC to try a few alternatives.

Thanks. This is very exciting.

istoica on 5 Jun 2018

@istoica @imzhenyu sure, I'll put together a google doc later today.

raulchen on 5 Jun 2018

Just summarized our previous discussions at https://docs.google.com/document/d/1iv3PLPNHcDKs45MZiAg4N0WNUOubxSTyLjMJVsxVZfg/edit?usp=sharing
Feel free to add what I was missing.
Hopefully we can reach a decision based on this doc. :)

@ericl, support for other JVM languages is a very good point. I also added my thoughts about Scala at the end of the doc.

raulchen on 5 Jun 2018

Thanks @raulchen. I added my preference at the end. It would be great for everyone to add their preferences as well.

istoica on 5 Jun 2018

Just added mine:)

imzhenyu on 6 Jun 2018

I added another potential possibility to the doc, using interfaces & proxies. Unfortunately I'm not familiar enough with them myself to have a great vision of what the API would look like, but they seem like they could be the most idiomatic solution in Java (which relies heavily on interfaces).

mehrdadn on 6 Jun 2018

@mehrdadn thanks for your feedback.
We tried dynamic proxy as well. But it didn't work very well, because:
1) the function signature of the remote class is different than the original class (one returns RayObject, while another returns T);
2) Dynamic proxy doesn't work with static methods;
3) as you've mentioned, users have to define an interface, which is inconvenient especially when you want to migrate a existing project to ray.

raulchen on 6 Jun 2018

@raulchen Ah yeah, you're right... now that I'm looking more carefully at how proxies work in Java it seems like their entirely goal is to keep the signatures matching which doesn't help. :\

I think I'm also a fan of Option 2 then. I guess a more idiomatic Java version of it would generate an interface for the user rather than directly giving back a class, but I'm honestly neither sure if it can get much more idiomatic than this, nor what exactly that might buy anyone.

For addressing the 2^N issue though, could you do something like this and figure it out at call-time?

public static RayObject<Integer> echo(Object... args) { ... }

The performance impact should be acceptable I think?

mehrdadn on 6 Jun 2018

@mehrdadn regarding your suggestion of using Object as arguments, I'm not concerned about perf, but I think it's not ideal to sacrifice type safety.

raulchen on 6 Jun 2018

👍1

@raulchen I've added mine. It sounds like there is no perfect choice right now, but I assume it will be possible to iterate on the decision if we come up with better ideas.

robertnishihara on 8 Jun 2018

👍2

@robertnishihara yes, and in fact I guess we can converge soon as some people in Ant have had some new findings about remote lambdas which may fix the performance issue. Besides, we also have some new requirements (getting annotation information efficiently) which will change the reasoning a little bit that can result a consensus easier. Will have the feedback here once there are some more in-depth results.

imzhenyu on 9 Jun 2018

@imzhenyu interesting, please share the performance numbers for remote lambdas when they are available! What was the issue?

robertnishihara on 10 Jun 2018

First, Lambda serialization is actually cacheable.
Java compiles each appearance of a method reference to one single instance.
It means, if we do

Ray.call(Echo::echo);
Ray.call(Echo::echo);

Ray.call will get 2 different instances.
If we do

for (int i = 0; i < 100; i++) 
    Ray.call(Echo::echo);

Ray.call will always get the same instance.

Thus, to cache serialization, we can simply use the instance as the key. To cache deserialization, we can just use the serialized binary as the key.

I did a benchmark on my Macbook Pro.

+----------+---------------+-----------------+
|          | serialization | deseriliazation |
+----------+---------------+-----------------+
| wo cache | ~0.1ms        | ~0.3ms          |
+----------+---------------+-----------------+
| w/ cache | ~0.001ms      | ~0.001ms        |
+----------+---------------+-----------------+

And the serialized binary is ~600 bytes.

Conclusion:

With cache, the perf of remote lambda is totally okay.
If we remove binary rewriting and only use lambda, Java worker code can be significantly sanitized. Then I think Option 1 is also okay to me.

Note: the approach @imzhenyu mentioned above (proposed by @mylinyuzhi) is actually a method to get the info of the original method. Considering the perf of cache is already very good, I think this is not very necessary to use it for transferring functions. But it can be a good approach to get meta data of annotations at runtime.

raulchen on 10 Jun 2018

Thanks @raulchen, that sounds good! Is this something that would only be used with Solution 1? Or would it also be used in conjunction with solution 2?

robertnishihara on 10 Jun 2018

@robertnishihara yep, it will only be used for solution 1.

raulchen on 11 Jun 2018

@mylinyyuzhi just implemented the above-mentioned solution in #2245.
This PR removes binary rewrite, and significantly simplified Java code base (without sacrificing perf, see commit message).
I think, then, Option 1 should be an okay solution for everybody.
I'll close this issue once #2245 is merged.

raulchen on 13 Jun 2018

Is this the Ray.call() solution?

This solution is still hard to read as Ray.call() doesn't match the syntax of the class declaration like solution 2 does, but still ok.

It would be great to ask the opinion of other Java developers outside Ant regarding solution 1 vs 2.

istoica on 13 Jun 2018

@istoica
yes, that is still Ray.call.
I share the opinion with you that option 2 looks more idiomatic. But option 1 seems to be the only one that everybody would be happy to accept.
Also, I'll keep this issue open to see if we'll hear more voices outside the current group.

raulchen on 13 Jun 2018

Well, my main point was about process. In general with APIs we should get feedback from a larger poll of developers because they are the ones we are building the APIs for.

Also, I know that everyone is familiar with this piece about "How to design a good API and why it matters?" but I always find it useful to re-read it ;-): https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/32713.pdf

(And, in general, the best solution is not the one which everyone is least happy about; in general, the best solution is the one that excel in a particular dimension.)

Anyway, let's continue to monitor this and get feedback from other developers.

istoica on 13 Jun 2018

👍1

Closing because I think this was done in #2245.

robertnishihara on 6 Aug 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[xray] Crash during object transfers

ericl · 3Comments

rllib: Using gym.RewardWrapper around MultiAgentEnv cause observation mismatch with observation_space

0luhancheng0 · 3Comments

No module named mycustom.py on other node

zhaokang1228 · 3Comments

Bazel build error for boost when building from source on ARM(aarch64)

heavyinfo · 3Comments

Trouble installing in virtual env.

robertnishihara · 3Comments

Ray: [RFC] Improve Java API

Option 1: Ray.call (Current implementation)

Problems

A Potential solution

Option 2: EchoRemote

Pros:

Con

Option 3: x.y.z.remote.Echo

Option N

Most helpful comment

All 41 comments

Conclusion:

Related issues

Option 1: `Ray.call` (Current implementation)

Option 2: `EchoRemote`

Option 3: `x.y.z.remote.Echo`