Ray is a community-driven project. We love to learn about existing use cases and how we can help make Ray more useful. To that end, we would like to build a community-maintained list of project suggestions that can help future contributors decide what to work on. To kick off the discussion, here is a preliminary list.
If you are interested in working on any of these or have more suggestions for projects, please comment on this thread, open issues or post to our mailing list!
Please also check out https://github.com/ray-project/ray/issues
These are suggestions to improve Ray's distributed execution engine.
Improve task submission overhead: Profile the current task submission overhead, identify bottlenecks, and speed it up. This could, for example, be done by batching task submissions (see the timing sketch after this list).
Fuzzing for Ray: Automatically uncover bugs and race conditions in Ray using fuzzing.
Code coverage: Track the code coverage of Ray tests (and improve it).
Simplified microservices: Make it very easy to develop and deploy microservices with Ray (automatically creating REST/gRPC interfaces for actors, making it possible to support dockerized actors).
C++ frontend: We already have a frontend for Python and Java. This project entails adding a frontend for C++, which could be useful for certain performance-critical applications like allreduce.
Distributed GC: An alternative design for object eviction. Instead of doing LRU eviction on a per-node basis, implement a more global policy that tracks object usage and frees objects if they are not needed any more.
Actor migration: Make it possible to transfer actors from one node to another. This will enable preemption of nodes with actors.
Operator or task fusion: Speed up the execution of Ray programs by automatically fusing together operators or tasks, e.g. for streaming.
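For the task submission overhead item above, a rough micro-benchmark along these lines could be a starting point (a sketch only; the no-op task and the 10,000-task count are arbitrary choices, not an established benchmark):

```python
import time
import ray

ray.init()

@ray.remote
def noop():
    pass

n = 10_000
start = time.time()
refs = [noop.remote() for _ in range(n)]  # measure pure submission cost
submit_s = time.time() - start
ray.get(refs)                             # then wait for the tasks to finish
total_s = time.time() - start

print(f"submission: {submit_s / n * 1e6:.1f} us/task, "
      f"end-to-end: {total_s / n * 1e6:.1f} us/task")
```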
These are projects related to libraries.
Data and model parallel training: Develop a library to do data and/or model parallel training on Ray.
Tune:
RLlib:
Modin:
Streaming: Implement more operators and improve the performance of https://github.com/ray-project/ray/pull/4126.
This is a list of interesting applications that can be developed on top of Ray. They can serve as examples for others on how to use Ray or evolve into libraries in the future.
Web-crawling: Use Ray to extract information from the web (e.g. a search index) by crawling web-pages.
Training a language model: Extract training data from the web and train a language model with it, see also https://openai.com/blog/better-language-models/.
These are projects that will make it easier to do development with Ray.
Integration with Debuggers: Make it easy to do remote debugging of actors and tasks in Ray.
Integration with IDEs: E.g. write a plug-in for VSCode that integrates with the graphical debugger, shows the task timeline, or lets users easily start/stop clusters and update their code.
@pcmoritz This is a great idea! It might be good to include a project for supporting TF 2.0 (related #4134)
Thanks, added it!
Thanks for summarizing and posting this!
I'd like to share some experiences and work from Ant.
Improve task submission overhead:
Big +1 for this. We're also planning on profiling and improving Ray's performance. One thing we already did is perf metrics; #4246 is the first PR, and other PRs should follow soon. Besides this, there are a lot of other things that need to be built, e.g., distributed tracing, profiling CPU/memory usage, etc.
Code coverage:
I personally did some research before about adopting codecov.io. The amount of work should be fine. Maybe someone from Ant can work on this.
Distributed GC:
Months ago, we discussed Batch GC, and we're now prototyping this idea. Other than Batch GC, do we have a better solution (e.g., automatic distributed GC) at this moment?
Other than the above, I think a "custom task/actor scheduling policy" would also be very useful. E.g., the streaming system needs this feature to colocate actors.
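For context on the co-location point, one workaround available today is custom resources: a node can be started with a made-up resource (the name "shared_host" below is purely illustrative), and actors that request it will all be scheduled there. This is only a sketch of the existing mechanism, not the custom scheduling policy being proposed:

```python
import ray

# Assumes one node in the cluster was started with e.g.:
#   ray start --resources='{"shared_host": 2}'
ray.init()

@ray.remote(resources={"shared_host": 1})
class StreamOperator:
    def ping(self):
        return "ok"

# Both actors request one unit of "shared_host", so they land on the same node.
a = StreamOperator.remote()
b = StreamOperator.remote()
print(ray.get([a.ping.remote(), b.ping.remote()]))
```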
Cross-pollinate project with tensorflow/agents.
Hi all, tensorflow/agents is "A library for Reinforcement Learning in TensorFlow." It's in active development and also compatible with TF 2.0.
I believe that cooperation between the two projects would bear a lot of fruit. Thoughts?
Make actor methods support async functions to improve concurrency.
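For reference, current Ray releases support exactly this: actor methods defined with `async def` run concurrently inside the actor's asyncio event loop. A minimal sketch:

```python
import asyncio
import ray

ray.init()

@ray.remote
class AsyncActor:
    async def fetch(self, i):
        # Concurrent calls overlap on this sleep instead of running serially.
        await asyncio.sleep(1)
        return i

actor = AsyncActor.remote()
# All four calls finish in roughly one second rather than four.
print(ray.get([actor.fetch.remote(i) for i in range(4)]))
```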
Visualizations: All metrics are saved as scalars under the same tab in TensorBoard, and that's it (no histograms, no graphs, no HParams for Tune). It would be nice to add more visualizations, for example:
passing graph=tf.get_default_graph() when instantiating the FileWriter here. Would a PR that adds the Beholder TensorBoard plugin be of any interest?
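Concretely, the suggestion above amounts to something like the following (TF 1.x API; the log directory is an arbitrary example):

```python
import tensorflow as tf

# Passing the default graph when creating the FileWriter populates the
# Graphs tab in TensorBoard alongside the scalar summaries.
writer = tf.summary.FileWriter(logdir="./tb_logs", graph=tf.get_default_graph())
writer.flush()
```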
Would it be possible to add a progress bar page like Spark's? This would make it easy to track the status of any job that has been deployed on the Ray cluster.
NEAT / HyperNEAT algorithms might benefit from Ray scaling
Not sure why you need "Simplified microservices" - seems like a lot of extra work for little payoff. There are already great languages like Erlang/Elixir for that.
I would be really interested to see a more straightforward distributed-RL setup - maybe a more standardized k8s approach (currently worker/head nodes have to be set up manually, which includes a lot of library setup to make sure that all software is there).
@drozzy I personally use a microservice for better object permanence and to keep the ray code separate from the rest of the codebase. It allows big projects to be worked on without creating a massive monolith (separate repos + containers = godsend).
Also, question. What's the status of putting a transformer inside of the model? I found something on Github that seems to be a meta-learner for RLLib, maybe you can take what they did?
@kivo360 what do you mean? can you share a link?
@richardliaw my bad, proofreading error. "Someone on GitHub has a meta-learner RL model. Maybe we can take what they created and turn it into a default."
@drozzy Maybe Ray could provide more flexibility than a normal microservice framework: since Ray supports fine-grained tasks, it will be able to do function-level scaling, and you can even write a whole distributed application as one Ray project (orchestrating a bunch of components across different nodes). Personally, I think it would be great to have this feature.
Are we going to support kubeflow?
I second supporting Kubeflow as well. Kubernetes is the most popular cluster management system, and leveraging Kubeflow + Kubernetes would make it easy for folks to use their existing clusters to run Ray.
I would like to propose supporting self-play algorithms like AlphaGo, AlphaZero, or MuZero. The following article provides pseudo-code for a MuZero implementation.
@kuonangzhe @anooprh can you say more about what the ideal integration/API would look like? Thanks!
I can't be the only one who would find this useful (or maybe I'm just too unfamiliar with Ray to know how to accomplish the same thing), but I'd simply like the ability to "disable" Ray.
A lot of times when I'm debugging I end up removing the decorator and calling the method I'm debugging directly, which also requires changing how function parameters are handled and the output from the function calls (e.g., can't use ray.get() anymore).
It would be nice if I could use a config option to essentially tell Ray to not do any of the fancy stuff and just basically do normal synchronous processing (i.e., implement a passthrough mechanism).
A potential use case for this, feasibility unknown, would be facilitating usage on Windows. In theory, you could have a Windows build that implements this passthrough mechanism so that Windows users can at least run the same code even if they don't get the benefits. I presume this would be easier to implement than implementing the full functionality.
I'm getting a buddy that runs Windows to help me on a project that uses Ray. He doesn't actually need the benefits of Ray to do his thing, but it would be great if he could simply run the code as is.
I think one way of achieving this is via ray.init(num_cpus=1). The other way of achieving this should be ray.init(local_mode=True), though I think there are a few small known bugs with that option.
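For reference, a minimal sketch of the local_mode route (the DEBUG flag is just an illustrative toggle):

```python
import ray

DEBUG = True
# local_mode runs tasks synchronously in the driver process instead of on
# remote workers, so the same code works with or without the "fancy stuff".
ray.init(local_mode=DEBUG)

@ray.remote
def f(x):
    return x + 1

print(ray.get(f.remote(1)))  # 2
```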
It looks like local_mode is it!
So we can update my request to basically be: would it be possible to get a Windows build that implements local_mode quicker than a Windows build that does everything?
Hi all,
First of all, thank you for this great framework. I was wondering if there is a dask.distributed.Client-equivalent in Ray.
The reason for asking about this is the following scenario: imagine that you have a supercomputer with two different kinds of nodes (CPU-based and GPU-based), and the administrators have set up two separate Ray servers, one for each kind of node.
I am developing an astrophysics code which evaluates a series of galaxy models simultaneously. The software provides users with the option to choose what kind of hardware they want to evaluate each of their models on (CPU, GPU, etc.). Imagine now that a user wants to simultaneously evaluate some models on the CPU and some on the GPU, in an environment like the one described above. I would like to be able to connect to two different Ray servers and perform my calculations simultaneously.
To my limited knowledge, Ray doesn't support this because the connection to the server is a global state in the framework. Is that true? Are there any plans to support simultaneous connections to different Ray servers?
Thank you for your time.
@bek0s, I see, so the thing you want to do is to have an application that submits different tasks to different Ray clusters, right?
There are two parts to this.
API calls like ray.put() and f.remote() don't specify a cluster. It's certainly possible to implement something like this. Right now, the preferred way to do this in Ray is to have a single cluster with both CPUs and GPUs and to specify in the application whether each task should use CPUs or GPUs.
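To illustrate the single-cluster pattern described above (a sketch; the model-evaluation functions are placeholders):

```python
import ray

ray.init()  # one cluster that contains both CPU and GPU nodes

@ray.remote(num_cpus=1)
def evaluate_on_cpu(model):
    return f"{model} evaluated on CPU"

@ray.remote(num_gpus=1)
def evaluate_on_gpu(model):
    return f"{model} evaluated on GPU"

# The scheduler places each task on a node with the requested resources.
print(ray.get([evaluate_on_cpu.remote("model_a"),
               evaluate_on_gpu.remote("model_b")]))
```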
Hi @robertnishihara,
I really appreciate the quick response. It is good to know that the features I would like to see are not impossible to implement due to some fundamental limitation of Ray. Indeed, my use case is quite unusual, and I think the current Ray API design should suffice for most cases. Nevertheless, any future developments related to the above-mentioned features will be more than welcome! :)
Thanks again!
One critical thing missing from Ray versus multiprocessing is queues and pools: just a simple API to set up an endless loop like this:
Envs (Pool) -> Observations (Queue) -> Agents (Pool) -> Actions (Queue) forever
This Pool, Queue, Pool, Queue motif takes no time in multiprocessing, but it's unclear how to do it in Ray and it often just hangs with no error messages or anything. That's bread-and-butter basic stuff for a distributed systems framework, but it's not a stable, reliable experience for Ray users. Just imagine a Kanban board - it's really an async pipe of pools and queues.
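To make the motif concrete, here is one way it might be expressed with plain Ray actors today (a hypothetical sketch; the Queue actor below is hand-rolled, not a Ray API):

```python
import ray

ray.init()

@ray.remote
class Queue:
    """A minimal hand-rolled queue actor (illustrative only)."""
    def __init__(self):
        self.items = []

    def put(self, item):
        self.items.append(item)

    def get(self):
        return self.items.pop(0) if self.items else None

@ray.remote
def env_step(obs_queue, i):
    # "Envs (Pool)": produce an observation and push it onto the queue.
    ray.get(obs_queue.put.remote({"env": i, "obs": i * 2}))

@ray.remote
def agent_step(obs_queue, act_queue):
    # "Agents (Pool)": consume an observation and push back an action.
    obs = ray.get(obs_queue.get.remote())
    if obs is not None:
        ray.get(act_queue.put.remote({"env": obs["env"], "action": obs["obs"] + 1}))

obs_queue, act_queue = Queue.remote(), Queue.remote()
ray.get([env_step.remote(obs_queue, i) for i in range(4)])
ray.get([agent_step.remote(obs_queue, act_queue) for _ in range(4)])
print(ray.get(act_queue.get.remote()))
```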
Even just making logs requires stack overflow to find some function to build a logger on all the workers.
Most of the intro to ray docs are oversimplified to the point they aren't useful; for example, the functions in the examples take no arguments so it's not immediately clear to a new user that you're supposed to do x.y.remote(ARGS)
Also, this library is huge and complicated, and the dependencies are huge and complicated, to the point that I'm concerned about adopting Ray - it's literally a 380,000+ line black-box beast. Not saying it could be done better, but it could definitely be a lot simpler, and that would make maintenance a lot easier. Simplicity is a key benefit of good software; Ray's core API seems simple, but the implementation is complicated and that holds it back.
Thanks a bunch for the feedback @bionicles! BTW a question about dependencies - what would be ideal here? Reducing extraneous dependencies in a slimmed-down core install? Moving away from the monorepo?