Ray is a community-driven project. We love to learn about existing use cases and how we can help make Ray more useful. To that end, we would like to build a community-maintained list of project suggestions that can help future contributors decide what to work on. To kick off the discussion, here is a preliminary list.
If you are interested in working on any of these or have more suggestions for projects, please comment on this thread, open issues or post to our mailing list!
Please also check out https://github.com/ray-project/ray/issues
These are suggestions to improve Ray's distributed execution engine.
Improve task submission overhead: Profile the current task submission overhead, identify bottlenecks, and speed it up. This could, for example, be done by batching task submissions (see the timing sketch after this list).
Fuzzing for Ray: Automatically uncover bugs and race conditions in Ray using fuzzing.
Code coverage: Track the code coverage of Ray tests (and improve it).
Simplified microservices: Make it very easy to develop and deploy microservices with Ray (automatically creating REST/gRPC interfaces for actors, making it possible to support dockerized actors).
C++ frontend: We already have a frontend for Python and Java. This project entails adding a frontend for C++, which could be useful for certain performance-critical applications like allreduce.
Distributed GC: An alternative design for object eviction. Instead of doing LRU eviction on a per-node basis, implement a more global policy that tracks object usage and frees objects if they are not needed any more.
Actor migration: Make it possible to transfer actors from one node to another. This will enable preemption of nodes with actors.
Operator or task fusion: Speed up the execution of Ray programs by automatically fusing together operators or tasks, e.g. for streaming.
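For the task submission overhead item above, a rough micro-benchmark along these lines could be a starting point (a sketch only; the no-op task and the 10,000-task count are arbitrary choices, not an established benchmark):

```python
import time
import ray

ray.init()

@ray.remote
def noop():
    pass

n = 10_000
start = time.time()
refs = [noop.remote() for _ in range(n)]  # measure pure submission cost
submit_s = time.time() - start
ray.get(refs)                             # then wait for the tasks to finish
total_s = time.time() - start

print(f"submission: {submit_s / n * 1e6:.1f} us/task, "
      f"end-to-end: {total_s / n * 1e6:.1f} us/task")
```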
These are projects related to libraries.
Data and model parallel training: Develop a library to do data and/or model parallel training on Ray.
Tune:
RLlib:
Modin:
Streaming: Implement more operators and improve the performance of https://github.com/ray-project/ray/pull/4126.
This is a list of interesting applications that can be developed on top of Ray. They can serve as examples for others on how to use Ray or evolve into libraries in the future.
Web-crawling: Use Ray to extract information from the web (e.g. a search index) by crawling web-pages.
Training a language model: Extract training data from the web and train a language model with it, see also https://openai.com/blog/better-language-models/.
These are projects that will make it easier to do development with Ray.
Integration with Debuggers: Make it easy to do remote debugging of actors and tasks in Ray.
Integration with IDEs: E.g. write a plug-in for VSCode that integrates with the graphical debugger, shows the task timeline, or lets users easily start/stop clusters and update their code.
@pcmoritz This is a great idea! It might be good to include a project for supporting TF 2.0 (related #4134)
Thanks, added it!
Thanks for summarizing and posting this!
I'd like to share some experiences and work from Ant.
Improve task submission overhead:
Big +1 for this. We're also planning on profiling and improving Ray's performance. One thing we already did is perf metrics; #4246 is the first PR, and other PRs should follow soon. Besides this, there are a lot of other things that need to be built, e.g., distributed tracing, profiling CPU/memory usage, etc.
Code coverage:
I personally did some research before about adopting codecov.io. The amount of work should be fine. Maybe someone from Ant can work on this.
Distributed GC:
Months ago, we discussed Batch GC, and we're now prototyping this idea. Other than Batch GC, do we have a better solution (e.g., automatic distributed GC) at this moment?
Other than the above, I think a "custom task/actor scheduling policy" would also be very useful. E.g., the streaming system needs this feature to colocate actors.
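For context on the co-location point, one workaround available today is custom resources: a node can be started with a made-up resource (the name "shared_host" below is purely illustrative), and actors that request it will all be scheduled there. This is only a sketch of the existing mechanism, not the custom scheduling policy being proposed:

```python
import ray

# Assumes one node in the cluster was started with e.g.:
#   ray start --resources='{"shared_host": 2}'
ray.init()

@ray.remote(resources={"shared_host": 1})
class StreamOperator:
    def ping(self):
        return "ok"

# Both actors request one unit of "shared_host", so they land on the same node.
a = StreamOperator.remote()
b = StreamOperator.remote()
print(ray.get([a.ping.remote(), b.ping.remote()]))
```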
Cross-pollinate project with tensorflow/agents.
Hi all, tensorflow/agents is "A library for Reinforcement Learning in TensorFlow." It's in active development and also compatible with TF 2.0.
I believe that cooperation between the two projects would bear a lot of fruit. Thoughts?
Make actor methods support async functions to improve concurrency.
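For reference, current Ray releases support exactly this: actor methods defined with `async def` run concurrently inside the actor's asyncio event loop. A minimal sketch:

```python
import asyncio
import ray

ray.init()

@ray.remote
class AsyncActor:
    async def fetch(self, i):
        # Concurrent calls overlap on this sleep instead of running serially.
        await asyncio.sleep(1)
        return i

actor = AsyncActor.remote()
# All four calls finish in roughly one second rather than four.
print(ray.get([actor.fetch.remote(i) for i in range(4)]))
```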
Visualizations: All metrics are saved as scalars under the same tab in TensorBoard, and that's it (no histograms, no graphs, no HParams for Tune). It would be nice to add more visualizations, for example:
passing graph=tf.get_default_graph() when instantiating the FileWriter here. Would a PR that adds the Beholder TensorBoard plugin be of any interest?
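Concretely, the suggestion above amounts to something like the following (TF 1.x API; the log directory is an arbitrary example):

```python
import tensorflow as tf

# Passing the default graph when creating the FileWriter populates the
# Graphs tab in TensorBoard alongside the scalar summaries.
writer = tf.summary.FileWriter(logdir="./tb_logs", graph=tf.get_default_graph())
writer.flush()
```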
Would it be possible to add a progress bar page like Spark's? This would make it easy to track the status of any job that has been deployed on the Ray cluster.
NEAT / HyperNEAT algorithms might benefit from Ray scaling
Not sure why you need "Simplified microservices" - seems like a lot of extra work for little payoff. There are already great languages like Erlang/Elixir for that.
I would be really interested to see a more straightforward distributed-RL setup - maybe a more standardized k8s approach (currently worker/head nodes have to be set up manually, which includes a lot of library setup to make sure that all software is there).
@drozzy I personally use a microservice for better object permanence and to keep the ray code separate from the rest of the codebase. It allows big projects to be worked on without creating a massive monolith (separate repos + containers = godsend).
Also, question. What's the status of putting a transformer inside of the model? I found something on Github that seems to be a meta-learner for RLLib, maybe you can take what they did?
@kivo360 what do you mean? can you share a link?
@richardliaw my bad, proofreading error. "Someone on GitHub has a meta-learner RL model. Maybe we can take what they created and turn it into a default."
@drozzy Maybe Ray could provide more flexibility than a normal microservice framework: since Ray supports fine-grained tasks, it will be able to do function-level scaling, and you can even write a whole distributed application as one Ray project (orchestrating a bunch of components across different nodes). Personally, I think it would be great to have this feature.
Are we going to support kubeflow?
I second supporting Kubeflow as well. Kubernetes is the most popular cluster management system, and leveraging Kubeflow + Kubernetes would make it easy for folks to use their existing clusters to run Ray.
I would like to propose supporting self-play algorithms like AlphaGo, AlphaZero, or MuZero. The following article provides pseudo-code for a MuZero implementation.
@kuonangzhe @anooprh can you say more about what the ideal integration/API would look like? Thanks!
I can't be the only one who would find this useful (or maybe I'm just too unfamiliar with Ray to know how to accomplish the same thing), but I'd simply like the ability to "disable" Ray.
A lot of times when I'm debugging I end up removing the decorator and calling the method I'm debugging directly, which also requires changing how function parameters are handled and the output from the function calls (e.g., can't use ray.get() anymore).
It would be nice if I could use a config option to essentially tell Ray to not do any of the fancy stuff and just basically do normal synchronous processing (i.e., implement a passthrough mechanism).
A potential use case for this, feasibility unknown, would be facilitating usage on Windows. In theory, you could have a Windows build that implements this passthrough mechanism so that Windows users can at least run the same code even if they don't get the benefits. I presume this would be easier to implement than implementing the full functionality.
I'm getting a buddy that runs Windows to help me on a project that uses Ray. He doesn't actually need the benefits of Ray to do his thing, but it would be great if he could simply run the code as is.
I think one way of achieving this is via ray.init(num_cpus=1). The other way of achieving this should be ray.init(local_mode=True), though I think there are a few small known bugs with that option.
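For reference, a minimal sketch of the local_mode route (the DEBUG flag is just an illustrative toggle):

```python
import ray

DEBUG = True
# local_mode runs tasks synchronously in the driver process instead of on
# remote workers, so the same code works with or without the "fancy stuff".
ray.init(local_mode=DEBUG)

@ray.remote
def f(x):
    return x + 1

print(ray.get(f.remote(1)))  # 2
```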
It looks like local_mode is it!
So we can update my request to basically be: would it be possible to get a Windows build that implements local_mode quicker than a Windows build that does everything?
Hi all,
First of all, thank you for this great framework. I was wondering if there is a dask.distributed.Client-equivalent in Ray.
The reason for asking about this is the following scenario: imagine that you have a supercomputer with two different kinds of nodes (CPU-based and GPU-based), and the administrators have set up two separate Ray servers, one for each kind of node.
I am developing an astrophysics code which evaluates a series of galaxy models simultaneously. The software provides users with the option to choose what kind of hardware they want to evaluate each of their models on (CPU, GPU, etc.). Imagine now that a user wants to simultaneously evaluate some models on the CPU and some on the GPU, in an environment like the one described above. I would like to be able to connect to two different Ray servers and perform my calculations simultaneously.
To my limited knowledge, Ray doesn't support this because the connection to the server is a global state in the framework. Is that true? Are there any plans to support simultaneous connections to different Ray servers?
Thank you for your time.
@bek0s, I see, so the thing you want to do is to have an application that submits different tasks to different Ray clusters, right?
There are two parts to this.
API calls like ray.put() and f.remote() don't specify a cluster. It's certainly possible to implement something like this. Right now, the preferred way to do this in Ray is to have a single cluster with both CPUs and GPUs and to specify in the application whether each task should use CPUs or GPUs.
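To illustrate the single-cluster pattern described above (a sketch; the model-evaluation functions are placeholders):

```python
import ray

ray.init()  # one cluster that contains both CPU and GPU nodes

@ray.remote(num_cpus=1)
def evaluate_on_cpu(model):
    return f"{model} evaluated on CPU"

@ray.remote(num_gpus=1)
def evaluate_on_gpu(model):
    return f"{model} evaluated on GPU"

# The scheduler places each task on a node with the requested resources.
print(ray.get([evaluate_on_cpu.remote("model_a"),
               evaluate_on_gpu.remote("model_b")]))
```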
Hi @robertnishihara,
I really appreciate the quick response. It is good to know that the features I would like to see are not impossible to implement due to some fundamental limitation of Ray. Indeed, my use case is quite unusual, and I think the current Ray API design should suffice for most cases. Nevertheless, any future developments related to the above-mentioned features will be more than welcome! :)
Thanks again!
One critical thing missing from Ray versus multiprocessing is queues and pools: just a simple API to set up an endless loop like this:
Envs (Pool) -> Observations (Queue) -> Agents (Pool) -> Actions (Queue) forever
This Pool, Queue, Pool, Queue motif takes no time in multiprocessing, but it's unclear how to do it in Ray and it often just hangs with no error messages or anything. That's bread-and-butter basic stuff for a distributed systems framework, but it's not a stable, reliable experience for Ray users. Just imagine a Kanban board - it's really an async pipe of pools and queues.
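To make the motif concrete, here is one way it might be expressed with plain Ray actors today (a hypothetical sketch; the Queue actor below is hand-rolled, not a Ray API):

```python
import ray

ray.init()

@ray.remote
class Queue:
    """A minimal hand-rolled queue actor (illustrative only)."""
    def __init__(self):
        self.items = []

    def put(self, item):
        self.items.append(item)

    def get(self):
        return self.items.pop(0) if self.items else None

@ray.remote
def env_step(obs_queue, i):
    # "Envs (Pool)": produce an observation and push it onto the queue.
    ray.get(obs_queue.put.remote({"env": i, "obs": i * 2}))

@ray.remote
def agent_step(obs_queue, act_queue):
    # "Agents (Pool)": consume an observation and push back an action.
    obs = ray.get(obs_queue.get.remote())
    if obs is not None:
        ray.get(act_queue.put.remote({"env": obs["env"], "action": obs["obs"] + 1}))

obs_queue, act_queue = Queue.remote(), Queue.remote()
ray.get([env_step.remote(obs_queue, i) for i in range(4)])
ray.get([agent_step.remote(obs_queue, act_queue) for _ in range(4)])
print(ray.get(act_queue.get.remote()))
```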
Even just making logs requires stack overflow to find some function to build a logger on all the workers.
Most of the intro to ray docs are oversimplified to the point they aren't useful; for example, the functions in the examples take no arguments so it's not immediately clear to a new user that you're supposed to do x.y.remote(ARGS)
Also, this library is huge and complicated, and the dependencies are huge and complicated, to the point that I'm concerned about adopting Ray - it's literally a 380,000+ line black-box beast. Not saying it could be done better, but it could definitely be a lot simpler, and that would make maintenance a lot easier. Simplicity is a key benefit of good software; Ray's core API seems simple, but the implementation is complicated and that holds it back.
Thanks a bunch for the feedback @bionicles! BTW a question about dependencies - what would be ideal here? Reducing extraneous dependencies in a slimmed-down core install? Moving away from the monorepo?