Hi,
I tried Ray yesterday and it's amazing. Thanks for providing such a good framework.
I deployed it on bare-metal machines and found that a cluster has only a single Ray head node. If the machine running the head node components goes down, the whole cluster is lost. How can we deploy a cluster with high availability?
cc @raulchen Please let us know if there's a way to make the head node fault tolerant for now.
The only component that distinguishes the head node from the other nodes is the GCS (Redis on the head node). We are actively working on making this component fault tolerant.
Hi @acmore. As @rkooo567 explained above, GCS fault tolerance is under active development. We expect to finish the basic functionality within this month.
The new GCS implementation depends on external (and customizable) storage for fault tolerance. For now we only support Redis, and we plan to support SQL databases in the future.
So, if you have a reliable and Redis-compatible database, GCS will be fault tolerant by the end of this month. If you want to use MySQL as the backend, it will take more time. If you want to use your own database, you can write your own storage plugin for GCS.
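To illustrate the idea of a pluggable storage backend, here is a minimal Python sketch. It is not Ray's actual GCS storage API: the names `StorageBackend`, `InMemoryBackend`, and `ControlService` are hypothetical, and the in-memory dictionary only stands in for an external store such as Redis. The point it shows is that a service which keeps all of its state in an external backend can be restarted without losing that state.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class StorageBackend(ABC):
    """Hypothetical plugin interface (an assumption, not Ray's real API)."""

    @abstractmethod
    def put(self, table: str, key: str, value: bytes) -> None:
        """Store a value under (table, key), overwriting any old value."""

    @abstractmethod
    def get(self, table: str, key: str) -> Optional[bytes]:
        """Return the stored value, or None if the key is absent."""

    @abstractmethod
    def delete(self, table: str, key: str) -> None:
        """Remove the key if it exists."""


class InMemoryBackend(StorageBackend):
    """Stand-in for an external store such as Redis or a SQL database."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], bytes] = {}

    def put(self, table: str, key: str, value: bytes) -> None:
        self._data[(table, key)] = value

    def get(self, table: str, key: str) -> Optional[bytes]:
        return self._data.get((table, key))

    def delete(self, table: str, key: str) -> None:
        self._data.pop((table, key), None)


class ControlService:
    """Toy GCS-like service that holds no state outside the backend."""

    def __init__(self, backend: StorageBackend) -> None:
        self._backend = backend

    def register_node(self, node_id: str, address: str) -> None:
        self._backend.put("nodes", node_id, address.encode())

    def lookup_node(self, node_id: str) -> Optional[str]:
        raw = self._backend.get("nodes", node_id)
        return raw.decode() if raw is not None else None


backend = InMemoryBackend()
service = ControlService(backend)
service.register_node("node-1", "10.0.0.5:6379")

# Simulate a crash: discard the service and start a new one
# that points at the same external storage.
del service
restarted = ControlService(backend)
print(restarted.lookup_node("node-1"))
```

In this sketch, supporting a different database would mean writing another `StorageBackend` subclass; the service code itself would not change.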
Thank you @raulchen. Looking forward to the new feature.
Hey @raulchen may I know the status of this new feature?
@acmore The code is done, and we are currently testing and fixing bugs. We plan to fully enable the new GCS service in the next release. However, as I mentioned above, if you want to use MySQL or another database as the backend, there is still some work to be done.
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.