Orleans: Orleans cluster reliability and disaster recovery

Created on 13 Dec 2019 · 3 Comments · Source: dotnet/orleans

Hi. I've recently started working with Orleans and would like to clarify a couple of issues:

Is Orleans completely responsible for managing the cluster's health or is it a responsibility of a higher-level infrastructure (e.g. K8s)?

According to the presentation, it's managed automatically. However, I don't quite get how a silo would be restarted if it fails completely. There is no central monitoring process to manage it, is there?

Does it make sense to monitor the health of the grains? I suppose it's not needed as Orleans will automatically deactivate them if they fail and we'll get an error in the logs.

question

Vlad-Stryapko

Most helpful comment

Implement health checks so that k8s/swarm can restart the container. The silo will then automatically join the cluster and start hosting newly activated grains. During node downtime, actors will be spawned on the silos that are still alive.

yevhen on 17 Dec 2019

All 3 comments

yevhen on 17 Dec 2019

> According to the presentation, it's managed automatically.

The Orleans runtime automatically responds to changes in the hosting environment (silos/nodes added or removed) and reconfigures the cluster accordingly. However, it does not try to restart anything, and it cannot really do so.

So you still need a hosting solution that ensures enough nodes are running at any point in time and that restarts them when they fail. Initially, Orleans was built with Azure Cloud Services in mind. Nowadays, k8s seems to be the most popular choice.
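
To illustrate this division of responsibilities, here is a toy model in plain Python. It is not Orleans code: `ToyCluster`, `fail_silo`, `join_silo`, and `reconcile` are hypothetical names, and the placement rule is a simplification picked for the sketch, not how Orleans places grains. The cluster only reacts to membership changes; something outside it has to bring silos back.

```python
class ToyCluster:
    """Toy stand-in for a cluster: tracks live silos, never restarts them."""

    def __init__(self, silos):
        self.live = set(silos)
        self.activations = {}  # grain id -> silo currently hosting it

    def fail_silo(self, silo):
        # Membership shrinks; activations hosted on the dead silo are lost.
        self.live.discard(silo)
        self.activations = {
            g: s for g, s in self.activations.items() if s != silo
        }

    def join_silo(self, silo):
        # A started silo joins and becomes eligible for new activations.
        self.live.add(silo)

    def call(self, grain_id):
        # Activate on demand on a live silo if there is no current activation.
        if grain_id not in self.activations:
            if not self.live:
                raise RuntimeError("no live silos")
            candidates = sorted(self.live)
            index = len(self.activations) % len(candidates)
            self.activations[grain_id] = candidates[index]
        return self.activations[grain_id]


def reconcile(cluster, desired):
    """Stand-in for the hosting layer (for example k8s): restore silo count."""
    started = []
    n = 0
    while len(cluster.live) < desired:
        name = f"replacement-{n}"
        n += 1
        if name not in cluster.live:
            cluster.join_silo(name)
            started.append(name)
    return started
```

In this sketch, calling `fail_silo` never causes a new silo to appear; only `reconcile`, which plays the role of the external orchestrator, restores capacity.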

> Does it make sense to monitor the health of the grains? I suppose it's not needed as Orleans will automatically deactivate them if they fail and we'll get an error in the logs.

Generally speaking, grains never fail. But a silo where a grain is activated might. In that case, the grain gets automatically reactivated on another silo on the next call to it. Because of that, instead of monitoring individual grains, people usually monitor the health of their service as a whole, by executing synthetic transactions against it or by other means.
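
A synthetic transaction check could look roughly like the sketch below. It is written in plain Python and makes no Orleans calls; `run_synthetic_check`, the `transaction` callable, and the one-second latency budget are all assumptions made for illustration. The idea is that `transaction` performs one real end-to-end operation through the service's public entry point and returns a truthy value on success.

```python
import time


def run_synthetic_check(transaction, max_latency_s=1.0, clock=time.monotonic):
    """Run one end-to-end transaction and report whether the service looks healthy."""
    start = clock()
    try:
        # `transaction` is a hypothetical callable supplied by the caller.
        ok = bool(transaction())
        error = None
    except Exception as exc:
        # Any failure of the round trip counts as unhealthy.
        ok = False
        error = repr(exc)
    latency = clock() - start
    # Healthy only if the call succeeded within the latency budget.
    healthy = ok and latency <= max_latency_s
    return {"healthy": healthy, "latency_s": latency, "error": error}
```

Such a check would typically be run on a schedule from outside the cluster, with the result fed into whatever alerting is already in place.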