Orleans: Orleans cluster reliability and disaster recovery

Created on 13 Dec 2019 · 3 Comments · Source: dotnet/orleans

Hi. I've recently started working with Orleans and would like to clarify a couple of issues:

  1. Is Orleans completely responsible for managing the cluster's health or is it a responsibility of a higher-level infrastructure (e.g. K8s)?

According to the presentation, it's managed automatically. However, I don't quite understand how a silo would be restarted if it fails completely. There is no central monitoring process to manage it, is there?

  2. Does it make sense to monitor the health of the grains? I suppose it's not needed as Orleans will automatically deactivate them if they fail and we'll get an error in the logs.

All 3 comments

Implement health checks so that k8s/swarm can restart the container. The silo will then automatically rejoin the cluster and start hosting newly activated grains. While the node is down, actors will be spawned on the silos that are still alive.
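
To make the failover behaviour described above concrete, here is a toy Python model. It is only a sketch and not the Orleans API: `ToyCluster`, its methods, and the random choice of silo are all assumptions made for illustration. It shows a call to a grain landing on a live silo, and landing on a different live silo after the original host fails.

```python
import random


class ToyCluster:
    """Toy model of grain placement; hypothetical, not the Orleans API."""

    def __init__(self, silos, seed=0):
        self.live = set(silos)
        self.directory = {}  # grain id -> silo currently hosting its activation
        self._rng = random.Random(seed)

    def call(self, grain_id):
        """Route a call, activating the grain on a live silo if needed."""
        silo = self.directory.get(grain_id)
        if silo not in self.live:
            if not self.live:
                raise RuntimeError("no live silos")
            # Assumption: pick any live silo at random.
            silo = self._rng.choice(sorted(self.live))
            self.directory[grain_id] = silo
        return silo

    def silo_failed(self, silo):
        """Drop a silo; activations it hosted are lost."""
        self.live.discard(silo)
        self.directory = {g: s for g, s in self.directory.items() if s != silo}

    def silo_joined(self, silo):
        """A restarted silo rejoins and can host new activations."""
        self.live.add(silo)
```

In this model a restarted silo that rejoins does not pull existing activations back; it only becomes a candidate for grains that are activated afterwards.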

According to the presentation, it's managed automatically.

The Orleans runtime automatically responds to changes in the hosting environment (silos/nodes added or removed) and reconfigures the cluster accordingly. However, it does not try to restart anything, and in fact cannot.

So, you still need a hosting solution that would ensure that enough nodes are running at any point in time, and that they get restarted in case of a failure. Initially, Orleans was built with Azure Cloud Services in mind. Nowadays, k8s seems to be the most popular choice.
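
The hosting layer's share of the work can be sketched as a small reconcile step. This is a hypothetical Python model of what an orchestrator does conceptually, not Kubernetes or Orleans code; `reconcile`, `Plan`, and the silo names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    restart: list   # names of silos whose health check failed
    start_new: int  # additional silos needed to reach the desired count


def reconcile(desired, healthy_by_silo):
    """Decide what the hosting layer should do for one observation.

    desired: number of silos that should be running.
    healthy_by_silo: mapping of silo name -> result of its health check.
    """
    restart = sorted(name for name, ok in healthy_by_silo.items() if not ok)
    start_new = max(0, desired - len(healthy_by_silo))
    return Plan(restart=restart, start_new=start_new)
```

For example, `reconcile(3, {"silo-0": True, "silo-1": False})` asks for `silo-1` to be restarted and one more silo to be started. Once those processes are up, joining the cluster is left to the Orleans runtime.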

Does it make sense to monitor the health of the grains? I suppose it's not needed as Orleans will automatically deactivate them if they fail and we'll get an error in the logs.

Generally speaking, grains never fail. But the silo where a grain is activated might. In that case, the grain gets automatically reactivated on another silo upon the next call to it. Because of that, instead of monitoring individual grains, people usually monitor the health of their service as a whole, by executing synthetic transactions against it or otherwise.
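
A synthetic-transaction check of this kind could look roughly like the Python sketch below. Everything in it is an assumption for illustration: `SyntheticMonitor` is a hypothetical helper, `probe` stands for whatever end-to-end request makes sense for your service, and the latency limit and failure threshold are arbitrary defaults.

```python
import time


class SyntheticMonitor:
    """Runs a probe and tracks consecutive failures (hypothetical helper)."""

    def __init__(self, probe, max_latency_s=1.0, failure_threshold=3,
                 clock=time.monotonic):
        self._probe = probe                    # callable returning truthy on success
        self._max_latency_s = max_latency_s
        self._failure_threshold = failure_threshold
        self._clock = clock
        self.consecutive_failures = 0

    @property
    def healthy(self):
        return self.consecutive_failures < self._failure_threshold

    def run_once(self):
        """Execute one synthetic transaction and return the health verdict."""
        start = self._clock()
        try:
            ok = bool(self._probe())
        except Exception:
            ok = False  # an exception counts as a failed transaction
        elapsed = self._clock() - start
        if ok and elapsed <= self._max_latency_s:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.healthy
```

Requiring several consecutive failures before reporting the service as unhealthy is a design choice that tolerates a single slow call, for instance one that has to wait for a grain to be reactivated on another silo.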

@yevhen
@sergeybykov
Thank you very much. I believe the issue can be considered resolved as the questions have been answered.
