Orleans: Redistribution of grain activations

Created on 7 Feb 2017 · 7 Comments · Source: dotnet/orleans

Merely seeking advice here.

We have quite a large number of grain activations that, once created, stay activated basically forever (constant reminders keep refreshing their data). During a deployment, silos are updated one by one, and those grains get activated on whichever silos are available at the time, which causes an uneven distribution of grains. And since those grains are never deactivated, they are never redistributed to the other silos once those are available again.

I wonder what the best practices are; I think this should be a common scenario.

question

DixonDs

All 7 comments

Seems like a good opportunity to try activation count based placement. It would not move activations, but will initially balance much better.

gabikliot on 7 Feb 2017

👍2
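To illustrate the point about initial balance, here is a toy Python model (not Orleans code). It assumes a simplified rule, "place each new activation on the silo that currently hosts the fewest activations"; the actual Orleans placement director may differ in detail. The silo names and counts are made up for the example.

```python
from collections import Counter

def place_least_loaded(counts):
    """Return the silo hosting the fewest activations (ties broken by name)."""
    return min(sorted(counts), key=lambda silo: counts[silo])

def simulate(existing, new_grains):
    """Place new_grains activations one by one; existing activations never move."""
    counts = Counter(existing)
    for _ in range(new_grains):
        counts[place_least_loaded(counts)] += 1
    return dict(counts)

# Hypothetical cluster: silo C has just joined and is empty.
result = simulate({"A": 100, "B": 100, "C": 0}, new_grains=90)
print(result)  # {'A': 100, 'B': 100, 'C': 90}
```

All 90 new activations land on the empty silo, while the 200 existing activations stay where they are. That matches the comment above: placement improves, but nothing is migrated.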

Another trick we've seen is to have those grains periodically (infrequently) call DeactivateOnIdle() to trigger re-placement of them for rebalancing. A hacky solution, but might get you "automatic" rebalancing.

sergeybykov on 7 Feb 2017
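The effect of that trick can be sketched with a toy Python model (not Orleans code). Assumptions for the sketch: each grain deactivates once per `period` rounds, its next call re-activates it, and the new activation goes to the silo with the fewest activations. With random placement the result would be only approximately even. All names and numbers here are hypothetical.

```python
def least_loaded(counts):
    """Return the silo hosting the fewest activations (ties broken by name)."""
    return min(sorted(counts), key=lambda silo: counts[silo])

def rebalance_by_periodic_deactivation(location, silos, period):
    """location maps grain id -> silo; it is updated in place.

    In round r, every grain with id % period == r deactivates and is
    placed again. Returns the per-silo activation counts afterwards.
    """
    counts = {silo: 0 for silo in silos}
    for silo in location.values():
        counts[silo] += 1
    for round_no in range(period):
        for grain in sorted(location):
            if grain % period != round_no:
                continue
            counts[location[grain]] -= 1   # the grain deactivates
            target = least_loaded(counts)  # its next call re-activates it
            location[grain] = target
            counts[target] += 1
    return counts

# After a rolling deployment, all 100 grains ended up on silo A.
location = {grain_id: "A" for grain_id in range(100)}
counts = rebalance_by_periodic_deactivation(location, ["A", "B"], period=10)
print(counts)  # {'A': 50, 'B': 50}
```

Spreading the deactivations over several rounds, rather than deactivating everything at once, keeps the re-activation cost small at any given moment.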

@gabikliot That won't really help, I think. Imagine a case with two silos A and B being redeployed. Silo A goes down, all those grains are activated in silo B; then silo B goes down, and all those grains are activated in silo A and stay there forever.

@sergeybykov thanks, that sounds like a good practical solution, I guess we'll give it a try. I wonder though if there is anything planned to be done in the long term for dynamic rebalancing. I believe there was some chat in Gitter mentioning an academic paper on a related matter, but now I can't really find it.

DixonDs on 7 Feb 2017
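The two-silo redeployment scenario described above can be reproduced with a toy Python model (not Orleans code). It assumes that reminders re-activate every displaced grain immediately, while the restarting silo is still down, and that placement picks the least-loaded silo among those that are up. The counts are made up.

```python
def rolling_redeploy(counts, order):
    """Restart silos one at a time, in the given order.

    While a silo is down, its grains are re-activated on the least-loaded
    silo that is still up. The restarted silo then rejoins empty.
    """
    counts = dict(counts)
    for silo in order:
        displaced = counts[silo]
        counts[silo] = 0
        up = [s for s in counts if s != silo]
        for _ in range(displaced):
            target = min(up, key=lambda s: counts[s])
            counts[target] += 1
    return counts

print(rolling_redeploy({"A": 50, "B": 50}, order=["A", "B"]))
# {'A': 100, 'B': 0}
```

Even with count-aware placement, the silo restarted last ends up empty, because it was not available when the grains were placed. In this model the same happens with three silos: starting from 40 activations each and restarting A, B, C in turn leaves C with none.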

> I wonder though if there is anything planned to be done in the long term for dynamic rebalancing.

No concrete plans yet. In general, we've been concerned that rebalancing purely on the number of activations is a dangerous proposition because there are usually "cold" and "hot" activations, and it's too easy to create an imbalance if the protocol is oblivious to that.

sergeybykov on 7 Feb 2017
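A small Python illustration of the "hot" versus "cold" concern, using made-up request rates: two silos can host the same number of activations while carrying very different load, so a count-based rebalancer would see nothing to fix.

```python
def silo_stats(activations):
    """activations: (silo, requests_per_second) pairs, one per activation.

    Returns {silo: (activation_count, total_requests_per_second)}.
    """
    stats = {}
    for silo, rate in activations:
        count, load = stats.get(silo, (0, 0))
        stats[silo] = (count + 1, load + rate)
    return stats

# Hypothetical numbers: silo A hosts two hot grains, silo B only cold ones.
activations = [("A", 500), ("A", 400), ("A", 1),
               ("B", 1), ("B", 1), ("B", 1)]
print(silo_stats(activations))  # {'A': (3, 901), 'B': (3, 3)}
```

Both silos report three activations, yet A handles roughly 300 times the request rate of B.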

I think I found the paper I mentioned: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/eurosys16loca_camera_ready-1.pdf The fun part is that @gabikliot seems to be one of the co-authors :)

DixonDs on 7 Feb 2017

In ActOp one of the main tricks was to migrate individual grains to collocate them with the grains they communicate the most with. That was to reduce latency and increase throughput, not necessarily to achieve a balanced distribution IIRC.

sergeybykov on 7 Feb 2017

@sergeybykov's suggestion of using DeactivateOnIdle is likely the most practical fix.

As for the root of the problem, I don't see this as a placement strategy issue so much as a startup issue. Orleans features are designed to function in a cluster and do not act as expected with standalone silos. The fact that the cluster becomes active and grains can be placed before a critical mass (all? most?) of the silos is up allows these kinds of unexpected (or at least unintended) side effects to occur.

My suspicion is that this class of problem should be addressed as part of the cluster status. A silo should not run reminders or place grains until the cluster is ready, not when the individual silo is ready.
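As a purely hypothetical sketch of that idea (nothing in this thread indicates Orleans offers such a gate), the toy Python model below queues activation requests until an expected number of silos has joined, and only then places them on the least-loaded silo. `GatedPlacer` and `expected_silos` are invented names for illustration.

```python
class GatedPlacer:
    """Defers grain placement until the cluster reaches its expected size."""

    def __init__(self, expected_silos):
        self.expected_silos = expected_silos
        self.counts = {}    # silo name -> number of activations
        self.pending = []   # grains waiting for the cluster to be ready

    def ready(self):
        return len(self.counts) >= self.expected_silos

    def silo_joined(self, silo):
        self.counts[silo] = 0
        if self.ready():
            # The cluster is complete: place everything that was queued.
            pending, self.pending = self.pending, []
            for grain in pending:
                self._place(grain)

    def request_activation(self, grain):
        if self.ready():
            self._place(grain)
        else:
            self.pending.append(grain)

    def _place(self, grain):
        target = min(sorted(self.counts), key=lambda s: self.counts[s])
        self.counts[target] += 1

placer = GatedPlacer(expected_silos=2)
placer.silo_joined("A")
for grain_id in range(10):
    placer.request_activation(grain_id)  # queued: only one silo is up
placer.silo_joined("B")
print(placer.counts)  # {'A': 5, 'B': 5}
```

Without the gate, all ten grains would have been placed on A before B joined.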