Orleans: How to let timers survive a (graceful) silo shutdown

Created on 5 Mar 2018 · 19 Comments · Source: dotnet/orleans

I have a Game grain, which manages the logic for a turn-based game between two players. Inside this grain, a timer is used to measure the time for each turn and forcefully end a player's turn if they fail to play in a timely manner.

If this grain gets deactivated (even gracefully), the timer will cease to exist. The creation of a new activation of this grain will depend on it receiving external calls, which means it may not happen until one of the players makes a move, which defeats the purpose of the timer. I can save the game's state and let it continue off from where the old silo was shut down, but I currently have no way of bringing the timer back online.
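The "save the game's state and continue" part can be sketched as follows. This is a minimal, language-neutral illustration in Python, not Orleans code: `GameState`, `start_turn`, `remaining_on_activation`, and the 30-second turn length are all hypothetical names and values chosen for the example. The idea is to persist an absolute deadline rather than a running countdown, so a new activation can work out how much of the turn is left. It does not address the open problem of what triggers the new activation.

```python
from dataclasses import dataclass

# Assumed turn length; the post only says a turn takes "about half a minute".
TURN_SECONDS = 30.0


@dataclass
class GameState:
    """Hypothetical persisted state; only the timing field is shown."""
    turn_deadline: float  # absolute wall-clock time, seconds since epoch


def start_turn(now: float) -> GameState:
    """Record when the current turn must end, as an absolute timestamp."""
    return GameState(turn_deadline=now + TURN_SECONDS)


def remaining_on_activation(state: GameState, now: float) -> float:
    """Delay for the re-created timer; 0.0 means the turn has already expired."""
    return max(0.0, state.turn_deadline - now)


# Usage: a turn starts at t=1000, the grain is reactivated at t=1012.
state = start_turn(1000.0)
print(remaining_on_activation(state, 1012.0))
```

Storing the deadline instead of the elapsed time means the time spent between deactivation and reactivation is counted against the turn automatically.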

One way is to constantly poll the grain from another grain. This fails because the other grain is also subject to the same limitations as the original (namely, it may get deactivated with no way to restore its timers).

Another way is to have a second grain (let's call it pinger). One could have the game ask the pinger (on deactivation) to call it back after a second or so. This is likely to succeed most of the time, but if the pinger happens to be on the same silo and gets deactivated after the game, it will still fail.

Yet another option may be to have the game call itself. However, I'm not entirely sure this is a stable solution, as I don't know how the silos handle the case of a grain receiving a call while it's already deactivating but not yet fully deactivated. If the grain is removed from the global directory before the OnDeactivateAsync call, this may succeed. I just thought of this while writing this question, so I have yet to see if this actually works.

A reminder is not appropriate in this scenario because the minimum interval for a reminder is 60 seconds IIRC. An entire turn takes about half a minute.

Any other ideas on how to handle this (and other similar cases)?

question

Arshia001

All 19 comments

I should probably also mention that the scenario I'm interested in is maintenance/silo update/etc. where silos are being brought down one by one. I don't want players to automatically lose their games if a silo has to be shut down with prior notice.

Arshia001 on 5 Mar 2018

Reminders might work: https://dotnet.github.io/orleans/Documentation/Advanced-Concepts/Timers-and-Reminders.html

SebastianStehle on 5 Mar 2018

As I already said, reminders are not suitable for the current scenario.

Arshia001 on 5 Mar 2018

So I tried having a grain call itself (via message passing of course) on deactivation, but it just timed out instead of forwarding the request to a new activation on another silo. What's funny is that the grain seemed to reactivate itself if it was deactivated while the silo was running; it only timed out during shutdown. I also tried not awaiting the call, still no luck.

I guess what's really needed is for a grain to be able to ask the silo environment to just reactivate it on a different silo.

Arshia001 on 5 Mar 2018

Sorry for my short answer. Here is the longer version:

I would use the reminder in combination with timers: https://dotnet.github.io/orleans/Documentation/Advanced-Concepts/Timers-and-Reminders.html#combining-timers-and-reminders

Furthermore, you can create a startup task (IStartupTask) to activate a grain. Both startup tasks and reminders should help to keep your grain alive.

SebastianStehle on 5 Mar 2018

I've read that document, and a one-minute reminder is both wasteful and inappropriate (the grain needs to come back online immediately). As for startup tasks, I haven't heard of those, but I'm assuming they're the 2.0 equivalent of bootstrap providers. There will be no silo initialization (just a graceful shutdown) in progress and therefore a startup task won't do any good.

Arshia001 on 5 Mar 2018

@Arshia001 "Immediately" is very hard to achieve in a distributed system. For example, consider that it takes time (by default 30 seconds) to detect a sudden silo failure or a network partition.

It seems to me what you are describing, conceptually, is a reminder but with a different set of requirements. For example, upon a graceful shutdown, a grain getting deactivated would tell the service to reduce the reminder interval for a quick reactivation, and would reset it back to a longer one upon its successful activation on a different silo.

You can implement such a service via the [still undocumented] grain service feature, to make it behave exactly how you need. That's how reminders are implemented today.

sergeybykov on 5 Mar 2018

@sergeybykov I don't (currently) care about an unexpected shutdown scenario. The requirement is simple enough: restore a grain on an active silo in case it gets deactivated gracefully.
I'm always willing to dig into Orleans' undocumented parts. Can you give me a hint as to which part of the source I should start reading?

Arshia001 on 5 Mar 2018

I understand. With grains being reactive entities by nature, you need an external party to trigger their reactivation. The approach I suggested considering is a variation of the reminder service. There are other potential options, for example, streams. My hunch is that the service approach is superior because it allows for arbitrary logic, and hence can deal with races and other corner cases more intelligently.

Take a look at https://github.com/dotnet/orleans/blob/master/src/Orleans.Runtime/ReminderService/LocalReminderService.cs for how the reminder service is implemented. You may or may not need all of its complexity. But the base logic I think should apply.

sergeybykov on 5 Mar 2018

Thanks. I'll take a look at work tomorrow and post back.

Arshia001 on 5 Mar 2018

@Arshia001 For scalability reasons, many game engines run the simulation on the client and require server-side logic to merely verify the results. Under this approach, one would expect the clients to send updates to the server on turn end (including some sort of 'no action taken' update when users don't react in time). When the server has a timer, it can use that timer to verify that the actions were taken in time, but if the grain was deactivated, it could use the information from the clients' end-of-turn updates to reset the timer. There is an edge case: if a grain is deactivated, a player could cheat and get more time by modifying the data in their response. That is a very narrow hole, though, and it can be mitigated by cross-referencing their turn time data with other players', if it becomes a problem.

jason-bragg on 5 Mar 2018

@jason-bragg Thanks for the info. That is (almost) the case with my current code. A client makes moves intelligently, then these moves get replicated to the server (one by one, since other players need real time updates on what their opponent is doing). The server simply verifies and replicates the moves to other players. It does, however, keep a copy of the game state against which it can verify each player's moves.

Now, this is all well and good, but I don't like the idea of leaving anything important to client devices. I don't know the tools they use, but I've seen many, many players (most of them teens or children, lacking in-depth knowledge of computers) modify games in some way to get invalid purchases or extra money. Rather than trusting each client and making attempts (most likely futile) to secure the code, I'd rather just keep the logic on the server side and keep every client at arm's length.

There is also another case of the server impersonating a player. In case other players of the same rank are unavailable, some players may be matched against a bot. This bot will try its best to look like a real player. One way to make it look like a player is to have it play on a timer. First, it thinks for a few seconds, then it makes its moves one by one, each with a second or so of delay. Even if I were to restructure the flow of the game and leave more of it to clients, bots would still fail to play in case of a shutdown.

Arshia001 on 6 Mar 2018

So, I just read through GrainService and most of LocalReminderService. If I understand correctly, a grain service is distributed among all silos, with each silo being responsible for a range of grains. If a silo goes down, its range is distributed among all others (this is the most important aspect of services). Grains always access the local service instance. The service instances sync their work via the backing datastore and etags, with which I'm familiar.

So how would I adapt this system to fit my current needs? I'm thinking there will be a KeepAliveService. Grains will call KeepAliveService.KeepAlive(this) to add themselves to the keep-alive list. At this point they're alive, and all is good. Now, if a silo goes down, its ring will eventually be distributed among others. When a service instance receives an OnRangeChange call, it will simply find out which grains were added to its range and call them to wake them up.
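The range-change step described above can be sketched as follows. This is a language-neutral Python model, not the Orleans `OnRangeChange` API: `in_range`, `newly_owned`, the integer hash values, and the `(begin, end]` range convention are assumptions made for illustration. Given the old and new ranges of one service instance, it picks out the registered grains that have just fallen into this instance's range and therefore need a wake-up call.

```python
from typing import Dict, List, Tuple

# A ring range is modelled as (begin, end], with wrap-around past the maximum hash.
Range = Tuple[int, int]


def in_range(h: int, begin: int, end: int) -> bool:
    """True if hash h lies in (begin, end] on the ring."""
    if begin < end:
        return begin < h <= end
    # The range wraps around zero; begin == end is treated as the full ring.
    return h > begin or h <= end


def newly_owned(registered: Dict[str, int], old: Range, new: Range) -> List[str]:
    """Grains covered by the new range that the old range did not cover."""
    return [
        grain_id
        for grain_id, h in registered.items()
        if in_range(h, *new) and not in_range(h, *old)
    ]


# Usage: this instance's range grows from (150, 200] to (100, 200].
registered = {"game-1": 120, "game-2": 180, "game-3": 250}
for grain_id in newly_owned(registered, old=(150, 200), new=(100, 200)):
    print("wake up", grain_id)
```

In a real service the `registered` map would have to come from the backing store, since the instance that originally recorded the registration is the one that went away.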

Since (I think) the shutdown process deactivates grain activations before destroying services, the new activation should be placed on a new silo and be available for requests. Is this likely to work?

Arshia001 on 6 Mar 2018

> Grains always access the local service instance.

Grains talk to different partitions (on different silos), depending on which range the hash of their ID falls into. The rest is correct.
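A minimal sketch of that routing idea, in Python rather than Orleans code. Everything here is an assumption for illustration: the `Ring` class, the CRC32 hash (Orleans uses its own hashing), the silo names, and the convention that a silo owns the hashes up to and including its own ring point.

```python
import bisect
import zlib
from typing import Dict


def grain_hash(grain_id: str) -> int:
    """Stable 32-bit hash of a grain ID (CRC32 stands in for the real hash)."""
    return zlib.crc32(grain_id.encode("utf-8"))


class Ring:
    """Maps a hash to the silo whose partition of the service owns it."""

    def __init__(self, silo_points: Dict[str, int]):
        # Sort silos by their position on the ring.
        self._points = sorted((point, silo) for silo, point in silo_points.items())
        self._keys = [point for point, _ in self._points]

    def owner(self, h: int) -> str:
        # First ring point at or after h; wrap to the first silo past the end.
        i = bisect.bisect_left(self._keys, h)
        return self._points[i % len(self._points)][1]


# Usage: three silos at hypothetical ring positions.
ring = Ring({"silo-A": 100, "silo-B": 200, "silo-C": 300})
print(ring.owner(150))                       # a hash between A and B
print(ring.owner(grain_hash("game-1") % 400))
```

The point of the sketch is only that the target partition is a function of the grain's ID, not of the silo the grain happens to be running on.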

> So how would I adapt this system to fit my current needs? I'm thinking there will be a KeepAliveService.

My thinking was similar. Each grain that you care about would register with the service upon activation (from within its OnActivateAsync). The registration will need to be recorded in storage for reliability, and a keep-alive timer for that grain will start with a default period. Let's say 30 seconds. This will guarantee that the grain will get reactivated after any kind of failure within ~30 seconds.

When the grain is getting deactivated, it will call the service (from within its OnDeactivateAsync) to reduce the interval for faster reactivation, for example 2 seconds. This will trigger its reactivation on a new silo much sooner. From within OnActivateAsync it'll call the service again and reset the interval back to the default length.

When the grain isn't needed anymore (player left the game, etc.), the grain will unregister from the service.
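The protocol in this comment can be modelled roughly as below. This is a language-neutral Python sketch under stated assumptions, not a grain service implementation: `KeepAliveRegistry` and its method names are hypothetical, the registry lives in memory (the comment calls for recording registrations in storage), and the caller is expected to ping whatever `due` returns. The 30-second and 2-second intervals are the example values from the comment.

```python
from dataclasses import dataclass
from typing import Dict, List

DEFAULT_INTERVAL = 30.0  # seconds; normal keep-alive period
FAST_INTERVAL = 2.0      # seconds; used while a grain is being deactivated


@dataclass
class Registration:
    interval: float
    next_ping: float


class KeepAliveRegistry:
    """In-memory model of the keep-alive bookkeeping (persistence omitted)."""

    def __init__(self) -> None:
        self._regs: Dict[str, Registration] = {}

    def register(self, grain_id: str, now: float) -> None:
        """Called on activation; also resets a shortened interval to the default."""
        self._regs[grain_id] = Registration(DEFAULT_INTERVAL, now + DEFAULT_INTERVAL)

    def notify_deactivating(self, grain_id: str, now: float) -> None:
        """Called on deactivation; shortens the interval for a quick reactivation."""
        if grain_id in self._regs:
            self._regs[grain_id] = Registration(FAST_INTERVAL, now + FAST_INTERVAL)

    def unregister(self, grain_id: str) -> None:
        """Called when the grain is no longer needed (game over, player left)."""
        self._regs.pop(grain_id, None)

    def due(self, now: float) -> List[str]:
        """Return grains to ping now and schedule each one's next ping."""
        ready = []
        for grain_id, reg in self._regs.items():
            if reg.next_ping <= now:
                ready.append(grain_id)
                reg.next_ping = now + reg.interval
        return ready
```

A ping to a deactivated grain is what causes a new activation, which in turn calls `register` again and restores the default interval.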

sergeybykov on 6 Mar 2018

Makes perfect sense. I'll be back with an implementation tomorrow.

Arshia001 on 6 Mar 2018

BTW, we are looking at improving the grain service registration API before rc2. Just a heads-up.

sergeybykov on 6 Mar 2018

So far, I have set up a grain service and I can receive range change notifications. I have one piece of feedback regarding the implementation: it is not clear in any way that a grain service is responsible for setting GrainService.Status to Started, or that it won't receive range change notifications otherwise.

Arshia001 on 7 Mar 2018

I submitted #4155 to improve the registration story for grain services.

Good point about Started. We could automatically set it at the end of GrainService.Start. But that would set it prematurely for those services (like reminders) that need to start in the background. I don't quite have an idea yet how we could solve this cleanly. Do you?

sergeybykov on 7 Mar 2018

That's a good question indeed... How about adding an implementation of StartInBackground to GrainService that just sets Status to Started? It could be abused if it was called at the start of the child class implementation, but a line of comment can solve that problem.

Arshia001 on 8 Mar 2018
