Orleans: Race condition with interleaved messages using TCS

Created on 17 Jul 2018 · 12 comments · Source: dotnet/orleans

We're extensively using the async batcher pattern described by @hataytuna in this book:
https://github.com/hataytuna/Distributed/raw/master/Applied%20Actor%20Model%20with%20Orleans%20v1.1.pdf It is crucial to the performance boost we get from batching in our system.

We figured out that the problem we're experiencing with graceful shutdown, described in #4757, is caused by Orleans not being able to deactivate some of the grains. This only happens when TaskCompletionSource is used (along with buffering and timer-based completion).
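For context, the pattern in question looks roughly like this (a minimal sketch under my own assumptions, not the book's exact code — the `Batcher` class and its members are illustrative): each grain call enqueues an item together with a TaskCompletionSource and awaits its Task, while a grain timer periodically flushes the buffer and completes the pending tasks.

```csharp
// Minimal illustrative sketch of the TCS-based async batcher pattern.
class Batcher
{
    private readonly List<(string Item, TaskCompletionSource<bool> Tcs)> buffer = new();

    // Called from grain methods: the request stays pending (interleaved)
    // until the timer tick flushes the buffer.
    public Task Add(string item)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        buffer.Add((item, tcs));
        return tcs.Task;
    }

    // Called from a grain timer: process the whole batch, then complete the waiters.
    // If the timer is stopped before this runs, the awaiting requests hang forever.
    public void Flush()
    {
        foreach (var (_, tcs) in buffer)
            tcs.TrySetResult(true);
        buffer.Clear();
    }
}
```

The key property is that the pending requests can only ever be completed by the next timer tick, which is what makes stopping the timer dangerous here.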

I have a failing test (#4773) for this scenario that you can pull and check. On the surface, it looks like after DeactivateOnIdle is called, Orleans disposes the timer while there are still interleaved requests in flight, which leaves their tasks uncompleted and thus prevents Orleans from deactivating the grain.

Please advise what can be done about this issue.

Most helpful comment

Yes, that's part of the bigger issue I mentioned before. I am writing a global fix, but it's touching some critical parts of the runtime, I prefer to take my time and do it right :)

All 12 comments

Found a bug in my test. It seems that it works. Now I'm completely confused about why exactly the same code (and test) doesn't work properly in my app. I'll post an update once I figure out what is wrong.

Well, we found that the problem is with one of our actor middlewares. After we removed the problematic actor filter, it started to work. We'll be investigating the reasons. Sorry for the false alarm.

Glad to hear it.

Cheers,
Hatay


Unfortunately, it's still an issue. I've added more repeats to the test, so now it fails more reliably. Please confirm whether the test is wrong or there is a real issue. None of the suggestions helped, and we still have some actors stuck in the production environment during rolling restarts :(

OK, guys, now I can definitely confirm that it's a bug in the Orleans code.

I've changed this method to:

        public void PrepareForDeactivation()
        {
            SetState(ActivationState.Deactivating);
            deactivationStartTime = DateTime.UtcNow;
            // Only stop the timers immediately when no requests are executing;
            // otherwise leave them running so interleaved requests can still complete.
            if (!IsCurrentlyExecuting)
                StopAllTimers();
        }

(in a custom private build), and now my system shuts down smoothly in production without any issues.

I don't know how to write a test that fails reliably (to demonstrate the bug), but it can be figured out analytically. Take a look at this method:

https://github.com/dotnet/orleans/blob/839e95303f897c0bcc6b86222ff03d43b1420e21/src/Orleans.Runtime/Catalog/Catalog.cs#L867-L886

It checks whether the activation is still executing requests, and if so, it subscribes to OnIdle to finish the deactivation. But since the timers have already been stopped by PrepareForDeactivation, those requests hang: nothing is able to complete them, because the timer is gone.

I've checked all paths to this method, and the fix seems right: once the activation is done with outstanding requests, OnIdle will be called and the timers will be properly disposed anyway. All the other call sites have an invariant that no requests are currently executing (i.e., activation collection), so the timers will be stopped immediately.

This bug has been in Orleans since the beginning (as the blame history shows). I'm not sure how anyone has used this pattern (TaskCompletionSource) without issues (cc @hataytuna).

@yevhen Thank you so much for this investigation. I'm trying to understand the logic behind this.

I have been investigating this for some time now, and there are things I don't understand.

It checks whether the activation is still executing requests, and if so, it subscribes to OnIdle to finish the deactivation. But since the timers have already been stopped by PrepareForDeactivation, those requests hang: nothing is able to complete them, because the timer is gone.

Grain timer tick != Grain request.

When a timer is executing, ActivationData.IsCurrentlyExecuting will still return false, since the timer does not update the ActivationData (if I am not mistaken, single-threaded execution is guaranteed by the scheduler). That's why we have a dedicated method, ActivationData.WaitForAllTimersToFinish.

The call to StopAllTimers from ActivationData.PrepareForDeactivation only prevents new ticks from being scheduled and executed; I don't think it has any impact on a currently running tick.

Do you have logic in your grain where some request waits for a timer to fire?

It's all due to interleaving combined with timer-based request completion. When the deactivation request comes in, the grain still has some requests incomplete. In this case ActivationData.IsCurrentlyExecuting returns true, so deactivation is delayed until the grain is idle. But the problem is that those interleaved requests can only be completed by the timer, which has already been stopped. It's quite possible that at the moment the deactivation request is issued, the timer is about to tick (no timer tick is currently executing) and interleaved requests are already queued (buffered inside the grain). Stopping the timer on deactivation means the next tick will never be delivered, so the requests hang forever.

OK, I think I got it by reading your failing test again. Let me rephrase to see if I understand correctly:

On a given grain, you have multiple pending requests (interleaving) that are waiting for a timer tick to complete. If you stop the timer before it has a chance to unblock the pending requests, the grain will never be able to deactivate itself.

I think the issue is more general: if the grain is currently executing a request, and for some reason this request is stuck (deadlock, etc.), the deactivation logic on OnIdle will never be called, so the grain will not be deactivated correctly. I think your fix makes sense, but we need a more generic safeguard to catch this kind of issue.

@benjaminpetit thanks for #4830 fix. We also found another problem with deactivation :(

Some of our grains do retries in timer callbacks and have code similar to the one below (simplified):

    while (true)
    {
        if (TryFoo())
            break;
        // Keep retrying inside the timer callback until the operation succeeds.
        await Task.Delay(15);
    }

Orleans is unable to destroy such actors in time because this line of code
https://github.com/dotnet/orleans/blob/e2b25a695767ae5746ffe47218c1331427997894/src/Orleans.Runtime/Catalog/Catalog.cs#L963

has no timeout attached. Basically, the deactivation procedure hangs while the actor is doing something heavy (lengthy) inside a timer callback.

In our private Orleans build I added WithTimeout and then simply proceed to force-destroy such actors. Consider this an addition to your list of things that can go wrong during the deactivation stage. You mentioned a general safeguard for actors that are stuck while handling requests, but this is about actors that are stuck inside a timer callback.
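The workaround can be sketched as follows (a hypothetical helper of my own, not the actual private-build code or an Orleans API; the exception message and timeout handling are illustrative):

```csharp
// Hypothetical sketch: bound the wait on an activation's timer callbacks
// so deactivation can proceed to force-destroy instead of hanging forever.
public static class TaskExtensions
{
    public static async Task WithTimeout(this Task task, TimeSpan timeout)
    {
        var winner = await Task.WhenAny(task, Task.Delay(timeout));
        if (winner != task)
            throw new TimeoutException("Timer callbacks did not finish in time; forcing destroy.");
        await task; // observe and propagate any exception from the original task
    }
}
```

Wrapping the wait on WaitForAllTimersToFinish with something like this lets the catalog give up after a bounded interval and continue with a forced destroy of the activation.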

Yes, that's part of the bigger issue I mentioned before. I am writing a global fix, but it's touching some critical parts of the runtime, I prefer to take my time and do it right :)

Fixed with #4883
