Orleans: Simple call chain reentrancy deadlock A->B->C->A (with repro)

Created on 18 Mar 2019 · 10Comments · Source: dotnet/orleans

There is call chain reentrancy deadlock with simple call flow A -> B -> C -> A.

Grains.zip

UPD: Not reproducing on 2.1.0 version.

Source

MelnikovIG

👀2 👍1

Most helpful comment

@ReubenBond Looks nice, ill try to fix this within few next days

MelnikovIG on 26 Mar 2019

👍2

All 10 comments

Confirming repro.

Works:

Marking GrainA with [Reentrant].
Downgrading to 2.1.0.

Does not work:

Recursive call to AGrain.Act() instead of AGrain.Act2()
.Configure<SchedulingOptions>(_ => _.AllowCallChainReentrancy = true).
Upgrading to 2.3.0-rc2.

Surprised this wouldn't get picked up by automated tests.
Was there a planned change in default reentrancy behaviour since 2.1.0?

JorgeCandeias on 18 Mar 2019

It looks like the issue that @andhesky attempted to fix with #5145. However, we ended up reverting it last minute before releasing 2.2.0 final due to performance regression caused by it (via #5249).

I think 2.1.0 works because of #4382 that we reverted in 2.1.2 via #5086.

sergeybykov on 19 Mar 2019

👍1

@sergeybykov Any news with this issue? I think it is critical issue.

MelnikovIG on 20 Mar 2019

I believe we should implement something along these lines:

Change ActivationData.RunningRequestsSenders from a HashSet<ActivationId> to HashSet<Message> or HashSet<Guid> and rename to RunningMessages or RunningCallChains
Add a CallChainId property (a Guid or long) to Message.HeadersContainer which is set automatically if it's not already set on the current Message (I believe MarkSameCallChainMessageAsInterleaving is the right place to set it - remove the code which sets IsAlwaysInterleave to true, the determination should be made on the receiving side, not the sending side). EDIT: Message.CallChainId already exists but it's unused
Modify Catalog.CanInterleave to check if the current message's CallChainId is present in RunningMessages/RunningCallChains

How does that sound?

ReubenBond on 25 Mar 2019

I don't know the architecture that well, but the solution sounds reasonable. I especially like doing the check on the receiver side. I was a bit surprised that is not how it worked initially, but didn't want to change it. For my own interest, why will that perform better than my solution? Is it because it is not using RequestContext?

I think the tests from my PR should still be good so you may be able to reuse them. My change also tried to reconcile some code duplication between deadlock checking and call chain reentrancy. If you can save that it might make maintenance simpler in the future.

andhesky on 25 Mar 2019

@andhesky primarily because it's not sending a list of ActivationIds along the call chain. Avoiding RequestContext is part of it, too.

The cost of the check is primarily paid when a call would not otherwise be admitted.

~Note that this change would break rolling update compatibility unless AllowCallChainReentrancy is disabled on all clients/silos prior to the upgrade. It can be re-enabled after the upgrade.~ EDIT: Message.CallChainId already exists, so there should be no upgrade-breaking change.

ReubenBond on 25 Mar 2019

@ReubenBond Looks nice, ill try to fix this within few next days

MelnikovIG on 26 Mar 2019

👍2

@ReubenBond I think I can't fix this without deep knowlege of Orleans

MelnikovIG on 16 May 2019

👍1

Any news on this? I think i might have the same problem....

Lionhunter3k on 19 Dec 2019

@ReubenBond

I believe we should implement something along these lines:

One of my production systems just got hit hard by this issue; as in half of our users were essentially locked out for a couple days before we could pinpoint the problem and publish the fix. These call chains get rather hard to decode once they're long enough.

Your suggestion here is exactly what I was just going to suggest, though with less technical details. It seems simple enough. Is there a reason why it hasn't been implemented yet (performance issues perhaps)? I could try to implement it, though I'm not making any promises.