There is call chain reentrancy deadlock with simple call flow A -> B -> C -> A.
UPD: Not reproducing on 2.1.0 version.
Confirming repro.
Works:
GrainA with [Reentrant].Does not work:
AGrain.Act() instead of AGrain.Act2().Configure<SchedulingOptions>(_ => _.AllowCallChainReentrancy = true).Surprised this wouldn't get picked up by automated tests.
Was there a planned change in default reentrancy behaviour since 2.1.0?
It looks like the issue that @andhesky attempted to fix with #5145. However, we ended up reverting it last minute before releasing 2.2.0 final due to performance regression caused by it (via #5249).
I think 2.1.0 works because of #4382 that we reverted in 2.1.2 via #5086.
@sergeybykov Any news with this issue? I think it is critical issue.
I believe we should implement something along these lines:
ActivationData.RunningRequestsSenders from a HashSet<ActivationId> to HashSet<Message> or HashSet<Guid> and rename to RunningMessages or RunningCallChainsCallChainId property (a Guid or long) to Message.HeadersContainer which is set automatically if it's not already set on the current Message (I believe MarkSameCallChainMessageAsInterleaving is the right place to set it - remove the code which sets IsAlwaysInterleave to true, the determination should be made on the receiving side, not the sending side). EDIT: Message.CallChainId already exists but it's unusedCatalog.CanInterleave to check if the current message's CallChainId is present in RunningMessages/RunningCallChainsHow does that sound?
I don't know the architecture that well, but the solution sounds reasonable. I especially like doing the check on the receiver side. I was a bit surprised that is not how it worked initially, but didn't want to change it. For my own interest, why will that perform better than my solution? Is it because it is not using RequestContext?
I think the tests from my PR should still be good so you may be able to reuse them. My change also tried to reconcile some code duplication between deadlock checking and call chain reentrancy. If you can save that it might make maintenance simpler in the future.
@andhesky primarily because it's not sending a list of ActivationIds along the call chain. Avoiding RequestContext is part of it, too.
The cost of the check is primarily paid when a call would not otherwise be admitted.
~Note that this change would break rolling update compatibility unless AllowCallChainReentrancy is disabled on all clients/silos prior to the upgrade. It can be re-enabled after the upgrade.~ EDIT: Message.CallChainId already exists, so there should be no upgrade-breaking change.
@ReubenBond Looks nice, ill try to fix this within few next days
@ReubenBond I think I can't fix this without deep knowlege of Orleans
Any news on this? I think i might have the same problem....
@ReubenBond
I believe we should implement something along these lines:
One of my production systems just got hit hard by this issue; as in half of our users were essentially locked out for a couple days before we could pinpoint the problem and publish the fix. These call chains get rather hard to decode once they're long enough.
Your suggestion here is exactly what I was just going to suggest, though with less technical details. It seems simple enough. Is there a reason why it hasn't been implemented yet (performance issues perhaps)? I could try to implement it, though I'm not making any promises.
Most helpful comment
@ReubenBond Looks nice, ill try to fix this within few next days