Orleans: [Question] Orleans transaction deadlocks

Created on 7 Jan 2019  路  6Comments  路  Source: dotnet/orleans

I was looking at the Orleans transaction code (as I was trying to implement my own transactions for JournaledGrains), and I noticed some code that will produce deadlocks. I may be wrong, but I would like to raise this question anyway (It will help me avoid or resolve deadlocks in my own implementation).

I noticed that ReaderWriterLock is local to each TransactionQueue and a TransactionQueue is instantiated for every TransactionalState.

There is a LockGroup currentGroup in ReaderWriterLock that is a linked list and when a new transaction requires a lock, a new group is appended to the back of the linked list. This means that locks are queued in the order they have arrived in, right? Doesn't that mean that it is possible that deadlocks can occur?

If there are 2 transactions T1 and T2 both affecting entities E1 and E2, E1 may have its lock group linked list as T1, T2, and E2 may have its lock group as T2, T1.

I would think that the lock groups are ordered by their priority.

Most helpful comment

@isengrim613 See @ReubenBond's blog post with more details on the performance investigation and improvement - https://blogs.msdn.microsoft.com/orleans/2018/12/07/solving-a-transactions-performance-mystery/.

All 6 comments

For transactional resources involved in multiple concurrent transactions, due to the distributed nature of the system, it is possible for state changes to occur in conflicting order, leading to deadlocks. It is unclear how common this case will be as it's highly dependent on the application logic and degree of concurrency. If this happens, the conflicting transactions will fail due to rwlock timeout.

We've spent a fair amount of time investigating how to avoid this, but no generic solution has been found. All solutions found so far depend on cooperation from the application logic. For instance, if resources are accessed in the same order by all concurrent transactions, such deadlocks are avoidable, but the application logic would need be written to act this way, so this is not a generic solution we can build into the infrastructure.

If this happens, the conflicting transactions will fail due to rwlock timeout.

In your experience/tests, what would be an average/ballpark rwlock timeout that I should start looking at? If we are dealing with deadlocks and trying to be highly available, I would imagine this timespan to be quite small. Maybe 3 or 5 seconds? or should it be even shorter?

For instance, if resources are accessed in the same order by all concurrent transactions

Even if resources are accessed in the same order by all concurrent transactions, wouldn't the distributed nature and the unpredictability of network delay possibly cause the messages to reach the resources in a different order that they are accessed in......? Oh yes, you're right. if the application code does,

await resource1.operation();
await resource2.operation();

instead of

await Task.WhenAll(resource1.operation(), resource2.operation());

Then, there is a much lower risk of deadlocks.

For now, I am ordering my lock queue using priority (or my version of priority), which is the time the transaction started. Not fool proof, but I guess it's a start for me to harden.

I would imagine this timespan to be quite small.

For TransactionalState facet we set the lock timeout to 8 seconds by default and made it configurable via TransactionalStateOptions.LockTimeout.

For now, I am ordering my lock queue using priority (or my version of priority), which is the time the transaction started. Not fool proof, but I guess it's a start for me to harden.

The issue here is clock skew. Timestamps from multiple machines in a distributed system will not all be perfectly in synch. Part of the reason for locking the resource is that transaction ordering is not established at the start of the transaction. Instead, it's determined by reconciling the timestamps of all of the participants and taking the largest. Please see CausalClock. The alternative here is to have a centralized transaction id generator which produces ordered ids that can be used for strict transaction ordering. To avoid such a single point of failure (and contention), the ordering information for the transaction cannot be known until the prepare phase.

For TransactionalState facet we set the lock timeout to 8 seconds by default and made it configurable via TransactionalStateOptions.LockTimeout.

Thank you for the suggestion. I will probably use 5 seconds.

The issue here is clock skew.

Yeah, I read the other thread on using timestamp that will cause unnecessary rollbacks if there is clock skew. Fortunately, since i'm writing specific application code instead of a generic solution, I can make the grains that start transactions sync their casual clocks with each other (Or maybe I could make every silo sync with each other, assuming we have negligible clock skew rate).

Thank you for your time in explaining the concepts to me. It helped me pick out a (application specific) solution.

Out of curiosity, the 2.2.0 release of Orleans sets transactions as production ready code and it boasts an up to 4x performance improvement in transactions. I haven't dove into the changes, but is the boost from changing using priority to using locks to order transactions?

@isengrim613 See @ReubenBond's blog post with more details on the performance investigation and improvement - https://blogs.msdn.microsoft.com/orleans/2018/12/07/solving-a-transactions-performance-mystery/.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

gabikliot picture gabikliot  路  4Comments

Vlad-Stryapko picture Vlad-Stryapko  路  3Comments

bobanco picture bobanco  路  3Comments

SebastianStehle picture SebastianStehle  路  4Comments

Liversage picture Liversage  路  4Comments