Azure-functions-durable-extension: Durable entity stays locked

Created on 17 Apr 2020 · 20Comments · Source: Azure/azure-functions-durable-extension

Description

I am using a durable entity to store an identifier that is globally shared. I'm using a critical section, to ensure that a specific activity runs exclusively. This is based on the bank account example in the documentation.

During development, I notice that when something happens inside this critical section, the context/entity stays locked, and will never unlock. I have no idea how to fix this, except for clearing the storage account. Such events typically involve hitting a breakpoint/exception and then stopping the process.

Unfortunately, this is not limited to during development. I also notice this on production (in Azure). I suspect that the host gets killed/restarted in the middle of a critical section. It then starts again and won't continue.

The DurableTask.Analyzers do not indicate any problems. I am not aware of violating any of the critical section rules.

Expected behavior

Issue should be detected by runtime (after a restart, timeout?), or at least a documented method exists to unlock the context/entity.

Actual behavior

The context/entity locks, fails mid-way and never unlocks. Subsequent invocations halt because the lock is held.

The runtime logs messages such as the one below, for every invocation (I'm using a timer to start an orchestrator every minute):

Function '*** (Orchestrator)' is waiting for input. Reason: WaitForLockAcquisitionCompleted:aac581b8-c88d-5ef3-a31d-87fd45eca3de.

Relevant source code snippets

I have a reproducable example here: https://github.com/basilfx/Azure-EntityLocked. The README.md includes steps to reproduce.

Known workarounds

Clear the storage account, then restart the function.

App Details

C#, Extension version 2.2.1 (also 2.1.1 is affected). Functions V2 (locally and in Azure).

Screenshots

Not applicable.

If deployed to Azure

I cannot share all details of similar code that runs in production. The issue can be reproduced locally.

bug

Source

basilfx

👍2

All 20 comments

Hi @basilfx, thanks for giving us a complete solution for repro, very helpful indeed!

~~Looks like a message is lost somehow, causing the orchestration to hang and never release the lock. I'll need to investigate more why this is happening.~~

Oddly, the message seemed to reappear while I was typing this comment and the orchestration completed! So I think what's happening with the criticial section is that the message delivery mechanism is "recovering" from the shutdown as designed, but only slowly:

Azure queue delivers the lock grant message to orchestration
the orchestration enters the critical section and stops at breakpoint
stopping the application at this point means orchestration work item does not finish
therefore lock grant message does not get permanently deleted from the queue yet
after the visibility timeout (5 minutes) the message reappears in queue
If app is still running, message is delivered to orchestration
the orchestration resumes the critical section, completes it, and releases the lock

So it is actually working correctly, it is just recovering from the "crash" rather slowly because of how this design uses the Azure Queues visibility timeout to ensure reliable delivery.

I validated that this by watching the message appear in the control queue, using storage explorer. When I shut down the app at the breakpoint all control queues look empty but five minutes later the message reappears in one of them.

Does this address your concerns? Or were there more issues with this example that we need to be aware of?

sebastianburckhardt on 17 Apr 2020

Hi @sebastianburckhardt, thanks for your response. I'll look into your answer in detail on Monday, but so far I have this issue in a production application (similar to the code provided), where this issue does not resolve (itself). It is in this 'locked' state for more than a few days already, with the same 'WaitForLockAcquisitionCompleted' message over and over. I did not restart this function, yet.

basilfx on 17 Apr 2020

I do observe recovery on my local machine with the example code, but on my production environment (with very similar code), I don't observe that:

The context/entity is locked since 27th of March, after an activity threw an exception. From AI I can see that the orchestrator was invoked after the exception, and resulted with state 'Failed'.
I did not observe a host restart/crash around that timestamp.

So maybe something else is going on, that results in similar problems.

I'm now going to restart the function, and see if it picks up.

basilfx on 20 Apr 2020

So the restart (with a pause of >10 min) did not help. It is still locked.

I tried to mimic the production situation: I modified the example to throw an exception in the activity, but it still continues to run. However, on production this was the moment where it got locked.

basilfx on 20 Apr 2020

Sounds to me like there is a bug somewhere. My guess it that the exception in the activity had something to do with it. I will have a closer look at the exception paths.

sebastianburckhardt on 20 Apr 2020

I tested the exception paths but they seem to work as intended except for one small corner case.

If for some reason Dispose() is not called at the end of the critical section and it throws an exception we fail to release locks (this is a bug I want to fix) and that could explain the behavior.

However, if you are using the using clause correctly, as I am quite sure you do (as in the repro), Dispose is always called so that is not the explanation.

sebastianburckhardt on 20 Apr 2020

I was able to gather some more details by looking at our internal logs and using the instance ids you indicated.

The lock is apparently held up by the orchestration TableTriggerLogTimer-Orchestrator with instance id "62066f52a90f4990bacfd912c5fad62a" - that orchestration does not complete and does not return the lock so things are stuck.

The logs indicate that this orchestration successfully called an activity TableTriggerLogTimer-Activity, and that activity then failed. It looks like the orchestrator then observed that exception but did not exit the critical section nor terminate, but instead went into some mysterious await.

Was your code catching the exception and then calling an await? For example, Task.Delay(..)? If so that can explain the problem because it is unsafe to call await in an orchestrator unless you are awaiting one of the framework-provided asynchronous operations.

sebastianburckhardt on 20 Apr 2020

@sebastianburckhardt Thanks for your replies.

The exact code of the orchestrator in production is similar to the example. No mysterious await calls, the only difference is the context.IsLocked before the critical section (which also does not work, but probably because I use it wrong).

I cannot share the activity code, but it is connecting to some SQL database, fetches a few rows and returns an identifier. It runs in about 1-60 seconds range.

The function itself runs on a consumption plan.

basilfx on 20 Apr 2020

Yes, context.IsLocked is not meant for that purpose, it just tells you whether you are inside a critical section (so it would always return false in your example).

The activity code is not important... the mystery is in the orchestrator, specifically why it does not complete and release the lock after receiving the exception. Can you see the status and history of the instance "62066f52a90f4990bacfd912c5fad62a" in your table storage? Maybe there are some clues there.

sebastianburckhardt on 20 Apr 2020

Unfortunately I had to wipe the whole storage, since it was the easiest/only way to get the function running again.

I do remember that at the time of the exception, the last successful invocation had status 'Completed', but the one after the exception (and the others after that) had status 'Running'.

After taking a second look at the exception thrown by the orchestrator, it is a bit different exception than the one by the function (activity: SocketException, orchestrator: EndOfStreamException, but database related). I was presuming that the exception thrown by the orchestrator would be the one thrown by the activity.

The stack trace of the orchestrator exception traces back to dependency injection startup code (do note that the orchestrator does not rely on any dependency). The production code uses constructor-based dependency injection, and has a scoped database connection. My hypothesis is that the orchestrator code did not even run, because it already failed constructing the function. So the easy fix would be to move the activity to another function class (including the database connection, which is only used by the activity).

But if my hypothesis is correct, then this is something that should not lock the orchestrator, right?

basilfx on 20 Apr 2020

Yes, what you are saying could be right I think. I noticed that the logs are not replaying the orchestrator correctly, perhaps because it fails immediately when trying to construct the object.

I also noticed that you are using a non-static function

[FunctionName("TableTriggerLogTimer-Orchestrator")]
public async Task OrchestratorAsync(

which I would recommend to replace with a static one

[FunctionName("TableTriggerLogTimer-Orchestrator")]
public static async Task OrchestratorAsync(

because we have seen some issues around that (see #1293). However I want to confirm that this is really what is happening before drawing any definite conclusions.

sebastianburckhardt on 21 Apr 2020

👍1

I have confirmed that throwing an exception in the constructor of the object that contains the orchestrator function causes the same symptoms I saw in the log : the orchestration consumes the event but does not actually execute (stays in running state). So that is indeed the likely explanation.

Clearly, the current behavior is not ideal, we should handle this more gracefully. We should perhaps terminate the orchestration with a nondeterminism exception, since that is sort of what it is (it is a failure to deterministically replay previous behavior of the orchestrator).

That still leaves the problem of releasing all the locks for an orchestrator that does not replay; I may need to write some custom code that can determine the held locks from the history without having to replay the orchestrator.

sebastianburckhardt on 21 Apr 2020

I've changed the orchestrator to a static function, which is an acceptable solution. I'll keep monitoring in the mean while, and will follow this issue for the definitive solution.

Thanks for the help!

basilfx on 21 Apr 2020

We'll need to check and/or fix the following things:

Make sure to catch constructor failures for non-static orchestrators, entities, and activities
Make sure to release all locks of a failed orchestrator, no matter how it fails

sebastianburckhardt on 22 Apr 2020

Point 1. above has been addressed by #1336.

sebastianburckhardt on 28 Apr 2020

@sebastianburckhardt

That still leaves the problem of releasing all the locks for an orchestrator that does not replay; I may need to write some custom code that can determine the held locks from the history without having to replay the orchestrator.

For those of us that have encountered this issue, is there a way to recover from this scenario without nuking the storage?

If not what is the best way of hard resetting the storage account?

Skylertodd on 14 Aug 2020

👀1

I can't think of a simple solution. My best idea right now is to implement a "repair" operation on entities that releases the lock if it belongs to a non-running orchestration. To start with we can make this explicitly invokable, and then maybe automate it at some point.

sebastianburckhardt on 18 Aug 2020

I am considering adding the following operation to IDurableEntityClient; this would also address #1065.

/// <summary>
/// Removes unnecessary metadata and releases orphaned locks. Affects only entities that have been idle for at least 30 minutes.
/// </summary>
/// <remarks>Locks are considered orphaned if the orchestration that holds them is no longer running for some reason.</remarks>
/// <param name="cancellationToken">Cancellation token that can be used to cancel the query operation.</param>
/// <returns>A task that completes when the operation is finished for all entities.</returns>
Task CleanEntityStorageAsync(CancellationToken cancellationToken);

Internally this would be implemented using existing query functionality.

sebastianburckhardt on 18 Aug 2020

Orphaned locks can now be fixed by cleaning the entity storage.

sebastianburckhardt on 5 Oct 2020

Just grave digging here to say wow @basilfx, what a well-written bug report. Exemplary!