Azure-functions-durable-extension: Orchestration function hangs on Pending state

Created on 3 Nov 2019 · 18Comments · Source: Azure/azure-functions-durable-extension

Description

Sometimes orchestration function can stuck in 'Pending' status. I guess it happens because starting function and setting status is not transnational operation (#520 ). The issue (#520 ) was closed on Nov 22, 2018, but I was able to reproduce it a month ago.

Situation becomes more dramatic if we have code like below (when a new instance should be started only when previous instance completed):

        [FunctionName(nameof(CheckAvailableSqlBatchesTimer))]
        public async Task Run(
            [TimerTrigger("%DataLakeMigrationSchedule%")] TimerInfo timer,
            [OrchestrationClient] DurableOrchestrationClientBase client)
        {
            const string instanceId = nameof(MigrateToDataLakeOrchestrator);
            var instance = await client.GetStatusAsync(instanceId);

            if (instance == null || !(instance.RuntimeStatus == OrchestrationRuntimeStatus.Pending ||
                instance.RuntimeStatus == OrchestrationRuntimeStatus.Running ||
                instance.RuntimeStatus == OrchestrationRuntimeStatus.ContinuedAsNew))
            {
                await client.StartNewAsync(
                    nameof(MigrateToDataLakeOrchestrator),
                    instanceId,
                    null);
            }
            else
            {
                _telemetry.TrackTrace(
                    $"{nameof(MigrateToDataLakeOrchestrator)} function execution was skipped, " +
                    "as there is another instance running.");
            }
        }

In that case a new orchestration instance will never started because previous one stuck in Pending state.

Support information:

Durable Functions extension version: 1.8.3
Function App version: 2.0
Programming language used: C#

bug dtfx

Source

spzSource

All 18 comments

Hi. Can you share the ID of the orchestration that got stuck, and the approximate time and region? I would like to investigate to see what happened in your case.

cgillum on 3 Nov 2019

Hi @cgillum

Just wanted to gather all required information, but noticed that DurableFunctionsHubInstances table was already cleared up. I've configured one more env, which are very close to production in order to reproduce this issue one more time. It's good that the issue is reproduced quite often.

However I've grabbed a bit of information about the latest executable instance from App Insights:

InstanceId: MigrateToDataLakeOrchestrator
InvocationId: d445a421-3b1e-429a-8a5b-3511802196f5
HostInstanceId: 8c434037-ca47-4b60-ba02-c726740d9943
OperationId: 0a6635df3ca3c14aa9c3998a4c043776
Region: NorthEurope
Function app name: faneuruatqas13uatrsrce
Timeframe issue observed (UTC): 2019-10-08T11:26:40.4567725Z - 2019-10-08T11:27:00.173402Z

Timestamp 2019-10-08T11:27:00.173402Z is a time of last run of MigrateToDataLakeOrchestrator function. After that, the timer (see code above) was not able to start a new instance of orchestration function.

Please let me know if provided info is not enough or additional information required. I'm happy to assist.

spzSource on 3 Nov 2019

I've found more on dev environment:

InstanceId: MigrateToDataLakeOrchestrator
Region: NorthEurope
Function app name: faneurfalsbx13rlixqvwtmo
Timeframe issue observed (UTC): 2019-09-28T02:26:00.010Z - 2019-09-28T02:28:00.010Z

Records from _DurableFunctionsHubInstances_ table:

| PartitionKey | CreatedTime | Execution Id | Input | LastUpdateTime | Name | RuntimeStatus | TimeStamp | TaskHubName |
| -------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |
| MigrateToDataLakeOrchestrator | 2019-09-28T02:27:00.251Z | null | null | 2019-09-28T02:27:00.035Z | MigrateToDataLakeOrchestrator | Pending | 2019-09-28T02:27:00.010Z | DurableFunctionsHub |

===

InstanceId: MigrateToSdmOrchestrator
Region: NorthEurope
Function app name: faneurfalsbx13rlixqvwtmo
Timeframe issue observed (UTC): 2019-09-07T06:09:00.183Z - 2019-09-07T06:11:00.183Z

Records from _DurableFunctionsHubInstances_ table:

| PartitionKey | CreatedTime | Execution Id | Input | LastUpdateTime | Name | RuntimeStatus | TimeStamp | TaskHubName |
| -------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |
| MigrateToSdmOrchestrator| 2019-09-07T06:10:00.481Z | null | null | 2019-09-07T06:10:00.265Z | MigrateToSdmOrchestrator| Pending | 2019-09-07T06:10:00.481Z | DurableFunctionsHub |

spzSource on 5 Nov 2019

👍1

I'm taking a look at the telemetry for your recent MigrateToDataLakeOrchestrator created at 2019-10-08 11:27:00 UTC (thank you for these details, btw) but it's not immediately clear to me why your orchestration got stuck in the Pending status. It looks like everything should have updated correctly. It's worth noting too that it appears to have completed successfully, so the problem seems to be simply that the instance status was not correctly updated.

One thing that's different between this execution and the previous one (which I assume worked correctly) is that the latter orchestration execution completed incredibly quickly (less than 250ms). I suspect there could be a race condition somewhere. One thing you could try is introducing a delay in your orchestration - like a durable timer that sleeps for 1 second - to see if that eliminates the problem.

Either way, let me know if you see other instances of this. I think I'll need to study the code a bit more to see if I can detect a possible race condition with instance status updates.

cgillum on 5 Nov 2019

Hi @cgillum

I suspect there could be a race condition somewhere.

Do you think that it might be a race condition between executions of orchestration function? I was able to reproduce the issue even when timer interval was configured for 5 minutes. My assumption is that it might be a race condition in scope of single function execution somewhere in durable framework.

Either way, let me know if you see other instances of this. I think I'll need to study the code a bit more to see if I can detect a possible race condition with instance status updates.

Thanks! From my side, currently I'm tracking two environments in order to reproduce the issue again. I'll let you know when the issue appears.

spzSource on 6 Nov 2019

@spzSource

Do you think that it might be a race condition between executions of orchestration function? I was able to reproduce the issue even when timer interval was configured for 5 minutes. My assumption is that it might be a race condition in scope of single function execution somewhere in durable framework.

I agree that I think it's a race within a single execution of an orchestration - i.e. it completes too quickly which messes up our instance status update logic. I thought we correctly handled these cases, but I'm wondering if there is a case we missed which you have encountered.

cgillum on 7 Nov 2019

Hi, yes. @cgillum you can reproduce it. Reference #1038

(moved it to separate issue to provide more clarity)

petr-stupka on 10 Nov 2019

Hi @cgillum

Either way, let me know if you see other instances of this.

The issue appeared one more time on Thursday. And again seems that function has completed, but status remained 'Pending'.

InstanceId: MigrateToSdmOrchestrator
InvocationId: 106af85c-64ed-442f-8ae2-8716876ce298
HostInstanceId: 11e2e251-3ec0-47ae-b3da-b71b5e377113
OperationId: cff3e9190e92ce458a64afe547602ff1
Region: NorthEurope
Function app name: faneurtstqas13fo7d32tlve
Timeframe issue observed (UTC): 2019-11-07T12:08:00.16277Z - 2019-11-07T12:10:00.16277Z

Record in DurableFunctionsHubInstances:

| PartitionKey | CreatedTime | Execution Id | Input | LastUpdateTime | Name | RuntimeStatus | TimeStamp | TaskHubName |
| -------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |--------------- |
| MigrateToSdmOrchestrator| 2019-11-07T12:09:00.012Z | null | null | 2019-11-07T12:09:00.024Z | MigrateToSdmOrchestrator| Pending | 2019-11-07T12:09:00.206Z | DurableFunctionsHub |

I didn't mention it in my previous comments, but in my case a function host uses extended sessions with following configuration:

"durableTask": {
    "logReplayEvents": false,
    "controlQueueBatchSize": 4,
    "traceInputsAndOutputs": true,
    "extendedSessionsEnabled": true,
    "extendedSessionIdleTimeoutInSeconds": 30,
    "maxConcurrentActivityFunctions": 3,
    "maxConcurrentOrchestratorFunctions": 3,
    "maxQueuePollingInterval": "00:00:10"
}

I'm thinking is it possible that it might be a defect in durable history caching mechanism (extended sessions), which can potentially cause orchestration function to be in a wrong state.

spzSource on 10 Nov 2019

Hi @cgillum

I think I'll need to study the code a bit more to see if I can detect a possible race condition with instance status updates.

Today we faced same issue in production environment. Do you have any updates about root cause of such issue? How do you think, can update to 2.0 version fix the issue?

spzSource on 9 Dec 2019

Unfortunately I haven't had a chance to investigate this much further. We did fix some race conditions that we discovered in the 2.0 version. If you don't want to take on a breaking change release, you can alternatively add a reference to the Microsoft.Azure.DurableTask.AzureStorage v1.6.5 nuget package to get those same fixes. I don't know whether it will help with this particular situation, however.

If you can share the instance ID and approximate time range for the latest occurrence, then I can take another look to see if anything stands out that I didn't catch last time.

cgillum on 10 Dec 2019

Ah, I looked this again and I think I figured out what's happening.

This is indeed a race condition. We handle the case where a new orchestration is created from scratch. However, we don't correctly handle the race condition where an _existing_ orchestration instance is being overwritten by a new one (which appears to be the case for you)! Unfortunately my previous suggestion will not help you - the issue exists in the most up-to-date versions of Microsoft.Azure.DurableTask.AzureStorage.

I will look into fixing this.

cgillum on 10 Dec 2019

❤1

Hi @cgillum

Thank you!
I'm currently looking at temporal solution until the real fix come into place. Based on your knowledge, will it help to resolve race condition if I add a small delay inside my orchestration function as you suggested above?

spzSource on 10 Dec 2019

Yeah, I think a small delay will mitigate this. Something like Thread.Sleep(TimeSpan.FromSeconds(1)) at the end of your orchestrator function should be plenty.

Note that we normally tell people not to use Thread.Sleep in orchestrator functions, but there is no negative side effect if you do it right before the orchestrator function completes.

cgillum on 10 Dec 2019

👍1

Hi all,

I've got plenty of pending orchestrators from when I switched extendedSessionsEnabled on and off.
Is this something that https://github.com/Azure/durabletask/pull/347 will fix?

What's the best way to provide private information for you guy to review?

andreujuanc on 12 Dec 2019

@andreujuanc I'm aware of two possible causes of orchestrations stuck in the Pending state:

A bug in the extension where sometimes an orchestration completes so quickly that it completes before we have a chance to add the initial Pending status. This only happens when overwriting an existing instance. This is the problem being tracked by this issue and https://github.com/Azure/durabletask/pull/347 fixes this.
A user error when multiple apps (in Azure and/or on a local machine) are configured to the same storage account and task hub. This results in one app stealing messages from another, which can result in us being unable to start an orchestration. Durable Functions v2.x makes this a little less likely to happen, but it's still possible. The workaround is to ensure each of your apps uses a unique task hub name.

Note that in either case, whether the instance actually gets stuck is dependent on timing, so the problem is usually not 100% reproducible. The value of extendedSessionsEnabled has no impact on this behavior.

cgillum on 12 Dec 2019

Sorry I did made a mistake on my post. what extendedSessionsEnabled did was to leave the orchestrator on Running state.

Should I open a different issue for this?

andreujuanc on 12 Dec 2019

@andreujuanc No worries. Yes, please open a different issue for this and include the region, orchestration instance ID, and a general timeframe. Thanks!

cgillum on 12 Dec 2019

Hi @cgillum

Unfortunately the issue appeared again even with fix #347 in place. In overall this is only single case I've observed for a long time, which means that the issue happens not so frequent as before.

The code is pretty much the same:

        [FunctionName(nameof(CheckAvailableSqlBatchesTimer))]
        public async Task Run(
            [TimerTrigger("%DataLakeMigrationSchedule%")] TimerInfo timer,
            [DurableClient] IDurableOrchestrationClient client)
        {
            const string instanceId = nameof(MigrateBatchesOrchestrator);
            var instance = await client.GetStatusAsync(instanceId);

            if (instance == null || !(instance.RuntimeStatus == OrchestrationRuntimeStatus.Pending ||
                instance.RuntimeStatus == OrchestrationRuntimeStatus.Running ||
                instance.RuntimeStatus == OrchestrationRuntimeStatus.ContinuedAsNew))
            {
                await client.StartNewAsync(
                    nameof(MigrateBatchesOrchestrator),
                    instanceId);
            }
            else
            {
                _telemetry.TrackTrace(
                    $"{nameof(MigrateBatchesOrchestrator)} function execution was skipped, " +
                    "as there is another instance running.");
            }
        }

In that particular case the function MigrateBatchesOrchestrator stuck on Running state.

Correlation information:
_InstanceId:_ MigrateBatchesOrchestrator
_Region:_ NorthEurope
_Function app name:_ faneurjatcsl12cmnrsrce
_Time-frame issue observed (UTC):_ 2020-03-17T22:50:00.010Z - 2020-03-17T22:57:00.010Z

Records from DurableFunctionsHubInstances table:
| PartitionKey | RowKey | Timestamp | CreatedTime | ExecutionId | Input | LastUpdateTime | Name | RuntimeStatus |
|---|---|---|---|---|---|---|---|---|
| MigrateBatchesOrchestrator | | 2020-03-17T22:54:00.2321979Z | 2020-03-17T22:54:00.1585313Z | a739fd6bcbf04db0b7d624f3c62e7c85 | null | 2020-03-17T22:54:00.2127342Z | MigrateBatchesOrchestrator | Running |

Dependencies:
Microsoft.Azure.WebJobs.Extensions.DurableTask 2.1.0
Microsoft.Azure.DurableTask.AzureStorage 1.7.0
Microsoft.Azure.DurableTask.Core 2.2.0
Functions Runtime v3
.NET Runtime: NET Core 3.1

Please let me know if additional information required. I'm happy to assist.

spzSource on 20 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings