Azure-functions-durable-extension: Durable timer fire-at events not being handled

Created on 13 Mar 2019 · 10Comments · Source: Azure/azure-functions-durable-extension

Describe the bug

I have a sub-orchestration that monitors a long running process by using two different durable timers. One durable timer is used to timeout the monitoring task if the task hasn't completed within a certain threshold (e.g. 12 hours). The other durable timer is used control the frequency of polling the status of the request (e.g. once every 20 minutes). All of this is running under a consumption plan app service. What I am seeing is that timer fired messages become visible in control queues; however, the scale controller is not waking up my function app to process the timer fired events (both the polling interval timer and the timeout timer). If I invoke a function in the function app or otherwise wake up the function app by restarting the app service from the portal, the messages are processed and the orchestration continues. According to the documentation about timers in durable functions, "...when running in the Azure Functions consumption plan, the newly visible timer message will ensure that the function app gets activated..."

Investigative information

Durable Functions extension version: 1.7.1
Function App version (1.0 or 2.0): 2.0
Runtime version: 2.0.12342.0
Programming language used: C#
Orchestration ID: 279a0bd455d8466882be52f9c249f076:1
Excecution ID: 0e166e33240644dbbf6208caff8920ee
Timestamp: 2019-03-12T23:28:13.258Z
Region: West Europe

| PartitionKey | RowKey | Timestamp | EventType | ExecutionId | FireAt |
|----------------------------------|----------------|------------------------|-----------|--------------------------------|------------------------|
|279a0bd455d8466882be52f9c249f076:1|000000000000001A|2019-03-13T13:36:57.092Z|TimerFired |0e166e33240644dbbf6208caff8920ee|2019-03-13T10:47:17.726Z|
|279a0bd455d8466882be52f9c249f076:1|000000000000001B|2019-03-13T13:36:57.092Z|TimerFired |0e166e33240644dbbf6208caff8920ee|2019-03-12T23:48:13.233Z|

The first row is the timeout timer. It should have fired at 2019-03-13T10:47:17.726Z. The control queue message became visible on time, but the message itself was not handled until the App Service was touched (2019-03-13T13:36:57.092Z), 2.8 hours late.

The second row is the polling timer. It should have fired at 2019-03-12T23:48:13.233Z. The control queue message became visible on time, but the message itself was not handled until the App Service was touched (same value as the first row), 13.8 hours late.

Here is the sub-orchestration code

``` c#
public static class BatchOrchestrator
{
[FunctionName(nameof(BatchOrchestrator))]
public static async Task Run([OrchestrationTrigger] DurableOrchestrationContextBase orchestrationContext)
{
var batchRequest = orchestrationContext.GetInput();

    var batchActivityResult = await orchestrationContext.CallActivityAsync<Result<BatchResponse>>(nameof(BatchActivity), batchRequest);

    if (!batchActivityResult.Ok) return new Result(batchActivityResult);

    var taskId = batchActivityResult.Data.TaskId;

    using (var cancellationTokenSource = new CancellationTokenSource())
    {
        var timeoutInterval = TimeSpan.FromHours(12);
        var timeoutFireAt = orchestrationContext.CurrentUtcDateTime.Add(timeoutInterval);
        var timeoutTask = orchestrationContext.CreateTimer(timeoutFireAt, cancellationTokenSource.Token);

        while (!timeoutTask.IsCompleted)
        {
            var pollingInterval = TimeSpan.FromMinutes(20);
            var pollingFireAt = orchestrationContext.CurrentUtcDateTime.Add(pollingInterval)
            await orchestrationContext.CreateTimer(pollingFireAt, CancellationToken.None);

            var checkTaskActivityResult = await orchestrationContext.CallActivityAsync<Result<TaskStatus>>(nameof(CheckTaskActivity), taskId);

            if (!checkTaskActivityResult.Ok)
            {
                cancellationTokenSource.Cancel();
                return new Result(checkTaskActivityResult); 
            }

            var taskStatus = checkTaskActivityResult.Data;

            if (taskStatus.Status != "Completed") continue;

            cancellationTokenSource.Cancel();

            return taskStatus.ExitCode == 0
                ? new Result()
                : new Result(nameof(CheckTask), HttpStatusCode.ExpectationFailed, taskStatus.FailureCategory, taskStatus.FailureMessage);
        }
    }

    return new Result(nameof(BatchOrchestrator), HttpStatusCode.RequestTimeout);
}

}
```

Expected behavior

After a function app has gone to sleep, the scale controller should wake up a function app when durable timer fire-at messages become visible in control queues.

Actual behavior

Timer fired events become visible in the durable function app's control queues; however, they are not picked up for processing unless the function app is touched (either by restarting the app service, or expanding a function in the portal, or by invoking an http triggered function in the function app)

Known workarounds

Keep app service alive with timer triggered function

Additional context

Development environment
- Visual Studio, C#, netcoreapp2.1
- Microsoft.Azure.WebJobs.Extensions.DurableTask 1.7.1
- Microsoft.Azure.WebJobs.Extensions.Storage 3.0.4
- Microsoft.NET.Sdk.Functions 1.0.26

azure-app-service bug fix-ready

Source

beckettk

Most helpful comment

@pejohans yes, this fix has reached production and is fully rolled out. Sorry for the delay in closing.

cgillum on 1 Aug 2019

👍2

All 10 comments

Thanks for the details. I was able to find the following error in the scale controller logs, which I suspect is related:

System.ArgumentException: Invalid queue name. Check MSDN for more information about valid queue naming.
   at Microsoft.WindowsAzure.Storage.NameValidator.ValidateShareContainerQueueHelper(String resourceName, String resourceType)
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.GetQueueInternal(CloudQueueClient queueClient, String taskHub, String suffix)
   at DurableTask.AzureStorage.Monitoring.DisconnectedPerformanceMonitor.<UpdateQueueMetrics>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.AzureStorage.Monitoring.DisconnectedPerformanceMonitor.<PulseAsync>d__21.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.AppService.ScaleController.ActivationTasks.DurableFunctionsActivationTask.<OnExecuteAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.AppService.ScaleController.ActivationTask.<ExecuteAsync>d__42.MoveNext()

This prevents the scale controller from monitoring your queues, which is why the timer messages are not waking the app back up.

There is definitely a bug here in the scale controller, especially if your app runs fine on its own. It could be related to the fact that the scale controller uses a different version of the Azure Storage SDK (an earlier one).

Unfortunately the error message doesn't tell why the queue name is invalid. To see if there is a workaround, could you try changing your task hub name to something simpler and then redeploying?

cgillum on 14 Mar 2019

Actually, by chance are you using an app setting expression for your task hub name? I suspect the scale controller may not be correctly resolving the app setting expression, resulting in this error.

cgillum on 14 Mar 2019

To see if there is a workaround, could you try changing your task hub name to something simpler and then redeploying?

The control and workitems queue names are 38 and 39 characters in length respectively (well below the 63 character limit) and are composed of nothing but lowercase, alphanumeric, and dash characters.

Are you using an app setting expression for your task hub name?

Yes.

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "%TaskHubName%"
    }
  }
}

beckettk on 14 Mar 2019

Thanks for confirming. I looked at the scale controller code and unfortunately it doesn't handle this correctly. Until a scale controller fix is deployed to App Service, the workaround is to use hardcoded task hub names.

cgillum on 14 Mar 2019

We are currently using a timer-triggered function to keep the app service alive and this is working well. I'll report back to my team to see how they would like to proceed. Thank you for taking the time to investigate.

beckettk on 14 Mar 2019

A timer trigger will work to ensure the app stays running. The downside of this approach is that it won't be enough to allow the function app to scale out to multiple instances. For that, you'll need the scale controller to be able to monitor the task hub in your storage account. I'll leave this issue open to track the fix.

cgillum on 14 Mar 2019

That is valuable information. I will feed it back to my team to see how they would like to proceed. Thanks again.

beckettk on 14 Mar 2019

A fix has been merged into the internal Azure App Service repo which correctly handles task hub names containing app settings references. However, it will likely take 1-2 months before it reaches production due to the current deployment schedule. I will leave this issue open until it has been deployed everywhere.

cgillum on 5 Apr 2019

What is the status of this fix, has it reached production yet? Is it fully rolled out?

pejohans on 1 Aug 2019

@pejohans yes, this fix has reached production and is fully rolled out. Sorry for the delay in closing.

cgillum on 1 Aug 2019

👍2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

What is the expected behavior if we are waiting multiple times on the same event name, but with a different shape of data?

SimonLuckenuik · 3Comments

Support for HttpRequest/IActionResult in HTTP-specific APIs

YodasMyDad · 3Comments

Make Activity Functions Testable by Creating IDurableActivityContext or DurableActivityContextBase

jakingmsft · 4Comments

Recover from accidental deletion of Task Hub components

anhhnguyen206 · 3Comments

App Lease allows all apps to process activity messages from the work item queue

amdeel · 3Comments