Describe the bug
I have a sub-orchestration that monitors a long running process by using two different durable timers. One durable timer is used to timeout the monitoring task if the task hasn't completed within a certain threshold (e.g. 12 hours). The other durable timer is used control the frequency of polling the status of the request (e.g. once every 20 minutes). All of this is running under a consumption plan app service. What I am seeing is that timer fired messages become visible in control queues; however, the scale controller is not waking up my function app to process the timer fired events (both the polling interval timer and the timeout timer). If I invoke a function in the function app or otherwise wake up the function app by restarting the app service from the portal, the messages are processed and the orchestration continues. According to the documentation about timers in durable functions, "...when running in the Azure Functions consumption plan, the newly visible timer message will ensure that the function app gets activated..."
Investigative information
| PartitionKey | RowKey | Timestamp | EventType | ExecutionId | FireAt |
|----------------------------------|----------------|------------------------|-----------|--------------------------------|------------------------|
|279a0bd455d8466882be52f9c249f076:1|000000000000001A|2019-03-13T13:36:57.092Z|TimerFired |0e166e33240644dbbf6208caff8920ee|2019-03-13T10:47:17.726Z|
|279a0bd455d8466882be52f9c249f076:1|000000000000001B|2019-03-13T13:36:57.092Z|TimerFired |0e166e33240644dbbf6208caff8920ee|2019-03-12T23:48:13.233Z|
The first row is the timeout timer. It should have fired at 2019-03-13T10:47:17.726Z. The control queue message became visible on time, but the message itself was not handled until the App Service was touched (2019-03-13T13:36:57.092Z), 2.8 hours late.
The second row is the polling timer. It should have fired at 2019-03-12T23:48:13.233Z. The control queue message became visible on time, but the message itself was not handled until the App Service was touched (same value as the first row), 13.8 hours late.
Here is the sub-orchestration code
``` c#
public static class BatchOrchestrator
{
[FunctionName(nameof(BatchOrchestrator))]
public static async Task
{
var batchRequest = orchestrationContext.GetInput
var batchActivityResult = await orchestrationContext.CallActivityAsync<Result<BatchResponse>>(nameof(BatchActivity), batchRequest);
if (!batchActivityResult.Ok) return new Result(batchActivityResult);
var taskId = batchActivityResult.Data.TaskId;
using (var cancellationTokenSource = new CancellationTokenSource())
{
var timeoutInterval = TimeSpan.FromHours(12);
var timeoutFireAt = orchestrationContext.CurrentUtcDateTime.Add(timeoutInterval);
var timeoutTask = orchestrationContext.CreateTimer(timeoutFireAt, cancellationTokenSource.Token);
while (!timeoutTask.IsCompleted)
{
var pollingInterval = TimeSpan.FromMinutes(20);
var pollingFireAt = orchestrationContext.CurrentUtcDateTime.Add(pollingInterval)
await orchestrationContext.CreateTimer(pollingFireAt, CancellationToken.None);
var checkTaskActivityResult = await orchestrationContext.CallActivityAsync<Result<TaskStatus>>(nameof(CheckTaskActivity), taskId);
if (!checkTaskActivityResult.Ok)
{
cancellationTokenSource.Cancel();
return new Result(checkTaskActivityResult);
}
var taskStatus = checkTaskActivityResult.Data;
if (taskStatus.Status != "Completed") continue;
cancellationTokenSource.Cancel();
return taskStatus.ExitCode == 0
? new Result()
: new Result(nameof(CheckTask), HttpStatusCode.ExpectationFailed, taskStatus.FailureCategory, taskStatus.FailureMessage);
}
}
return new Result(nameof(BatchOrchestrator), HttpStatusCode.RequestTimeout);
}
}
```
Expected behavior
After a function app has gone to sleep, the scale controller should wake up a function app when durable timer fire-at messages become visible in control queues.
Actual behavior
Timer fired events become visible in the durable function app's control queues; however, they are not picked up for processing unless the function app is touched (either by restarting the app service, or expanding a function in the portal, or by invoking an http triggered function in the function app)
Known workarounds
Keep app service alive with timer triggered function
Additional context
Thanks for the details. I was able to find the following error in the scale controller logs, which I suspect is related:
System.ArgumentException: Invalid queue name. Check MSDN for more information about valid queue naming.
at Microsoft.WindowsAzure.Storage.NameValidator.ValidateShareContainerQueueHelper(String resourceName, String resourceType)
at DurableTask.AzureStorage.AzureStorageOrchestrationService.GetQueueInternal(CloudQueueClient queueClient, String taskHub, String suffix)
at DurableTask.AzureStorage.Monitoring.DisconnectedPerformanceMonitor.<UpdateQueueMetrics>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at DurableTask.AzureStorage.Monitoring.DisconnectedPerformanceMonitor.<PulseAsync>d__21.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.AppService.ScaleController.ActivationTasks.DurableFunctionsActivationTask.<OnExecuteAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.AppService.ScaleController.ActivationTask.<ExecuteAsync>d__42.MoveNext()
This prevents the scale controller from monitoring your queues, which is why the timer messages are not waking the app back up.
There is definitely a bug here in the scale controller, especially if your app runs fine on its own. It could be related to the fact that the scale controller uses a different version of the Azure Storage SDK (an earlier one).
Unfortunately the error message doesn't tell why the queue name is invalid. To see if there is a workaround, could you try changing your task hub name to something simpler and then redeploying?
Actually, by chance are you using an app setting expression for your task hub name? I suspect the scale controller may not be correctly resolving the app setting expression, resulting in this error.
To see if there is a workaround, could you try changing your task hub name to something simpler and then redeploying?
The control and workitems queue names are 38 and 39 characters in length respectively (well below the 63 character limit) and are composed of nothing but lowercase, alphanumeric, and dash characters.
Are you using an app setting expression for your task hub name?
Yes.
{
"version": "2.0",
"extensions": {
"durableTask": {
"hubName": "%TaskHubName%"
}
}
}
Thanks for confirming. I looked at the scale controller code and unfortunately it doesn't handle this correctly. Until a scale controller fix is deployed to App Service, the workaround is to use hardcoded task hub names.
We are currently using a timer-triggered function to keep the app service alive and this is working well. I'll report back to my team to see how they would like to proceed. Thank you for taking the time to investigate.
A timer trigger will work to ensure the app stays running. The downside of this approach is that it won't be enough to allow the function app to scale out to multiple instances. For that, you'll need the scale controller to be able to monitor the task hub in your storage account. I'll leave this issue open to track the fix.
That is valuable information. I will feed it back to my team to see how they would like to proceed. Thanks again.
A fix has been merged into the internal Azure App Service repo which correctly handles task hub names containing app settings references. However, it will likely take 1-2 months before it reaches production due to the current deployment schedule. I will leave this issue open until it has been deployed everywhere.
What is the status of this fix, has it reached production yet? Is it fully rolled out?
@pejohans yes, this fix has reached production and is fully rolled out. Sorry for the delay in closing.
Most helpful comment
@pejohans yes, this fix has reached production and is fully rolled out. Sorry for the delay in closing.