Azure-functions-durable-extension: JobHost stops when durable function is running

Created on 25 Jun 2019 · 15 Comments · Source: Azure/azure-functions-durable-extension

We have a durable function that runs an activity function 100 times, one activity function at a time. Each activity function takes approximately 30 seconds to complete. We've noticed that after running for around 20 minutes the JobHost is stopped and all activity is suspended for between 10 minutes to an hour before the JobHost starts up with a different HostInstanceId and the durable function continues.

This happens every time with 100 activities. If we reduce the number of times the activity function is called to 10 the issue does not occur. The issue also does not occur when running on a fixed S1 App Service plan. It only occurs on the consumption plan.

The activity function is using a single instance of HttpClient to call an API 100 times, one request at a time.

The activity function uses a CancellationToken which is passed onto the HttpClient requests but it does not appear to be cancelled when the JobHost is stopped. If the activity function exceeds its max runtime then the CancellationToken does appear to be cancelled and the function is gracefully aborted. The CancellationToken also does not appear to be cancelled when running in a fixed app service plan and the Azure Portal UI is used to stop the app service. I see there is an existing fixed bug 4251 for a similar issue but I'm not sure what release the fix is in.

2019-06-25T14:14:25.158 Stopping JobHost    "Category":"Microsoft.Azure.WebJobs.Hosting.JobHostService","HostInstanceId":"98c7727a-d36f-4788-9b43-f31dd4a6c519"                                 
2019-06-25T14:32:03.491 Starting JobHost    "Category":"Microsoft.Azure.WebJobs.Hosting.JobHostService","HostInstanceId":"1d604326-f138-421a-85d3-3042c17a1ef3"

Investigative information

  • Durable Functions extension version: 1.0.29
  • Function App version (1.0 or 2.0): 2.0
  • Programming language used: C#
  • Timeframe issue observed: 2019-06-25T14:14:25.158 UTC
  • Function App name: Starling-IdentityDataIngestion-dev
  • Function name(s): GraphOrchestratorFn & GraphActivityFn
  • Region: West US
  • Orchestration instance ID(s): 6c5f3c2bd674454c91976e60249540a1
azure-app-service bug

All 15 comments

@ian63

I can't offer much guidance but I'm pretty sure that a function app gets unloaded from memory after 20 minutes of inactivity on consumption plan, which sounds like what's happening here given that this doesn't happen on App Service Plan.

Maybe try adding a plain old cron trigger function to your existing app on a minute schedule and see if this keeps the app from being unloaded while you perform your Durable function workload. If this works successfully, then there could be a bug somewhere.

@olitomlinson Thank you, that looks like it could be the issue. I've tested running a simple function every 10 minutes while the Durable Function is active and for the first time the JobHost did not stop. I'll run a few more tests in the morning to confirm.

Previously the JobHost would be stopped even though it was actively processing an activity function and had been processing at least one every minute since the durable function workload was started.

Successful orchestration instance Id: b5f5b77c5a33419285379af85bb8adf0

@cgillum sounds like a bug?

Interesting. I think @olitomlinson's explanation is correct regarding the 20 minutes of inactivity. The queue messages are probably getting picked up so fast that the scale controller never notices that any messages enter the queue, and therefore thinks the app is idle. We'll need to think of a good way to account for this.

Regarding this:

the JobHost is stopped and all activity is suspended for between 10 minutes to an hour before the JobHost starts up with a different HostInstanceId and the durable function continues.

This long delay surprises me. Ideally it should resume immediately, or in 5 minutes at the longest (this is the default visibility timeout for queue messages). In host.json, if you _decrease_ the workItemQueueVisibilityTimeout to a smaller value, does that change how long it takes to recover?
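
For reference, a minimal sketch of where that setting would go in host.json. It assumes the Functions 2.0 host.json schema with Durable Functions 1.x, where the setting sits directly under durableTask. The one-minute value is purely illustrative and is not a recommendation from this thread.

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "workItemQueueVisibilityTimeout": "00:01:00"
    }
  }
}
```

If the setting is omitted, the 5-minute default mentioned above applies.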

I checked about the CancellationToken fix for https://github.com/Azure/azure-functions-host/issues/4251, and I was told that it's currently deploying as part of Azure Functions v2.0.125549. Most likely that will finish deploying everywhere by early next week. I'm wondering if that would fix the delay problem as well. Can you let me know if you still have this problem after your function app gets upgraded?

@olitomlinson @cgillum We've implemented a timer function to run every 10 minutes and since doing this we haven't seen the issue occur. The durable function workloads now run to completion without the JobHost stopping after 20 minutes.

Regarding the JobHost not starting back up for between 10 minutes and 1 hour: I'm wondering if the JobHost only started back up the next time a function was called. We don't have workItemQueueVisibilityTimeout defined in host.json, and the JobHost was regularly stopping for far longer than 5 minutes when it was partway through running a durable function workload.

Thank you for checking on the status of the CancellationToken fix. I'll try it out next week once it's deployed.

@cgillum This is interesting to me.

The queue messages are probably getting picked up so fast that the scale controller never notices that any messages enter the queue, and therefore thinks the app is idle. We'll need to think of a good way to account for this.

Could this be applicable to standard Azure Functions (vs Durable Functions) running on a consumption plan as well? I've been seeing similar behavior where my app is scaling in despite having service bus messages still coming through and being picked up by that instance. Role Instances are being abandoned mid processing.

If so, any thoughts so far on how to handle this?

@HobbsB yes, Service Bus queue triggers would suffer from the exact same issue. In the worst case, the impact would be that the app idles out as often as every 20 minutes, at which point it goes through a cold start as soon as the scale controller realizes that messages are sitting idle in the queue.

We have a work-item created on the App Service platform side to address this problem. It's not yet clear, however, when it will be fixed. If the added latency that occurs every 20 minutes is problematic, the workaround would be to add a timer-triggered function to your function app to keep the process warm.
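
A minimal sketch of such a keep-warm timer function in C#, assuming the Functions 2.0 in-process model and the TimerTrigger binding from Microsoft.Azure.WebJobs. The function name KeepWarmFn and the log message are hypothetical. The 10-minute schedule matches the interval reported earlier in this thread.

```csharp
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class KeepWarmFn
{
    // Hypothetical keep-warm function: it does no real work and exists only
    // so the app registers activity before the 20-minute idle timeout.
    [FunctionName("KeepWarmFn")]
    public static void Run(
        // Six-field NCRONTAB expression: second 0 of every 10th minute.
        [TimerTrigger("0 */10 * * * *")] TimerInfo timer,
        ILogger log)
    {
        log.LogInformation("Keep-warm tick at {Time}", DateTime.UtcNow);
    }
}
```

Any schedule shorter than the 20-minute idle window should serve the same purpose.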

@cgillum Thanks I appreciate it. Is there anything I can track for this issue to know when it gets resolved?

I'm not 100% sure this is the issue that we're running into with scaling in prematurely, but when that occurs a single message-triggered function has a chance of just stopping midway through its execution. Because of this, several times a day we end up being left with bad data.

We'll be switching to an App Service plan in the meantime to see if that prevents our problem for now. We're doing that instead because I think a timer trigger is a singleton; so I suppose a timer trigger will only keep 1 instance warm for a scaled out application that has multiple instances running; meaning other instances can still scale in prematurely.

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

@HobbsB I'll leave this issue open until a fix to Azure App Service is in place. In the meantime, your workaround to use an App Service plan sounds like a good one.

Hi, if I understand correctly, having a timer trigger that fires more often than the 20-minute idle time should keep the app alive? We have a durable function app running under a consumption-based hosting plan, with timers firing each minute, but the app still seems to stop. Is the only solution still to move to an App Service plan?

@cgillum has any progress been made to fix this bug?

Also seeing this bug.

We've seen this as well recently with a new function we've built.

Experiencing the same with a Function App on Windows, .NET Core 3.1, with a cron timer scheduled to run every 2 minutes, so there isn't any inactivity. Any news on this?
