Azure-functions-host: Question about CancellationTokens

Created on 9 Jun 2020  ·  6 Comments  ·  Source: Azure/azure-functions-host

Is your question related to a specific version? If so, please specify:

N/A

What language does your question apply to? (e.g. C#, JavaScript, Java, All)

C#

Question

My question is about the CancellationToken passed into C# functions: https://docs.microsoft.com/en-us/azure/azure-functions/functions-dotnet-class-library#cancellation-tokens

We have observed that occasionally, in the middle of running a function, the CancellationToken is signaled and causes all our async calls to stop. That is the behavior we expect once a CancellationToken is signaled, but I have a more general question about what causes it to be signaled in the first place. The documentation leaves me with only a vague understanding of the topic.

It seems like it can happen any time the operating system says "it is time to shut down what we are working on". But I have no idea under which conditions this can and will happen when hosting functions within Azure. Does this ever happen because functions running on the consumption plan need to "scale out" under the hood? If not, what exactly (in the context of hosting within Azure on the consumption plan) causes this to happen?

Furthermore, should we be passing the CancellationToken along to async calls? I feel like the answer is yes, but maybe this is our mistake in the first place. Maybe we aren't supposed to pass the CancellationToken along at all, but that would feel very strange.
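To make the question concrete, here is a minimal sketch of what "passing the token along" usually looks like. It uses only the .NET base class library; `TokenFlow`, `SendAllAsync`, and the `sendAsync` delegate are hypothetical names, with the delegate standing in for any cancellable async call (such as a queue send).

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class TokenFlow
{
    // Sends items one at a time and returns how many were sent.
    // sendAsync is a hypothetical stand-in for any cancellable async call.
    public static async Task<int> SendAllAsync(
        IEnumerable<string> items,
        Func<string, CancellationToken, Task> sendAsync,
        CancellationToken cancellationToken)
    {
        var sent = 0;
        try
        {
            foreach (var item in items)
            {
                // Stop between items once cancellation has been requested.
                cancellationToken.ThrowIfCancellationRequested();

                // Forward the same token so an in-flight call can be abandoned too.
                await sendAsync(item, cancellationToken);
                sent++;
            }
        }
        catch (OperationCanceledException)
        {
            // TaskCanceledException derives from OperationCanceledException,
            // so both land here. Log partial progress, then rethrow so the
            // caller still sees the work as cancelled.
            Console.WriteLine($"Cancelled after {sent} item(s).");
            throw;
        }

        return sent;
    }
}
```

With this shape, a signaled token surfaces as an OperationCanceledException (or the derived TaskCanceledException) from whichever call was in flight, which matches the exceptions described in this issue.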

For a little more context on how we came across this issue/question in the first place:

  • We have a TimerTrigger function that runs once a day which queues tens of thousands of messages onto an Azure Storage Queue using the Azure.Storage.Queues nuget package.
  • We have a QueueTrigger function that picks up messages from the queue and makes calls to an HTTP dependency.

What we observed is that our TimerTrigger function was failing often after a couple thousand messages, with exceptions along the lines of TaskCanceledException. My first guess was that the function was timing out after 10 minutes, but it usually happens only a minute into execution.

My working theory is that after queuing messages begins, the QueueTrigger function begins to process them with high levels of parallelization. However, after a certain threshold the underlying infrastructure (i.e., the consumption plan VMs) determines that in order to handle the load it should scale out. If this is the case, maybe the functions that were started on the original machine (i.e., the TimerTrigger function) are signaled to wrap up what they are doing via the CancellationToken since the VM is being freed up/abandoned.

I realize that the internal workings of Azure are not something that are typically discussed in regards to how functions are scaled, etc. But I would really like to know what exactly is causing my TimerTrigger's CancellationToken to be signaled.

For anyone who runs across this issue and wonders how we ended up solving it (at least in the short term): We decided to take our logic in the TimerTrigger and refactor it into a durable function. So basically the TimerTrigger simply kicks off an OrchestrationTrigger and we have an ActivityTrigger that queues a much smaller number of messages at a time. We call that ActivityTrigger from the OrchestrationTrigger with a retry policy, which allows the process to complete gracefully even when a CancellationToken is signaled in the ActivityTrigger function (which does happen about 10-15% of the time, from my initial testing).
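Roughly, a refactor of this kind could look like the sketch below. It assumes the in-process C# model with the Durable Functions 2.x extension (the Microsoft.Azure.WebJobs.Extensions.DurableTask package). The function names, schedule, batch count, and retry settings are all made up for illustration, and the activity body is a placeholder rather than the actual queuing code.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class EnqueueWorkflow
{
    // Timer trigger: only starts the orchestration (schedule is an assumption).
    [FunctionName("DailyTimer")]
    public static async Task StartAsync(
        [TimerTrigger("0 0 0 * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client)
    {
        await client.StartNewAsync("EnqueueOrchestrator");
    }

    // Orchestrator: calls the activity once per batch, with a retry policy.
    [FunctionName("EnqueueOrchestrator")]
    public static async Task RunAsync(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Assumed values: first retry after 5 seconds, at most 3 attempts.
        var retry = new RetryOptions(TimeSpan.FromSeconds(5), 3);

        const int batchCount = 100; // hypothetical number of batches
        for (var batch = 0; batch < batchCount; batch++)
        {
            // If the activity fails (for example, because it was cancelled),
            // the retry policy runs it again.
            await context.CallActivityWithRetryAsync("EnqueueBatch", retry, batch);
        }
    }

    // Activity: would queue one small batch of messages.
    [FunctionName("EnqueueBatch")]
    public static Task EnqueueBatchAsync(
        [ActivityTrigger] int batch,
        ILogger log)
    {
        // Placeholder: the real code would send this batch to the queue.
        log.LogInformation("Queuing batch {Batch}", batch);
        return Task.CompletedTask;
    }
}
```

Because a retried activity runs again from the start, the batch it queues may be sent more than once, so the consumer should tolerate duplicate messages.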

All 6 comments

@mhoeger Bump! Any guidance would be very helpful.

Hey @APIWT - sorry for the delay! Cancellation typically happens when an instance stops or is recycled - this could happen because of scale-in (we used to need 5 instances but things are looking more stable so we're going to scale down to 4) or because app content/settings are changed or because of some other exception / getting to an unhealthy state. I'm kinda surprised to hear your situation about scale-out causing cancellations? Would be good to understand your specific scenario I think...
Can you share your function app name, either directly or indirectly? This will help us investigate. Thanks!

Hey @mhoeger! Thanks for getting back to me.
The following invocation ID should make it easy to investigate: cd1a1a77-7c68-4d1c-9557-cb15f7d2dce4
The UTC timestamp is: 2020-07-07 07:04:39.198

Notice the exception: The I/O operation has been aborted because of either a thread exit or an application request. (if you look in the stack trace you will see a TaskCanceledException)

The function is a durable function activity now, so it tries again and succeeds shortly after. Still would be nice to understand why it happens though :)

Feel free to reach out if you have more questions.

Hey @APIWT! In your case, it looks like you're running into cancellations of executions because instances are getting into an unhealthy state and then restarting (which cancels executions). From at least one instance and some of our diagnostic dashboards, it looks like your biggest issue is that your app is creating too many outbound connections. You can see the same dashboards I'm looking at by going to the Azure portal, navigating to your function app, selecting Diagnose and solve problems from the left, and finding relevant dashboards (in this case, I think it's under "Functions Host Monitor Thresholds").

Here's some documentation around our best practices for managing the number of outbound calls per instance. This should at least mitigate the number of cancellations you see, but I think there's still work to be done to document when cancellation occurs. And please let me know if the mitigation doesn't work!
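One widely recommended way to limit outbound connections per instance is to reuse a single static client instead of creating a new one on every invocation. A minimal sketch using only the .NET base class library follows; `SharedClients` and `GetStatusAsync` are hypothetical names, and the same idea would apply to other SDK clients such as a storage queue client.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class SharedClients
{
    // One HttpClient for the lifetime of the process. Every invocation
    // reuses it, so connections are pooled rather than reopened each time.
    public static readonly HttpClient Http = new HttpClient();

    // Hypothetical helper: the caller supplies the URL and the function's token.
    public static async Task<int> GetStatusAsync(
        string url,
        CancellationToken cancellationToken)
    {
        using (var response = await Http.GetAsync(url, cancellationToken))
        {
            return (int)response.StatusCode;
        }
    }
}
```

A function body would then call `SharedClients.GetStatusAsync(...)` rather than constructing its own `HttpClient`.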

@mhoeger Ugh, another one bites the dust related to this issue: https://github.com/Azure/azure-functions-host/issues/5098

We needed to work around scoped services not working properly by newing up a new ServiceCollection and building a provider for every function invocation. Unfortunately, that means our storage queue singleton is not truly a singleton which causes it to be created over and over every invocation.

@mhoeger I opened a somewhat related issue here if you could take a look: https://github.com/Azure/azure-functions-host/issues/6325
