Azure-functions-durable-extension: Orchestration instance stuck in pending status - Functions v1

Created on 22 Jun 2018 · 19Comments · Source: Azure/azure-functions-durable-extension

The issue referenced here is still occurring. Opening new issue per @cgillum request. I can supply any further environment/application information as needed. Thank you.

pluralmonad

Most helpful comment

Multiple function apps using the same storage account is supposed to work as long as each function app has a unique taskHub name configured in host.json (NOTE: make sure the correct schema is being used - it changed from Functions 1.x to Functions 2.x). This is because all Azure Storage artifacts (queues, tables, blobs) are named using the task hub name.

The problem is that by default all function apps have the same task hub name, so there will be automatic conflicts if two of them use the same storage account. I would love to change this behavior, but unfortunately it's a breaking change, so for now we have to rely on guidance and samples.

@KevinDJones are you able to reproduce your issue when running in Azure? We have telemetry that would help me diagnose what's going on if you can share some information about the stuck instance (instance ID, timestamp, and region is good enough).

cgillum on 13 Dec 2018

👍2
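To illustrate the schema difference mentioned above, here is a minimal sketch that renders the task hub setting for each Functions version. The key names follow the host.json formats as I understand them (`durableTask.HubName` for 1.x, `extensions.durableTask.hubName` for 2.x); the helper `durable_host_json` and the hub name `OrdersHub` are hypothetical and only for illustration.

```python
import json


def durable_host_json(task_hub: str, functions_major_version: int) -> str:
    """Return host.json text that sets a Durable Functions task hub name.

    Hypothetical helper for illustration; not part of any SDK.
    """
    if functions_major_version == 1:
        # Functions 1.x: top-level "durableTask" section, "HubName" key.
        config = {"durableTask": {"HubName": task_hub}}
    elif functions_major_version == 2:
        # Functions 2.x: section moves under "extensions", key is "hubName".
        config = {
            "version": "2.0",
            "extensions": {"durableTask": {"hubName": task_hub}},
        }
    else:
        raise ValueError("only Functions 1.x and 2.x are sketched here")
    return json.dumps(config, indent=2)
```

Calling `durable_host_json("OrdersHub", 1)` and `durable_host_json("OrdersHub", 2)` shows the two layouts side by side. Using the 1.x layout in a 2.x app (or the reverse) would presumably leave the app on the default hub name, which is the conflict described above.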

All 19 comments

I'm seeing the exact same behavior on functions v1.

jbrailsford on 26 Jun 2018

@AviateX14 Thanks. Are you running the v1.5.0 version of the extension? If so, can you share your Azure region, a general timeframe, and an orchestration instance ID that I can refer to?

@pluralmonad do you have a repro using the v1.5.0 version of the extension? I'd like to investigate your case as well. The latest version of the extension has some improved instrumentation that will make it much easier for me to figure out what's going on.

cgillum on 26 Jun 2018

Hi, yes, 1.5.0:

Europe West
2018-06-26T17:42:28.975Z
8e4c3d8acdfa41dd8bf2c4fe4d74fea0 (execution ID)

It's worth noting that it intermittently works, and then hangs again. For example now, it has moved to the running state, but hasn't been updated for over 5 minutes - it's an eternal function that runs every 1 minute.

jbrailsford on 26 Jun 2018

Thanks @AviateX14. One thing I noticed is that you seem to have several messages in your control queues that are "stuck" because they are intended for a function named tpgw-worker, which apparently doesn't exist. Did you make changes to your function app in the last 6 or 7 days, such as renaming some of your orchestrator functions? I'm not sure if this is why your instance is getting stuck, but these errors seem to be slowing down the overall processing.

cgillum on 26 Jun 2018

@AviateX14 In fact, it looks like there may be as many as 500 different "start" messages that your function app is repeatedly trying to dequeue. If possible, can you try clearing your queues to see if that resolves the issue? Note that this will permanently suspend any existing orchestration instances in this task hub (though you can always recreate them, which it looks like you've been doing already).

cgillum on 26 Jun 2018
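For anyone wanting to script the queue clearing suggested above, the following is a rough sketch rather than an official procedure. It assumes the task hub's queues are named with the lowercased hub name followed by a hyphen (for example `<taskhub>-control-00` and `<taskhub>-workitems`), and that `queue_service` behaves like `QueueServiceClient` from the `azure-storage-queue` Python package (`list_queues`, `get_queue_client`, `clear_messages`). The function name `clear_task_hub_queues` is hypothetical.

```python
def clear_task_hub_queues(queue_service, task_hub: str) -> list:
    """Clear every queue whose name starts with '<taskhub>-'.

    Returns the names of the queues that were cleared.
    WARNING: in-flight orchestration instances in this task hub lose
    their pending messages and will not make further progress.
    """
    # Assumption: queue names use the lowercased hub name plus a hyphen.
    # The trailing hyphen keeps a hub named "Orders" from matching
    # queues that belong to a hub named "Orders2".
    prefix = f"{task_hub.lower()}-"
    cleared = []
    for props in queue_service.list_queues(name_starts_with=prefix):
        queue_service.get_queue_client(props.name).clear_messages()
        cleared.append(props.name)
    return cleared
```

With the real SDK, `queue_service` would typically come from `QueueServiceClient.from_connection_string(...)` using the function app's storage connection string. Stopping the function app first is probably wise, so that nothing is dequeued or enqueued while the queues are being cleared.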

@cgillum Right on the money. Clearing the queues seems to have fixed the problem. It'd be interesting to know what's happening under the hood here. When I renamed the functions, the previous versions that were listed under the functions instance were removed, and I would've expected that to clear up any other references. There was a brief window where the orchestrator was trying to create an instance of a function by its old (at the time non-existent) name; I wouldn't have expected this to enter anything into a queue. Is there any documentation on what is happening behind the scenes when a durable instance is created/destroyed?

jbrailsford on 26 Jun 2018

@AviateX14 Glad that resolved the issue for you! Normally there is validation ensuring that the named function actually exists before we try to enqueue any messages. Is it possible that in your case the function name existed at one point, a bunch of messages were added to the queue, and _later_ the function name was changed? We don't have a good way of catching this, unfortunately. Probably what we need is some form of poison message handling so that these messages can eventually be removed from your queues automatically.

As far as documentation goes, this behavior is actually pretty low-level in the Durable Task Framework. There isn't a lot of documentation on how this works under the hood yet, but it's something we'll need to consider adding.

@pluralmonad Let me know if you have a recent repro using 1.5.0 so I can investigate yours as well (I believe the issue you were running into was different).

cgillum on 26 Jun 2018

We recently had a similar issue where our Durable Functions were taking 30 minutes to complete, up from 30 seconds. We had a bunch of messages in the queues with dequeue counts around 2,000. Once we cleared the queues, performance returned to 30 seconds or under. Is there no way to set a dequeue limit to purge bad messages to a poison queue?

jfelner on 7 Aug 2018
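The dequeue-limit idea raised above can be sketched as a simple threshold check. This is only an illustration of the concept, not something the extension is known to do: the `QueuedMessage` type, the `split_poison` function, and the default limit of 5 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class QueuedMessage:
    """Hypothetical stand-in for a storage queue message."""
    message_id: str
    dequeue_count: int


def split_poison(
    messages: Iterable[QueuedMessage], max_dequeue_count: int = 5
) -> Tuple[List[QueuedMessage], List[QueuedMessage]]:
    """Partition messages into (healthy, poison) by dequeue count.

    A message is treated as poison once it has been dequeued more
    than max_dequeue_count times without being deleted.
    """
    healthy: List[QueuedMessage] = []
    poison: List[QueuedMessage] = []
    for message in messages:
        if message.dequeue_count > max_dequeue_count:
            poison.append(message)
        else:
            healthy.append(message)
    return healthy, poison
```

A message with a dequeue count around 2,000, as reported above, would land in the poison list under any reasonable limit. A real implementation would then move such messages to a separate queue instead of retrying them indefinitely.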

Some big reliability improvements were done in the recent v1.6.0 release which should hopefully address scenarios like this. Please let us know if you continue to run into cases where instances get stuck using the latest extension version.

cgillum on 30 Aug 2018

Still seeing this with v1.6.2; the dequeue count is increasing steadily.
Europe West
2018-10-22T13:14:51.0311600Z
17b9c59c959a4c2e9522f97d009ad74d (execution ID)

thllbrg on 22 Oct 2018

@cgillum This is happening to us as well, US East, v1.6.2. Orchestration instances get stuck in Pending for a long time, and nothing appears to actually be running. Then after a long time (e.g. ~30 minutes) an instance will get "unstuck", become Failed, and there will often be a second instance from the same time period that's Completed. For a while these processes worked fine, then we started seeing more and more of these issues -- we haven't been able to identify changes on our end that correlate to the significant increase in stuck orchestrations.

These issues have completely blown up some critical processes for us and are costing us goodwill with customers.

US East
2018-10-29T13:39:07.694Z
235d8176a603499eadb29e6eec6cc9c7

rogersmj on 29 Oct 2018

Hi @rogersmj. Your symptoms sound like something that can happen if multiple function apps are running on the same storage account. Can you confirm whether this might be the case?

Also, any warnings or errors coming from Application Insights?

cgillum on 29 Oct 2018

@cgillum can confirm that my problems went away when using separate storage accounts.

thllbrg on 29 Oct 2018

👍1

Your symptoms sound like something that can happen if multiple function apps are running on the same storage account. Can you confirm whether this might be the case?

Yes in fact we do have multiple function apps...we somehow missed this was not a good idea to have them using the same storage account. I will try separating them.

rogersmj on 29 Oct 2018

So glad I found this, I had the same issue as @rogersmj. Completely missed function apps need their own storage accounts. Makes sense once you understand the mechanisms behind it.

brainded on 1 Nov 2018

@cgillum Is it still the case that multiple function apps should not be running on the same storage account? Even if they have different Hub names?

I've noticed my function apps often get stuck when connected to a storage account my team and I use (locally), but I never run into it with my personal storage account.

KevinDJones on 12 Dec 2018

cgillum on 13 Dec 2018

👍2
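The shared-storage conflict discussed in this thread can be checked mechanically. The sketch below flags function apps that would end up on the same storage account and the same task hub name. It assumes that an app with no configured hub name falls back to `DurableFunctionsHub` (the 1.x default as far as I know) and that names should be compared case-insensitively; the function `find_task_hub_conflicts` and the app names in the usage note are hypothetical.

```python
from collections import defaultdict

# Assumption: the hub name used when host.json does not set one.
DEFAULT_TASK_HUB = "DurableFunctionsHub"


def find_task_hub_conflicts(apps):
    """Find function apps sharing a storage account and task hub name.

    apps maps an app name to a (storage_account, task_hub) pair, where
    task_hub is None if the app does not configure one.
    Returns {(account, hub): [app names]} for every colliding pair.
    """
    by_key = defaultdict(list)
    for app_name, (account, hub) in apps.items():
        # Compare case-insensitively (an assumption, see lead-in).
        key = (account.lower(), (hub or DEFAULT_TASK_HUB).lower())
        by_key[key].append(app_name)
    return {key: sorted(names) for key, names in by_key.items() if len(names) > 1}
```

For example, two apps on storage account `teamstore` that both leave the hub name unset would be reported as a conflict, while giving each a distinct hub name (or its own storage account) yields an empty result.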

I have a similar issue but a lot weirder, I'm using v1.4.1 though.
Region: Central US
Instance ID: 1ad213217d434297930ed064688d85cf
Approximate timestamp: 2018-12-12T09:42:00

My logs indicate that calling the DurableOrchestrationClientBase.StartNewAsync method did not return an instance ID until it had completed the entire orchestration. After the instance ID was returned, checking the status of the orchestration always returned "Pending" even though all activity functions were called and completed successfully. Unfortunately, our logic doesn't count the request as finished until the status of the orchestration is "Completed", "Canceled", "Terminated", or "Failed".
This instance was stuck on Pending for two hours then all of a sudden started an entirely new orchestration that returned an instance ID "23ef921805774f938f0f86bf8efef81b" normally before the entire orchestration was complete. This new orchestration instance returned a "Completed" status correctly.

I don't see a spike in the amount of messages queued during that time. And I don't have multiple function apps using the same storage account.
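A caller that waits for one of the four terminal statuses named above may want a timeout, so that an instance stuck in "Pending" surfaces as an error instead of blocking for hours. The following is a minimal sketch of such a wait loop. `get_status` is a hypothetical callable standing in for whatever status query the application uses, and the timeout and poll interval defaults are arbitrary.

```python
import time

# Terminal statuses as listed in the comment above.
TERMINAL_STATUSES = {"Completed", "Canceled", "Terminated", "Failed"}


def wait_for_terminal_status(
    get_status,
    timeout_s=600.0,
    poll_interval_s=5.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll get_status() until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    sleep and clock are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout_s} s")
        sleep(poll_interval_s)
```

On timeout the caller can log the instance ID and decide whether to terminate and restart the orchestration, which is roughly what happened by itself after two hours in the case described above.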