Describe the bug
I have around 15k orchestration instances running daily, with sub-orchestrators and many activity functions, and they were running very fast on sdkVersion=2.0.12382. My function host had FUNCTIONS_EXTENSION_VERSION=~2, so it automatically picked up the latest version of the runtime.
I can see in App Insights that the execution time increased significantly when the runtime moved to 2.0.12427, and it is still slow on the latest version, 2.0.12438.
Investigative information
If deployed to Azure
I have seen that the control queues grow very large with the later version of the runtime and are not evenly distributed. They also take a very long time to clear. This does not happen with the earlier version of the runtime.
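For anyone who wants to check this on their own task hub, here is a minimal sketch (not from this thread) using the classic Microsoft.WindowsAzure.Storage SDK to print the approximate depth of each control queue. It assumes the default control queue naming convention ("{taskhub}-control-NN") and 16 partitions; the task hub name and connection string are placeholders you would replace with your own values:

```csharp
// Sketch: print approximate message counts for the Durable Functions control queues.
// Assumes the default "{taskhub}-control-NN" naming and 16 partitions; adjust as needed.
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

class ControlQueueCheck
{
    static async Task Main()
    {
        string connectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage");
        string taskHub = "myhub"; // hypothetical task hub name

        CloudQueueClient queueClient =
            CloudStorageAccount.Parse(connectionString).CreateCloudQueueClient();

        for (int i = 0; i < 16; i++)
        {
            CloudQueue queue = queueClient.GetQueueReference($"{taskHub.ToLowerInvariant()}-control-{i:00}");
            if (await queue.ExistsAsync())
            {
                // ApproximateMessageCount is only populated after FetchAttributesAsync.
                await queue.FetchAttributesAsync();
                Console.WriteLine($"{queue.Name}: ~{queue.ApproximateMessageCount ?? 0} messages");
            }
        }
    }
}
```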
In both cases, you were using the same version of Durable Functions, right? So this is more likely a regression in the Azure Functions runtime and not Durable Functions itself?
That's correct. Same version of Durable Functions and same codebase. The only thing that's different is the FUNCTIONS_EXTENSION_VERSION, so it does point to an issue with the runtime.
I too am currently trying to get to the bottom of what appears to be a performance regression with my DF, but I've not yet fully ruled out regressions in my own dependencies, so I can't say for sure that orchestration performance has regressed.
I'm on ~2, UK South.
I have also observed very unbalanced control queues: thousands of messages on 3-4 queues and nothing on the remaining 12-13 queues, despite all my orchestration instances executing exactly the same code path and hitting the same dependent external services in the activities. I'm using a random GUID as the orchestration ID, so this should give a uniform distribution of load across the control queues (roughly as in the sketch below).
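For illustration, a minimal sketch of starting instances with a random GUID instance ID against the Durable Functions 1.x C# API; the orchestrator name, HTTP trigger, and payload here are hypothetical placeholders, not code from this thread:

```csharp
// Sketch (Durable Functions 1.x): start each orchestration with a random GUID
// instance ID so instances should be distributed evenly across the control queues.
// "ParentOrchestrator" and the input payload are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class HttpStart
{
    [FunctionName("HttpStart")]
    public static async Task<HttpResponseMessage> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req,
        [OrchestrationClient] DurableOrchestrationClient starter)
    {
        // Random GUID as the orchestration instance ID.
        string instanceId = Guid.NewGuid().ToString("N");

        string input = await req.Content.ReadAsStringAsync();
        await starter.StartNewAsync("ParentOrchestrator", instanceId, input);

        return starter.CreateCheckStatusResponse(req, instanceId);
    }
}
```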
My findings corroborate the slowdown you are experiencing. https://github.com/Azure/azure-functions-durable-extension/issues/779#issuecomment-496657372
2.0.12408 is the last known good version for me.
Just a quick update. Since forcing my function app to use version 2.0.12382 (for nearly 2 weeks), I have not experienced the slowdown that I was seeing every day with the later runtime versions.
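For reference, the pinning here is just a matter of setting the runtime version app setting on the function app to an explicit build instead of ~2; the value shown below is simply the last version that behaved well for me:

```
FUNCTIONS_EXTENSION_VERSION = 2.0.12382
```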
@adobedoe, @olitomlinson the Functions runtime team informed me that some Application Insights changes were made which may impact the performance of HTTP triggers and potentially outbound HTTP calls as well. More details here: https://github.com/Azure/azure-functions-host/releases/tag/v2.0.12408
Can you try the suggested workaround, which is to disable the new tracking functionality and see if that improves your performance?
```json
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "httpAutoCollectionOptions": {
        "enableHttpTriggerExtendedInfoCollection": false
      }
    }
  }
}
```
@cgillum changing the setting as you prescribed has made a huge difference!
Posted the results here: https://github.com/Azure/azure-functions-durable-extension/issues/779#issuecomment-499651847
@olitomlinson thanks for confirming!
@adobedoe are you able to confirm as well?
@brettsam FYI, I wonder if we need to reconsider the defaults here if it's having such a noticeable impact on performance.
@cgillum you're very welcome, and thank you for getting to the bottom of it.
It was a very nervous time for me because our use-case relies on predictable throughput so we can plan to meet our Customer SLAs.
If this Runtime upgrade turns out to be the root cause of the regression, will there be actions to understand how DF can better protect itself from regressions in the Host Runtime?
I'm not against pinning the version of the Host Runtime I integrate against, and I should probably do that as my own act of due diligence - lesson learned for me.
But I believe in aspirations, so ideally I would not pin the Runtime version and could instead have confidence that a minor/patch runtime upgrade would not affect performance as drastically as it did in this example.
@cgillum - I've updated the host.json as described and, using the latest runtime (2.0.12493), the throughput is much faster, so I'm pretty confident that this is the cause.
For me, this caused a huge degradation in performance which probably would've prevented me from using DF if this had been a PoC rather than something I'd had running in production for a while. So I think it would be good to default enableHttpTriggerExtendedInfoCollection to false in the runtime, at least in the short term.
Thanks for resolving this issue.
@adobedoe thanks for confirming. I agree that this is a pretty serious problem. Unfortunately there is nothing Durable Functions can do to make it better since Durable Functions is just an extension. I'm going to transfer this issue to the Azure Functions team for a resolution.
@lmolkova -- is there anything tracking this issue on the App Insights side?
@brettsam we are tracking this in our internal VSO backlog. I also created a GitHub issue: https://github.com/microsoft/ApplicationInsights-aspnetcore/issues/900
Has anyone else hit this recently? I'm trying to track down repros so we can check our mitigations. Thanks in advance.
I think this could be closed now as it was resolved AFAIK?
Agreed.