Azure-functions-durable-extension: Durable functions not scaling as expected

Created on 10 Nov 2020 · 23Comments · Source: Azure/azure-functions-durable-extension

Description

We're using Python Durable functions in an Elastic Premium function app plan.

Even after dropping our maxConcurrentActivityFunctions setting down to 1, our function app never scales to more than 1 or 2 VMs.

I've seen older posts around the internet saying that Python Durable functions are in preview, and only support scaling to up to 2 VMs. These posts are old though and I'm having trouble confirming if this is still the case.

Are Python Durable functions still in preview? Do they only support scaling to up to 2 VMs?

Needs

Source

marcd123

All 23 comments

Elastic Premium on Linux allows you to scale out to as many as 20 VMs currently. Can you share more info about your app so we can investigate? The name of your app is the most helpful, but otherwise the instance IDs of one of your started orchestration instances is also enough to help us find your app.

cgillum on 10 Nov 2020

@cgillum I'm reverting our maxConcurrentActivityFunctions to 1 and will run a test. I'll send you an orchestration instance ID right after.

marcd123 on 10 Nov 2020

App Information in the meantime:

Function Runtime: 3.0
Function Core Tools: 3.0.2931
Python Version: 3.6
Service Plan: EP2

Using custom Linux docker container for extra system dependencies.

FUNCTIONS_WORKER_PROCESS_COUNT: 6
maxConcurrentActivityFunctions: 1
maxConcurrentOrchestrationFunctions: 1

marcd123 on 10 Nov 2020

@cgillum Here is our most recent orchestration instance ID

58e77e51a7f748828c0c820a3d786ff4

This was taken from the PARTITIONKEY column of the DurableFunctionsHubHistory table. We're using Live Metrics from App Insight and it's only reporting one server running.

marcd123 on 10 Nov 2020

@marcd123, unfortunately at this time we don't have internal telemetry for orchestration instance ids running on Linux, which is our main blocker for GA for Durable Functions on Python.

Could you share your application name with us? If you are uncomfortable with that, you can share it privately.

ConnorMcMahon on 10 Nov 2020

@ConnorMcMahon The app name is labelright-dev in US East

marcd123 on 10 Nov 2020

@marcd123

So the first thing that I noticed is that you using v1 of the extension bundles. I believe that at the time that we GA Durable Python we will drop support for v1 of extension bundles and only support v2 of the extension bundles (@davidmrdavid can you confirm this?). You can read more about extension bundles here. It should provide various reliability/performance improvements, so I highly recommend this transition even if it isn't required. Note that this will change the default task hub name, so if you want to continue listening to your already in-flight orchestrations, you will need to set your "hubName" in the host.json to the Durable 1.x default of "DurableFunctionsHub".

As for why you had scaling issues, from what I can tell, your application almost always had empty control and activity queues, meaning all work that could be scheduled already was, so scaling out would not be helpful. There were a few caveats to that, where occasionally we would see a few work items that had been in the queue for ~300 seconds (or some multiple of 300 seconds). To me, that indicates that the function app that had fetched those messages shutdown for some reason. When we grab a queue message to schedule work on it, it becomes invisible for 5 minutes, so if something happens to the worker that has that work scheduled, the message will be retried in about 5 minutes. Unfortunately, I don't have the telemetry to back this up at this time, though we will hopefully be in a state where we can get this in the near future.

As for maximizing performance, with a maxConcurrentActivityFunctions and maxConcurrentOrchestrationFunctions set to 1, you will use at most 2 concurrent functions at once, and you have FUNCTIONS_WORKER_PROCESS_COUNT set to 6, so you are not fully utilizing the workers that you do have. Consider increasing your concurrency counts to make full utilization of those worker processes. Also, if your activity functions are IO heavy and not as heavy on CPU, consider making them async, as that can increase your concurrency as well.

We can continue diving in to see if we can verify the hypothesis I mentioned above as to why you are seeing some delays processing your messages.

ConnorMcMahon on 11 Nov 2020

👍1

Thanks for the suggestions @ConnorMcMahon.

About bundles, you're right: after GA we will only support extension bundles v2 and above

davidmrdavid on 11 Nov 2020

Thanks for your recommendations!

My hope is setting concurrency of 3 activity and 3 orchestrators max will make full use of our 6 function workers. We'll update our extension bundles as well.

One thing I'm having trouble understanding about the queues is when control and work items are dequeued, and how many are dequeued.

Lets say that my function app has these limits:
maxConcurrentActivityFunctions: 3
maxConcurrentOrchestrationFunctions: 3

Does that mean a single instance would only dequeue work/control items until those limits are reached? If not, is it possible to set how many items are dequeued and when?

My goal is to have each instance only dequeue as many items as it can process in parallel. This forces work/control items that aren't running to sit in the queue, getting older, and possibly causing other VMs to start up.

marcd123 on 11 Nov 2020

Side note this is what we updated our host.json to

marcd123 on 11 Nov 2020

👍1

You host.json looks right to me! As for the remaining questions, I'll defer to @ConnorMcMahon

davidmrdavid on 11 Nov 2020

@marcd123

Unfortunately this isn't quite as straightforward as "only grabbing the messages that it can process in parallel". The main reason for this is that all orchestration messages are put on a single queue per partition. The default partition count is 4, so with your host.json settings, each control queue has all of the pending messages from 1/4 of the orchestrations. When you are only scaled to one worker, that worker is listening to the control queues for all 4 partitions.

Currently, we will prefetch control queue messages up until the value for controlQueueBufferThreshold (see host.json docs). The default is 256, which is honestly a far better default for C# than it is for a language like Python where concurrency is a bit more limited. Note that this means we will store up to 256 control queue messages on the worker before we stop polling. These control queue messages can all belong to one orchestration, to 256 separate orchestrations , or most likely somewhere in between.

My guess is that 256 is far too aggressive for the level of concurrency a single worker on your application is able to process. I would try greatly decreasing it to see if your app will more aggressively scale. If your orchestrations tend to be like the function chaining example, where each orchestration is waiting on at most one activity to finish at a time, then keeping the controlQueueBufferThreshold closer to your value for maxConcurrentOrchestratorFunctions is a good idea, as each queue message likely belongs to a unique orchestration.

If you are doing lots of fan-in/fan-out, where your app is waiting for many activities at once, then I would caution about lowering your controlQueueBufferThreshold too low. The reason is that if you poll a bunch of control queue messages, they likely represent only a handful of orchestrations, so reducing your buffer threshold will make each orchestration replay process less messages, without necessarily reducing the number of orchestration instances that you are holding messages for in memory.

I realize that this is a lot of configuration to tweak to maximize performance. One of our highest priorities over the next several months is to make it so that we can more dynamically poll queues and set concurrency based on your apps needs so that this kind of micromanagement is not required.

ConnorMcMahon on 12 Nov 2020

This is wonderful information Connor, thank you!

I will reduce our controlQueueBufferThreshold to be closer to our maxConcurrentOrchestratorFunctions.

Is there a similar buffer threshold or batch size for activity functions? I don't see an option for that in the host.json docs.

I'd appreciate any other changes you think we should make to get more aggressive scaling, I'll test with the setting below and report back

App Settings:

FUNCTIONS_WORKER_PROCESS_COUNT: 6

host.json Settings:

maxConcurrentActivityFunctions: 3
maxConcurrentOrchestratorFunctions: 3
controlQueueBufferThreshold: 3
controlQueueBatchSize: 3
partitionCount: 4
controlQueueVisibilityTimeout: 00:01:00
workItemQueueVisibilityTimeout: 00:01:00

marcd123 on 12 Nov 2020

@marcd123,

On a bright note, at least for your use case, we don't prefetch activity messages, so that should not be too much of an issue in regards to scaling.

One big knob for controlling scaling is partition count. Each partition can be handled by at most one worker at a time, so if your app is already at 1 partition per worker (4 workers in your case), it won't scale your app up for control queue slow downs, because those new workers would be unable to help with the control queue backlog. After that point, we will only scale your app up if we detect a work item queue back log, as any worker can process messages off of the activity queue.

In general, I think the best approach we should take is to maximize your overall scalability, rather than making sure you get as many workers as possible. If you still aren't seeing the total scaled out performance that you want/hope from your app, it may be worth checking if there are any lower-hanging fruit to increase performance, rather than tweaking settings to get a few more workers assigned to your app. Especially on the premium plan, we want to make sure you are getting the best bang-for-your-buck on each worker.

Do you mind giving us a bit more details about your app. Giving us the following information could help us give you recommendations that can have surprisingly large performance gains:

Which of the Durable patterns are you implementing for your orchestrations?
Roughly how many activities/suborchestrators/timers are your orchestrations calling?
What sort of work is your activity function doing? More CPU intensive work, or more IO and external HTTP calls?
What sort of total throughput are you hoping for, in terms of orchestrations and activities per second?

ConnorMcMahon on 12 Nov 2020

@ConnorMcMahon When you use the term "worker" above are you referring to a single VM/instance or a function worker process?

Our application does image processing and machine learning for quality assurance automation. The back-end is linked to our quality assurance UI through SignalR, activity functions are returning relevant information to the UI through a web socket.

At max, a single user can be generating 5 orchestrators (1 parent and 4 sub). Those sub orchestrations chain four activity functions each.

We're doing a combination of function chaining and fan-out/fan-in. For example:

Export a large PNG from a PDF file
Use tensorflow model to detect specific parts of the PNG
Fan out up to four sub orchestrations for the specific parts of the PNG
Sub orchestrations using OpenCV, OCR, and other tools for image analysis
Fan in sub orchestrations results and consolidate into one file
Upload the file

I would say our application is a mix of CPU and IO work. The exporting of images from PDF and OpenCV image analysis can be CPU intensive. We also make external HTTP calls to store/download files in blob storage. There are also external HTTP calls to an OCR API.

marcd123 on 13 Nov 2020

Most activity functions begin or end with an HTTP request to Azure blob to download a JSON or PNG file.

If we are not using async calls, are these HTTP requests blocking other functions running on the same instance? Even if we have multiple FUNCTIONS_WORKER_PROCESSes

marcd123 on 13 Nov 2020

@marcd123,

Thanks for the information. To clarify, when I was using the terminology "Workers" before, I was referring to VM instances. I realize that can be confusing, so from now on I will use "instances" to refer to orchestration instances, "worker processes" to refer to the language workers we spin up for Python (and Node for that matter), and "VMs" to talk about VM instances.

One thing I want to note is that I am not an expert on how Functions on Python works, but below is my understanding.

You can think of each worker process for Python as a single "thread". Each worker process is processing one function execution at a time. If you utilize async calls in your functions (and mark your function itself as async), then while that async call is waiting for IO, the worker process will be able to execute some other function during that time. Otherwise, I do believe that the IO calls will block that worker process.

How Durable Functions is working when it polls activity functions is that it it will keep grabbing messages from the queue until it hits the maxConcurrentActivityFunctions. When it grabs the message, it schedules that function to be executed. Then, the functions runtime will load balance those function executions across your worker processes. I'm not sure how the exact load balancing works here, but I assume it is roughly round robin.

There are two potential bottlenecks here, each with different tradeoffs.

You hit the maxConcurrentActivityFunctions while you still have worker processes free to do work. In this case, you essentially have wasted compute, because you have more work that the VM could be doing, but you hit your max concurrency count.
All of your worker processes are busy, but you haven't hit maxConcurrentActivityFunctions. We keep grabbing new activity functions, and putting them on the backlog of some worker process. In this case, you are fully utilizing the compute you are paying for, but those activity function messages are going to have higher latency, as they will wait to execute until the worker process frees up. This is where utilizing async is helpful, because then the worker process can move on to it's backlog while waiting for IO.

One thing to potentially look into is how long your orchestration functions typically execute for (if you have app insights enabled, this should be straightforward to find out). If the majority of your compute time is spent in activities, it may be worthwhile to keep maxConcurrentActivityFunctions at 6, as even if an activity function is waiting on an orchestration function to finish, it wouldn't be blocked for too long before resuming, and you won't have as much wasted compute.

We are currently looking into ways to make the Durable Functions concurrency settings work more cleanly with the worker process model, but this may take some time, as we don't currently have worker process information exposed to us. I hope that my descriptions above helps in the meantime.

ConnorMcMahon on 16 Nov 2020

@ConnorMcMahon Thank you so much, this is incredibly helpful.

The max number of orchestrations any one user creates is 5 (1 parent and 4 sub). The total number of functions at any one time during those orchestrations is 9 (including the orchestrator functions I just mentioned).

I understand that orchestrator functions get paused while waiting for activities/sub-orchestrations to complete and then are later rehydrated.

While the orchestrator function waits for an activity to complete, is it still using up one of the worker processes? Similarly, is it still using up one of the maxConcurrentOrchestratorFunctions?

If so, I would think we should have 10 worker processes to satisfy the maximum of 9 concurrent functions for a single user. Otherwise, if the paused orchestrators are not consuming these worker processes, I think we can have fewer worker processes and get more CPU power allocated to each function.

marcd123 on 16 Nov 2020

Orchestrator functions while waiting to complete do not take up the worker process nor count against your maxConcurrentOrchestratorFunctions. The reason is that when you reach a yield statement, the function execution is actually completed, but the orchestration itself is still "running". Then, when a new orchestration message for this instance happens (be that an external event, a timer firing, or an activity/suborchestration completed), we will restart the orchestration function from the beginning. This time however, because new events have been added to the history, we may be able to make it further in the orchestration this function execution.

We call this orchestration replay, and you can read more about it in our reliability documentation. This replay behavior is why we enforce code constrains in orchestrator functions. Sometimes it can be helpful to schedule a single orchestration and step through it in the debugger to see how this works.

ConnorMcMahon on 16 Nov 2020

@ConnorMcMahon That's great, I think we can definitely get more out of CPU resources then.

In regard to partitionCount, how is it decided which control queue a requested orchestration goes into?

In our case the partition count is 4. Let's say our first VM only buffers up to 4 orchestrator functions, but we requested 6 orchestrator functions. Do those remaining two orchestrator functions remain in the first VM's control queue? Do they get moved to another control queue after they've aged enough?

marcd123 on 16 Nov 2020

Our algorithm for which partition an orchestration instance id is essentially the below psuedocode:

partitionId = fnv1hash(instanceid) % partitionCount

This essentially guarantees that unless you are specifically selecting your instance ids to avoid it, the instance ids will be randomly assigned to a partition, ensuring that for large numbers of orchestration instances, they will be split evenly across partitions. It's important to note that this means that suborchestrations will likely be on a different partition than their parent orchestration.

Also note that our maxConcurrentOrchestrationFunctions and controlQueueBufferThreshold are per VM, not per partition. So when you are only on one VM at first, all 4 partitions will be on that app. As you scale out, the partitions will be reassigned to the new VMs, meaning once you hit 4 VMs, each partition will get full access to the maxConcurrentOrchestrationFunctions and controlQueueBufferThreshold.

ConnorMcMahon on 16 Nov 2020

Ah brilliant! I was not aware that a single VM will look at all control queues when scaled in to 1 VM.

That about concludes all the major questions I have. @ConnorMcMahon Thank you so much for all the help this has really clarified a lot of things I couldn't decipher from the docs and testing alone.

I'll close the issue with this comment. I hope others can make use of the information you've shared on this thread.

Cheers!

marcd123 on 16 Nov 2020

❤1

Thank you so much @ConnorMcMahon , for all the detailed answers here. I might use this as a reference for future performance questions, this was a really insightful exchange

davidmrdavid on 17 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings