Azure-functions-durable-extension: Long delay between starting and triggering orchestration function

Created on 8 Feb 2019  路  11Comments  路  Source: Azure/azure-functions-durable-extension

Describe the bug
I have two functions:

[FunctionName("TriggerViaHttp")]
public static async Task<HttpResponseMessage> TriggerViaHttpAsync(
    [HttpTrigger(AuthorizationLevel.Anonymous, "GET", Route = "start")] HttpRequest request,
    [OrchestrationClient] DurableOrchestrationClient orchestrationClient,
    ExecutionContext executionContext,
    CancellationToken cancellationToken)
{
    await orchestrationClient.StartNewAsync("StartOrchestration", new { });

    return new HttpResponseMessage(HttpStatusCode.OK);
}


[FunctionName("StartOrchestration")]
public static Task StartOrchestrationAsync(
    [OrchestrationTrigger] DurableOrchestrationContext context,
    ExecutionContext executionContext,
    CancellationToken cancellationToken) => Task.CompletedTask;

The problem is in a big delay between execution of orchestrationClient.StartNewAsync and actual execution of StartOrchestration function.

Investigative information

  • Durable Functions extension version: 1.7.1
  • Function App version: 2.0
  • Programming language used: C#

To Reproduce

  1. Send GET /start request

Expected behavior
StartOrchestration function is triggered in milliseconds after sending this request.

Actual behavior
Sometimes StartOrchestration function is triggered after significant amount of time like several seconds (I can even see the messages in %taskHubName%-control queue in the storage that were not consumed immediately)

Additional context

I can reproduce such behavior only if I have another non-durable function deployed to the same function app. Is that even correct to have durable and non-durable functions in the same function app? I can't find anything about it in the documentation.

enhancement performance

Most helpful comment

I get the point about the cold start that it might be up to 30 sec and I think it's something we can live with.

But what I described (6 sec delay) looks like different case (not a cold start) since I can see from logs that instance c5ebb91e036a688675d9c1e68ffd5bf888fffe4ddd19de33f4ad2a13b18f9128 was active like 14 sec before the last (delayed) execution.

Real use-case

I use durable functions for a chat system.
The design of this system is the same as in repository I shared in previous comments - there is a Http trigger to initiate sending a message to chat and an activity function that actually sends this message via web sockets.
Users of this chat are experiencing the following behavior (consider we made some API calls beforehand to prevent a cold start):

  1. User sends a message - message is delivered in 100 ms
  2. User sends a message - message is delivered in 150 ms
  3. User sends a message - message is delivered in 6 sec
  4. User sends a message - message is delivered in 4 sec
  5. User sends a message - message is delivered in 90 ms
  6. User sends a message - message is delivered in 5 sec

etc. so even after a dozens of http requests they still experience these delays.

The long delivery time in some cases (several seconds) is caused by not immediate start of the orchestration. So it looks unacceptable to have such delays for this use case.

From what are you saying I understand that these random delays are caused by backoff polling of the queues in the Durable Task hub to avoid huge storage account charges and that I can avoid polling using only one instance. But in my case I definitely need more than one instance (to make the solution scalable) as well as quick orchestration starting. Maybe there are some configurable settings for this polling period?

All 11 comments

Hi @myarotskaya, do you see this behavior when running locally, in Azure, or both?

Is that even correct to have durable and non-durable functions in the same function app?

Yes, both durable and non-durable functions in the same function app is perfectly fine. I'm not sure how that could factor in, unless another function is consuming all the CPU or if you have some global concurrency throttles configured.

Are there any non-default configuration values in your host.json file?

@cgillum I've deployed to Azure code from this repository - there are only AF version and hubName in host.json file.
You can check the latest invocations of TriggerViaHttp and StartOrchestration functions - the delay before starting the second function is 6 sec.

Investigative information:

  • Timeframe issue observed: 2/10/2019, 2:48:30 PM - 2/10/2019, 2:48:40 PM
  • Function App name: durable-functions-test
  • Function name(s): NonDurableFunction, TriggerViaHttp, StartOrchestration
  • Region: Central US
  • Orchestration instance ID(s): 8728605dab9f09b147cbe58c19facb23863180efe17791d8765720d363485094, c5ebb91e036a688675d9c1e68ffd5bf888fffe4ddd19de33f4ad2a13b18f9128

Events form logs:

Event time: 2/10/2019, 2:48:32 PM
Request name: TriggerViaHttp
Response time: 17.6 ms
Cloud role instance: 8728605dab9f09b147cbe58c19facb23863180efe17791d8765720d363485094
InvocationId: 39952750-8e97-41c4-a071-eb9fa3318223

Event time: 2/10/2019, 2:48:38 PM
Request name: StartOrchestration
Cloud role instance: c5ebb91e036a688675d9c1e68ffd5bf888fffe4ddd19de33f4ad2a13b18f9128
InvocationId: 2db305ba-28e0-4f4d-8caf-e07c9b8903fd

Hi @cgillum! I just found in documentation that

The orchestration process should start within 30 seconds.

So maybe it's expected that I have such delays?

Hi @myarotskaya sorry for the delay, but thank you for the details. I looked at this more closely and I think I have an explanation. I think the instance ID you're referring to is actually 931d6e863df947c282fd12b960bd9647 (that's the one I found which took around 30 seconds).

You are correct that there can be a delay of up to 30 seconds. The reason is that we do backoff polling of the queues in the Durable Task hub to avoid huge storage account charges to you if you're not frequently running orchestrations. If no orchestrations have been run on a particular VM instance in a while, it can take up to 30 seconds for the message to get picked up. I think that's what happened here.

There are a few optimizations we have in place to try to reduce this. For example, if your function app is only scaled out to a single instance, we can remove this latency by telling the queue polling loop to poll right away after we detect someone called the StartNewAsync API. We can't always do this when your app is scaled out, though, because only certain VMs are allowed to run certain orchestrations based on a partitioning logic we've come up with (more details in our Performance & Scale documentation).

In your case, I see that your HTTP trigger scheduled this orchestration to run, but a different VM was responsible for running it, and that VM likely hadn't run any orchestrations for a while.

If the maximum 30 seconds delay is problematic for you, we could look into making this configurable.

I get the point about the cold start that it might be up to 30 sec and I think it's something we can live with.

But what I described (6 sec delay) looks like different case (not a cold start) since I can see from logs that instance c5ebb91e036a688675d9c1e68ffd5bf888fffe4ddd19de33f4ad2a13b18f9128 was active like 14 sec before the last (delayed) execution.

Real use-case

I use durable functions for a chat system.
The design of this system is the same as in repository I shared in previous comments - there is a Http trigger to initiate sending a message to chat and an activity function that actually sends this message via web sockets.
Users of this chat are experiencing the following behavior (consider we made some API calls beforehand to prevent a cold start):

  1. User sends a message - message is delivered in 100 ms
  2. User sends a message - message is delivered in 150 ms
  3. User sends a message - message is delivered in 6 sec
  4. User sends a message - message is delivered in 4 sec
  5. User sends a message - message is delivered in 90 ms
  6. User sends a message - message is delivered in 5 sec

etc. so even after a dozens of http requests they still experience these delays.

The long delivery time in some cases (several seconds) is caused by not immediate start of the orchestration. So it looks unacceptable to have such delays for this use case.

From what are you saying I understand that these random delays are caused by backoff polling of the queues in the Durable Task hub to avoid huge storage account charges and that I can avoid polling using only one instance. But in my case I definitely need more than one instance (to make the solution scalable) as well as quick orchestration starting. Maybe there are some configurable settings for this polling period?

Hi @cgillum,

I have been working on this case with Azure Support engineers. They confirmed that these ~6 seconds delays happened quite frequently and that they are cased not by the cold start. The resolution of this ticket is:

As I checked with our internal team the Durable Function is mostly useful to define stateful workflows. It can manages state, checkpoints, and restarts for you.
And it may be more suitable for the system which not require real time update very sensitive.

So I wonder, are Durable Functions really supposed to be used only in the systems that are ok with several seconds delays? If it's true than it should be reflected in the documentation somewhere (or maybe it is already?) since it can save quite amount of time for developers like me who don't consider that such delays are even possible.

And another question: maybe there is any chance that we can decrease polling period to prevent these delays (I understand that it increases storage account charges but still)? I just don't get how come that regular Azure Functions trigger on queue storage messages almost immediately but Durable Functions are not.

Hi @myarotskaya. Yes, these delays are not by polling delays and not by cold-starts.

are Durable Functions really supposed to be used only in the systems that are ok with several seconds delays?

Yes. Generally speaking, latency should always be assumed when using systems that are based on queues. These delays can be caused by:

  1. Polling latencies, which we have discussed here.
  2. Backlogs - i.e. the queues getting too full and not enough consumers to process all the messages quickly.

Some queueing systems can mitigate the delays caused by polling by supporting long-polling. Azure Service Bus supports this, for example. However, Azure Storage (which Durable Functions current depends on) queues do not.

In the short term, we can definitely introduce a configuration knob to decrease the polling period. I've opened this issue to track that work. If you are interested in working with us to test it out, we can prioritize it higher.

Longer term, we're looking into supporting alternative storage providers that have different performance guarantees, such as Service Bus, Event Hubs, or Redis. The cost characteristics may be quite different and we started out with Azure Storage because of the consumption billing model and because it's already a requirement for function apps. But increasingly we're seeing a need for alternate providers, primarily due to performance concerns like this one.

I just don't get how come that regular Azure Functions trigger on queue storage messages almost immediately but Durable Functions are not.

The most likely answer is that regular queue storage triggers in Azure Functions are stateless. This means that as you scale out, more VMs are polling the same queue, making it more likely that one of them will pick up the message quickly. With Durable Functions, the control queues are essentially stateful, which means only ever be polled by a single VM. Because of this, scaling out actually increases the likelihood of delays since the VM polling the queue does not know when another VM adds a message to that queue.

@cgillum thank you for the explanation! I definitely interested in working on the issue you created, please let me know how I can help.

Hi @myarotskaya, I went ahead and put this in our Triage milestone. That means it will be in our near-term backlog but we haven't yet identified a particular release time. We accept PRs, so if you're interested in contributing, I added some instructions here: https://github.com/Azure/azure-functions-durable-extension/issues/618 (unfortunately it requires making changes in two different repos). Otherwise we can try to get to it relatively soon.

@cgillum thanks for the instructions! I've made #627 to resolve the second step from #618. Could you please take a look?

Closing this issue, now that we have a workaround as well as documentation for this behavior.

Was this page helpful?
0 / 5 - 0 ratings