Azure-functions-durable-extension: Durable Entity signaling duplication at scale

Created on 28 Jul 2020 · 10 Comments · Source: Azure/azure-functions-durable-extension

Description

We're finding what appear to be race conditions between different host instances that are processing signals for a single Durable Entity when operating at scale.

The fact that a particular Entity instance is running over multiple function host instances is concerning - I thought this shouldn't happen since the entity name/key are hashed to a single control queue but I could be mistaken.

What we're observing is:

Signal to HostInstance1, Entity1
1. adds a to existing List in entity: [a]
Signal to HostInstance2, Entity1
1. adds b to existing List in entity: [a, b]
Signal to HostInstance1, Entity1
1. adds c to existing List in entity: [a, c]

Now there is a 'gap' in the list, where b is not there.
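A minimal sketch of one plausible interleaving that produces this gap. It is a simplified model in plain C#, not the Durable Functions runtime: the `LostUpdateDemo` class and its `Load`/`Save` helpers are hypothetical stand-ins for a host reading and writing persisted entity state, and it assumes HostInstance1 keeps working from a stale in-memory copy.

```csharp
using System;
using System.Collections.Generic;

public static class LostUpdateDemo
{
    // Hypothetical stand-in for the entity state persisted in storage.
    private static List<string> _stored = new List<string>();

    // A "host" loads a private copy of the persisted state.
    public static List<string> Load() => new List<string>(_stored);

    // A "host" writes its copy back, overwriting whatever is there (last writer wins).
    public static void Save(List<string> state) => _stored = new List<string>(state);

    public static List<string> Run()
    {
        _stored = new List<string>();

        // HostInstance1 processes the first signal and persists [a].
        var host1 = Load();
        host1.Add("a");
        Save(host1);

        // HostInstance2 loads [a], appends "b", and persists [a, b].
        var host2 = Load();
        host2.Add("b");
        Save(host2);

        // HostInstance1 still works from its stale copy [a] instead of reloading,
        // so its write replaces [a, b] and "b" is lost.
        host1.Add("c");
        Save(host1);

        return Load();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run())); // prints "a, c"
    }
}
```

Neither write fails in this model; the update is lost silently, which matches a gap that only shows up later in the data.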

Context

We are using Durable Entities to do store+forward of audio from online meetings. We record audio at 50 fps, where each frame is 20ms.

We have a separate recorder service sending each frame immediately to service bus (no send batching yet). So a 15 minute meeting is 45k messages.

A ServiceBusTrigger'd Azure Function consumes these messages, using batch and session capabilities, then sends the batch directly to a durable entity corresponding to the meeting where we quickly add each message in the batch to a list. After the meeting ends we do some more expensive processing on it.

Expected behavior

Durable Entities is supposed to be processing messages serially, so we relied on this for aggregating audio chunks. We expected an Append signal to a Durable Entity to be processed against the latest state of that entity.
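For reference, a hedged sketch of what the entity side of such an `Append` operation could look like, using the class-based entity pattern from the Durable Functions 2.x extension. This is not our actual code: the `EncounterEntity` class name, the `byte[][]` payload type, and the `Chunks` property are assumptions made for illustration; only `IEncounterEntityMessageHandler` and `Append` come from the snippet further down.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

// Interface used by the client-side SignalEntityAsync<T> proxy call.
public interface IEncounterEntityMessageHandler
{
    // The payload type is an assumption; the original post does not show it.
    void Append(byte[][] content);
}

[JsonObject(MemberSerialization.OptIn)]
public class EncounterEntity : IEncounterEntityMessageHandler
{
    // Persisted entity state: the ordered list of audio chunks (hypothetical name).
    [JsonProperty("chunks")]
    public List<byte[]> Chunks { get; set; } = new List<byte[]>();

    // Each signal is expected to run against the latest persisted state,
    // one operation at a time per entity instance.
    public void Append(byte[][] content) => Chunks.AddRange(content);

    // Entry point that routes incoming operations to the methods above.
    [FunctionName(nameof(EncounterEntity))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<EncounterEntity>();
}
```

With serial execution, the length of `Chunks` after N signals should equal the sum of the N batch sizes; the behavior described below violates that.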

Actual behavior

In most simple cases this works fine, but when we have 5 concurrent 3-minute meetings (9,000 messages for each meeting in Service Bus, which correspond to batches of variable sizes signaled to a Durable Entity for each meeting) we start observing this behavior where, e.g.:

A HostInstance1 will process a signal message of batch size 10 and add to a list of length 2000, resulting in a list of length 2010
A HostInstance2 will process another signal message of batch size 10 and add to a list of length 2000, resulting in a list of length 2010 <- this is wrong, it should have added to the latest list of length 2010
A HostInstance1 will process a signal message of batch size 10 and add to a list of length 2000, resulting in a list of length 2010

We will eventually find a gap in the resulting data.

Relevant source code snippets

From a session-enabled, batch-enabled ServiceBusTrigger we are signaling the entity.

            private async Task ProcessContent(IDurableEntityClient entityClient)
            {
                await entityClient.SignalEntityAsync<IEncounterEntityMessageHandler>(
                    GetEntityId(),
                    proxy => proxy.Append(_content!)).ConfigureAwait(false);
            }

host.json:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "connectionStringName": "AzureWebJobsStorageDurable",
        "partitionCount": 16,
        "maxQueuePollingInterval": "00:00:03",
        "controlQueueVisibilityTimeout": "00:01:00"
      },
      "useGracefulShutdown": true
    }
  },
  "functionTimeout": "00:10:00",
  "logging": {
    "logLevel": {
      "default": "Information"
    }
  }
}

I can send the full code - I also work at MS so I can provide more details here.

Known workarounds

Verified that if we reduce the max instances down to 1 we don't see this issue - but this is not a scalable solution for us.

App Details

Durable Functions extension version (e.g. v1.8.3): 2.2.2
Azure Functions runtime version (1.0 or 2.0): 2.0
Programming language used: C#

Screenshots

An initial Signal to the entity to append a batch of chunks

Via host instance 7598b32f-ed22-4fcd-a068-1db8385d071d
Operation ID: 4ad996c48780b54e86c438467b6db512

A few milliseconds later, we continue to add chunks

Via the same host instance 7598b32f-ed22-4fcd-a068-1db8385d071d
Operation ID: 4ad996c48780b54e86c438467b6db512

Then we see a trace to add new chunks to an old list.

This is on a different host instance 2b60a848-d12d-4b8f-85c8-a922c6e45635
Operation ID: db216e7e41255444bd7311091a98b505

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

Timeframe issue observed: Start (UTC): 2020-07-28T13:56:04.1813235Z. End (UTC): 2020-07-28T13:56:04.4366077Z
Function App name: daxteamsdev-adi-uploadprem
Function name(s): ProcessAudio
Azure region: westus2
Orchestration instance ID(s): @encounterentity@411f1300-401f-4fcc-90b3-c51fc7a09c81
Azure storage account name: daxteamsdevadiuploadv2

For more info, I'm at Microsoft, alias: adunnith

If you don't want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it's extremely difficult to look up information.

bug dtfx

Source

adiun

Most helpful comment

@anel-al sorry for the delayed response on this. The core work to mitigate this issue has been completed. We just need to add another PR to allow customers to turn it on (the fix will be opt-in initially so we can get some additional validation before turning it on by default). It will be available in our next release which hopefully will go out by the end of this week, or next week at the latest.

cgillum on 20 Aug 2020

👍2

All 10 comments

@adiun thanks for reaching out and reporting this. You are correct that entities get hashed to a single control queue. One problem that is known to happen occasionally is that partition ownership will change as the app gets allocated to more VMs. These partition transitions are currently uncoordinated, which means that two VMs may temporarily think they own a partition at the same time, resulting in an entity being processed by two instances concurrently. We're actively testing a fix for this and hope we can resolve this in our next release. The issue is being tracked here: https://github.com/Azure/durabletask/issues/410

You mentioned that you could work around it by setting the max number of instances to one (which isn't scalable). Another temporary workaround you could try is running your app in a dedicated App Service plan with a fixed number of VM instances > 1. This problem occurs when adding new VMs while entities or orchestrations are actively processing messages. If you always use a fixed number, then I expect you won't see this.

@ConnorMcMahon you may want to take a quick look at this one to confirm whether this is indeed a split-brain issue (and whether your fix would address it).

cgillum on 28 Jul 2020

Thanks @cgillum for the quick response! I tested another workaround at your suggestion with Premium plan, 5 min/max/warmed-up instances with short meetings (5 minutes = 15k messages) and long meetings (20 min = 60k messages) and it worked reliably without this error occurring.

The occasional split-brain issue would explain why we are seeing this for a relatively small number of chunks and at scale.

For our user scenarios we are hoping for scale-to-zero, so we would definitely benefit from the partition management fix if that's indeed the fix for this issue.

adiun on 29 Jul 2020

Awesome, I'm glad that workaround is working so far and I definitely understand the desire to scale to zero. I'm hopeful we can get a fix out and available to you relatively soon.

cgillum on 29 Jul 2020

👍1

Sounds like the same split-brain issue as here: https://github.com/Azure/azure-functions-durable-extension/issues/1131

olitomlinson on 29 Jul 2020

Hi,

Just to confirm that we are running into the same problem. When processing child items for a large parent mass action and tracking status update counts, the number of status update signals received on the parent entity does not match the number of items.

I am testing with 10K items, each running a processing activity and sending 2 status update messages to the item's entity and a common aggregator entity. Scale is limited to 4 instances.

It looks like reducing the number of instances to 1 puts the machine under pressure, so that aggregator state updates are heavily delayed in favor of running activities. So while reducing the number of instances solves the incorrect signal count, it introduces another problem: the snapshot of the aggregator state is of limited use for monitoring.

The entity that contains the list of items (static data) is separated from the entity containing runtime details (frequent updates) because of entity size. From what I could see while testing, the size of the heavily updated entity affects performance a lot. Is this the case?

Currently, I am writing a small POC evaluating the use of entities for our business case, and this issue could be a potential showstopper. Can we rely on a fix for this issue in the next version?

Thank you,
Anel

anel-al on 6 Aug 2020

cgillum on 20 Aug 2020

👍2

Thank you cgillum.

anel-al on 24 Aug 2020

The new configuration setting has been merged (see https://github.com/Azure/azure-functions-durable-extension/pull/1445).

cgillum on 25 Aug 2020

We might be experiencing the same issue and would like to try to switch to the new partition management strategy. I couldn't find any information on configuration. How do you enable the new strategy?
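A hedged sketch of what opting in might look like in host.json. It assumes the setting in question is the `useLegacyPartitionManagement` flag under `storageProvider`, as listed in the host.json reference for Durable Functions extension 2.3.0 and later; nothing in this thread names the setting, so the linked PR and the release notes remain the authoritative source.

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "useLegacyPartitionManagement": false
      }
    }
  }
}
```

Under that assumption, `false` selects the new partition management strategy and `true` keeps the legacy behavior; other `storageProvider` settings such as `partitionCount` can sit alongside it unchanged.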