Azure-functions-durable-extension: Duplicate activity execution can result in stuck orchestrations

Created on 24 Apr 2019 · 2 Comments · Source: Azure/azure-functions-durable-extension

Issue

Duplicate activity executions have been observed to leave orchestrations stuck. Duplicate activity function executions have a few causes. Most are related to heavy load, particularly in environments with a lot of partition movement. Some known issues in the Durable Task Framework layer:

Here is a specific case that shows what happens under the right conditions (note that this data is only visible to authorized Microsoft employees):

// East US
let instId = 'b45d7d108602447dbfd9519f57b6b917';
DurableFunctionsEvents
| where TIMESTAMP between (datetime(2019-04-18 23:00:00) .. 1h)
| where InstanceId == instId or TargetInstanceId == instId or instanceId == instId 
| where ProviderName in ('DurableTask-AzureStorage', 'DurableTask-Core')
| extend ActId = substring(ActivityId, 0, 8), RelActId = substring(RelatedActivityId, 0, 8), ExId = substring(ExecutionId, 0, 4), MsgId = substring(MessageId, 0, 8) 
| project TIMESTAMP, EventStampName, RoleInstance, Pid, Tid, ActId, RelActId, Level, MsgId, ExId, ProviderName, TaskName, message, EventType, TaskEventId, PartitionId, Age, ETag, NewEvents, DequeueCount, Details, LatencyMs, TaskHub, ExtensionVersion

There seems to be another issue where the orchestration does not actually transition to a failed state, resulting in messages that get replayed every 5-10 minutes. Users may observe messages in their control queues with very large dequeue counts when their orchestrations get into this state.

Fix

The fix will need to be in the DurableTask.AzureStorage NuGet package, which is where the internal messaging layer is implemented. The code there will need to ensure that TaskCompleted events are deduplicated before they are written to the history table.
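As a rough illustration of the deduplication idea only (not the actual DurableTask.AzureStorage code, which is C#), the following Python sketch drops repeated TaskCompleted events before a batch would be appended to history. It assumes each event is a plain dictionary and that a completion can be matched to its originating task through a `TaskScheduledId` field. The function name `dedupe_task_completed` and the `already_completed` parameter are hypothetical names introduced for this example.

```python
from typing import Iterable, List, Optional, Set


def dedupe_task_completed(
    events: Iterable[dict],
    already_completed: Optional[Set[int]] = None,
) -> List[dict]:
    """Return events with duplicate TaskCompleted entries removed.

    Assumption: a TaskCompleted event is a duplicate when another completion
    with the same TaskScheduledId was seen earlier in this batch or is listed
    in `already_completed` (completions already present in history).
    """
    seen: Set[int] = set(already_completed or ())
    kept: List[dict] = []
    for event in events:
        if event.get("EventType") == "TaskCompleted":
            scheduled_id = event["TaskScheduledId"]
            if scheduled_id in seen:
                # Second completion for the same scheduled task: skip it.
                continue
            seen.add(scheduled_id)
        # Order of the surviving events is preserved.
        kept.append(event)
    return kept


# Hypothetical batch in which task 0 reported completion twice.
batch = [
    {"EventType": "TaskScheduled", "EventId": 0},
    {"EventType": "TaskCompleted", "TaskScheduledId": 0, "Result": "a"},
    {"EventType": "TaskCompleted", "TaskScheduledId": 0, "Result": "a"},
    {"EventType": "TaskCompleted", "TaskScheduledId": 1, "Result": "b"},
]
deduped = dedupe_task_completed(batch)
```

In this sketch the first completion wins and later ones are discarded. A real fix would also have to decide how to treat duplicates whose results differ, which the issue does not specify.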

Workaround

There is no workaround for this issue. However, it can be avoided by running orchestrations in stable environments that are less likely to encounter duplicate activity executions - e.g. minimizing partition movement by either having a single partition or by running apps across a set of VMs that do not change frequently.

bug dtfx

Source

cgillum

All 2 comments

We observed this last night; I am happy I didn't have to open this bug report myself, because I would have sounded much more uncertain about what the issue at hand actually was. :)

In our case, one of the triggered activities failed with an exception in my code, and the other failed with a generic "The wait operation timed out" (which I believe came from my code). The orchestration itself didn't fail immediately, and hung until 6 hours after its CreatedTime (I assume there's a 6-hour orchestration timeout?).

In case it helps having a second instanceID to look at, ours was cd1cd793-56cf-47ea-aa8b-b4c34405d579. Our function app was running in the Consumption tier, so I'm not sure we have as much control over that environment as your workaround offers. Ignoring side-effects of Consumption tier, I believe we had no more than 3 activity functions running simultaneously (we likely had 3 orchestrations running, but I think most were asleep on timers) when this issue manifested. Our activity functions themselves should not have long runtimes, so I wouldn't expect our app service instance itself would have been under heavy load.

Feel free to reach out if there's any other context I can provide; I'm MSFTInternal. :)