Azure-functions-durable-extension: Sub-orchestration fan-out/fan-in occasionally returns wrong result

Created on 18 Sep 2019 · 7 Comments · Source: Azure/azure-functions-durable-extension

Description

This bug has been occurring randomly for a few months in one of our Function apps and has been really hard to pin down.

We use a fan-out/fan-in pattern in order to split a job into multiple sub-orchestrations. We use Task.WhenAll(...) to wait for all the sub-orchestrations to complete and gather the results.
Apparently, sometimes one of the sub-orchestrations will return an unexpected result, which is actually the same result as another sub-orchestration. So at the end of the fan-in, we end up with one duplicate result, and one missing result.

Expected behavior

We split a job into 17 parallel sub-orchestrations, and expect to receive 17 unique results.

Actual behavior

We receive 17 results, but one is a duplicate, and therefore one result is missing.

Our results are large (a few KBs) so they are stored in the optimizerhub-largemessages blob container.
For the faulty sub-orchestration, if I correlate the message-{someid}-suborchestrationinstancecompleted.json.gz blob with the corresponding history-{rowkey}-suborchestrationinstancecompleted-result.json.gz blob, the contents don't match.
The content of the history-{rowkey}-suborchestrationinstancecompleted-result.json.gz blob is actually a copy of another sub-orchestration result.

Known workarounds

Retry the orchestration.

App Details

Durable Functions extension version: 1.8.0
Azure Functions runtime version: 2.0
Programming language used: C# (compiled)

Screenshots

N/A

If deployed to Azure

Timeframe issue observed: September 16th 21:08 UTC
Function App name: Can't share
Function name(s): "Optimization_Orchestrator", "Optimization_SubOrchestrator"
Azure region: Canada East
Azure storage account name: Can't share
Orchestration instance ID(s): 043e8885b43b4c41ae8655fbf6945d54

More details:

Task Hub: Optimizer Hub
Relevant blobs:

These blobs should contain the same result but they don't.

043e8885b43b4c41ae8655fbf6945d54/message-000000000000006c-suborchestrationinstancecompleted.json.gz

043e8885b43b4c41ae8655fbf6945d54/history-000000000000002c-suborchestrationinstancecompleted-result.json.gz

Instead, the result blob is a duplicate of:

043e8885b43b4c41ae8655fbf6945d54/history-000000000000002f-suborchestrationinstancecompleted-result.json.gz

Labels: bug, fix-ready

Source

Costo

Most helpful comment

@Costo

I have found a bug in the DurableTask.AzureStorage project in the way that large message blob names are created. The sequence number is what is used to construct the someid section of the blob name message-{someid}-suborchestrationinstancecompleted.json.gz. This is unfortunately not guaranteed to be unique, as each instance of your application keeps track of its own sequence number.

This means that it's possible for messages sent from different VMs to have the same naming scheme for blobs, which can cause the suborchestration that finishes later to overwrite the already finished message of another suborchestration before it uploads its content in the history table. This is why you are seeing this mismatched behavior.

I have a PR already to address this, and it will go into our next release.

Thanks for your patience with this issue.

ConnorMcMahon on 17 Oct 2019

🎉4
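
To make the collision described in the comment above concrete, here is a minimal sketch. It is not the DurableTask.AzureStorage code: the Worker class, the in-memory dictionary standing in for the blob container, and the payload strings are all hypothetical, and the 16-digit hex formatting of the sequence number is an assumption inferred from the blob names listed in the issue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one VM: it numbers outgoing messages with a
// counter that is private to its own process.
public sealed class Worker
{
    private long _sequenceNumber;

    public Worker(long startingSequenceNumber) => _sequenceNumber = startingSequenceNumber;

    // Follows the blob naming pattern quoted in the issue. The 16-digit hex
    // formatting is an assumption, not confirmed library behavior.
    public string NextBlobName(string instanceId, string eventType)
    {
        long id = _sequenceNumber++;
        return $"{instanceId}/message-{id:x16}-{eventType}.json.gz";
    }
}

public static class BlobNameCollisionDemo
{
    // Two workers upload results addressed to the same parent instance.
    // Returns the simulated container (blob name -> payload).
    public static Dictionary<string, string> Run()
    {
        const string parentInstanceId = "043e8885b43b4c41ae8655fbf6945d54";
        const string eventType = "suborchestrationinstancecompleted";

        var container = new Dictionary<string, string>();
        var vmA = new Worker(0x6c);
        var vmB = new Worker(0x6c); // independent counter that happens to hold the same value

        // Both names are identical, so the second write replaces the first.
        container[vmA.NextBlobName(parentInstanceId, eventType)] = "result of sub-orchestration 3";
        container[vmB.NextBlobName(parentInstanceId, eventType)] = "result of sub-orchestration 7";
        return container;
    }

    public static void Main()
    {
        foreach (var blob in Run())
        {
            Console.WriteLine($"{blob.Key} => {blob.Value}");
        }
    }
}
```

In this model only one blob survives, holding the payload of the sub-orchestration that wrote last. Anything that later reads that blob name on behalf of the first sub-orchestration gets the second one's result, which matches the duplicate-plus-missing pattern reported above.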

All 7 comments

We put a mechanism in place to detect the problem. We pass a SubOrchestrationIndex property to each sub-orchestration and expect to receive the same value in the output. If we receive a different value, it means there was a mix-up in the sub-orchestration results.

This has not happened yet. We will monitor the Function App and I'll update this issue if we detect the problem again.

Here's the code we put in place:


var subOrchestrationTasks = new Task<Optimizer.Output>[problems.Length];
for (var i = 0; i < problems.Length; i++)
{
    var problem = problems[i];
    var task = context.CallSubOrchestratorWithRetryAsync<Optimizer.Output>("Optimization_SubOrchestrator", 
        new RetryOptions(TimeSpan.FromSeconds(10), 3), 
        new SubOrchestrator_Optimization.Input
        {
            SubOrchestrationIndex = i,
            Problem = problem,
            // ... (remaining properties elided)
        });
    subOrchestrationTasks[i] = task;
}

var optimizerResults = await Task.WhenAll(subOrchestrationTasks);

// We detect sub-orchestration results mix-up in the following loop 
for (var i = 0; i < problems.Length; i++)
{
    var result = optimizerResults[i];

    if (result.SubOrchestrationIndex != i)
    {
        throw new Exception($"Wrong SubOrchestration result! Expected {nameof(result.SubOrchestrationIndex)} to be {i}, but was {result.SubOrchestrationIndex} instead");
    }
}

Costo on 18 Sep 2019

Thanks for the detailed analysis. Just to clarify, the problem that's impacting you specifically is that you're getting the wrong set of results (as described) when fanning in on your parent orchestration, right?

cgillum on 19 Sep 2019

> Thanks for the detailed analysis. Just to clarify, the problem that's impacting you specifically is that you're getting the wrong set of results (as described) when fanning in on your parent orchestration, right?

Yes, that is correct.

As mentioned above, we put a piece of code in place that just throws if the problem is detected.
In the last 24h, we detected this problem 12 times across 3 of our tenants, out of a total of ~280 orchestrator executions.

Sample error messages:

'Optimization_Orchestrator' failed: Wrong SubOrchestration result! Expected SubOrchestrationIndex to be 7, but was 3 instead. TenantId: [tenant1]
'Optimization_Orchestrator' failed: Wrong SubOrchestration result! Expected SubOrchestrationIndex to be 30, but was 27 instead. TenantId: [tenant2]
'Optimization_Orchestrator' failed: Wrong SubOrchestration result! Expected SubOrchestrationIndex to be 0, but was 15 instead. TenantId: [tenant3]

Costo on 20 Sep 2019

@Costo

Can you share how you are generating the value of problems? Generally, when we see errors where the values of different Task objects are swapped, it indicates a non-deterministic code pattern, where non-deterministic behavior in the orchestrator causes the code to replay differently than it did on the execution when the Tasks were originally scheduled.

ConnorMcMahon on 7 Oct 2019
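
As an illustration of the failure mode this comment asks about, the sketch below uses a deliberately simplified model of replay in which recorded results are handed back to tasks purely by the order they are scheduled in. The ReplayDemo class and its inputs are hypothetical, and this is not the Durable Task Framework's actual replay logic. As the maintainer's later comment explains, the root cause in this issue turned out to be blob naming rather than non-determinism.

```csharp
using System;
using System.Collections.Generic;

public static class ReplayDemo
{
    // Simplified model: the n-th task scheduled during replay receives the
    // n-th recorded result, regardless of which input it was scheduled with.
    public static Dictionary<string, string> Replay(
        IReadOnlyList<string> scheduleOrder,
        IReadOnlyList<string> recordedResults)
    {
        var resultByInput = new Dictionary<string, string>();
        for (var taskId = 0; taskId < scheduleOrder.Count; taskId++)
        {
            resultByInput[scheduleOrder[taskId]] = recordedResults[taskId];
        }
        return resultByInput;
    }

    public static void Main()
    {
        // The first execution scheduled A, B, C, so history holds results in that order.
        var recorded = new[] { "result-A", "result-B", "result-C" };

        var sameOrder = Replay(new[] { "A", "B", "C" }, recorded);
        var reordered = Replay(new[] { "B", "A", "C" }, recorded);

        Console.WriteLine(sameOrder["A"]); // result-A
        Console.WriteLine(reordered["A"]); // result-B: A now holds B's result
    }
}
```

Under this model, any orchestrator code that can schedule the same calls in a different order on replay (for example, iterating an unordered collection) would produce swapped results similar to the symptom reported here.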

@ConnorMcMahon

The problems array is the result of a previous activity. Is it possible that the order of the items in the array is non-deterministic?

Costo on 13 Oct 2019

Removed comment, after reviewing it appears that the issue is different enough IMO to open a separate issue.

antempus on 24 Oct 2019
