Azure-functions-durable-extension: Orchestration function - ActivityTrigger (fan-out/fan-in) returns wrong result (duplicated result from other activity trigger)

Created on 27 Oct 2020 · 3 Comments · Source: Azure/azure-functions-durable-extension

Description

This bug occurs intermittently, roughly once every few runs, in a function app running on the Consumption plan.
We use the fan-out/fan-in pattern to split a long-running process. We created an activity function that takes a set of data and processes it. At the orchestration level we create the chunks and call the activity function for each chunk so that they execute in parallel.

We use Task.WhenAll(...) to wait for all the activity triggers to complete and gather the results. We added logging to the activity functions; all of them execute correctly and produce the expected data.

But sometimes one or more activity triggers return an unexpected result, which is actually the result of another activity trigger. So at the end of the fan-in we end up with one duplicated result and one missing result.

(In the C# loop below, the results from two of the tasks are identical. The complete code is in the code section.)
foreach (var item in parallellTasks)
{
    // add the result from every activity function
    outPut += item.Result;
}

Note: We see this bug more frequently on the Consumption plan than on the Premium plan.

Consumption: occurs in approximately 1 in 4 runs
Premium: occurs in approximately 1 in 10 runs

Expected behavior

Total: 6855 data models. We split these across 69 parallel activity triggers (100 data models each, with the remainder in the last one) and expect to receive 69 unique results.
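
To make the chunk arithmetic concrete, here is a minimal sketch of the split described above. `ChunkPlanner` and `Plan` are hypothetical names used only for illustration and are not part of the reported app; the sketch assumes the same rule as the orchestrator code in this issue (fixed chunk size, with the last chunk taking the remainder).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class ChunkPlanner
{
    // Returns (offset, length) pairs that cover `total` items
    // in chunks of at most `chunkSize` items each.
    public static List<(int Offset, int Length)> Plan(int total, int chunkSize)
    {
        if (chunkSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        }

        var chunks = new List<(int Offset, int Length)>();
        for (int offset = 0; offset < total; offset += chunkSize)
        {
            // The final chunk may be shorter than chunkSize.
            chunks.Add((offset, Math.Min(chunkSize, total - offset)));
        }
        return chunks;
    }
}
```

With `Plan(6855, 100)` this yields 69 chunks: 68 chunks of 100 models and a final chunk of 55 models starting at offset 6800, which is why 69 distinct results are expected.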

Actual behavior

Simple scenario: We receive 69 results, but one is a duplicate, so one result is missing.

example :
activity trigger 1
activity trigger 2
activity trigger 3
.....

The results come back as follows:
activity 3 result
activity 2 result
activity 3 result
and so on ...

Complex scenario: Sometimes we receive 69 results, but 3 to 6 of them are duplicates, so the original results they replaced are missing.

The issue occurs randomly, and which results get duplicated is also random.

Each activity result is large (300 to 1200 KB), so the results are stored in the functionname-largemessages blob container.
For the faulty orchestration, if we compare the stored message results ("history-history-000000000000007E-TaskCompleted-Result.json.gz" and "history-0000000000000049-TaskCompleted-Result.json.gz"), both contain the same data.
The file names differ, but the data in one is an exact copy of the other. (These are the message payloads stored in blob storage.)
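
A comparison like the one described above can also be done in code once the results are in hand. The following is a minimal sketch, not something taken from the reported app: `DuplicateResultCheck` and `FindDuplicates` are hypothetical names, and the sketch assumes each activity result is available as a serialized string. It hashes every payload and reports the indices whose content is identical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical diagnostic helper, for illustration only.
public static class DuplicateResultCheck
{
    // Hashes a serialized payload so that large results can be compared cheaply.
    private static string Sha256Hex(string payload)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(payload));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // Returns groups of indices whose payloads have identical content.
    public static List<int[]> FindDuplicates(IReadOnlyList<string> serializedResults)
    {
        return serializedResults
            .Select((payload, index) => new { Hash = Sha256Hex(payload), Index = index })
            .GroupBy(x => x.Hash)
            .Where(g => g.Count() > 1)
            .Select(g => g.Select(x => x.Index).ToArray())
            .ToList();
    }
}
```

For the three-result example earlier in this section ("activity 3 result", "activity 2 result", "activity 3 result"), the helper would report a single group containing indices 0 and 2. Note that two chunks which legitimately produce identical output would also be flagged, so a hit is a prompt for closer inspection rather than proof of the bug.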

Relevant source code snippets

// Orchestration function: we receive 6855 DataModels and split them into chunks of 100 models each,
// which results in 69 parallel activity calls:

 [FunctionName("Compute")]
        public static async Task<OutputModel> RunOrchestrator(
           [OrchestrationTrigger] IDurableOrchestrationContext context, ILogger log)
        {
            var sessionContextObject = context.GetInput<SessionContext>();
            var inputData = inputDataSerializerService.Deserialize(sessionContextObject.InputModel);
            OutputModel outPut = new OutputModel();

            if (inputData != null)
            {
                int numberOfDataModels = inputData.DataModels.Length;
                outPut.NumberOfDataModels = numberOfDataModels;
                int numberOfDataModelsPerChunk = Math.Min(numberOfDataModels, 100);

                var parallellTasks= new List<Task<OutputModel>>();

                try
                {
                    for (int i = 0; i < numberOfDataModels; i += numberOfDataModelsPerChunk)
                    {                        
                        if (i + numberOfDataModelsPerChunk > numberOfDataModels)
                        {   
                            numberOfDataModelsPerChunk = numberOfDataModels - i;
                        }

                        DataModel[] dataModelSet = new DataModel[numberOfDataModelsPerChunk];
                        Array.Copy(inputData.DataModels, i, dataModelSet, 0, numberOfDataModelsPerChunk);
                        var serializedDataModels = inputDataSerializerService.SerializeDataModelArray(dataModelSet);
                        Task<OutputModel> task = context.CallActivityAsync<OutputModel>("ChunkCompute", serializedDataModels);
                        parallellTasks.Add(task);
                    }

                }
                catch (Exception ex)
                {
                    log.LogCritical(ex.Message);
                }

                await Task.WhenAll(parallellTasks);
                // We see the issue here, while aggregating the results after all activities have completed (results are duplicated)
                foreach (var item in parallellTasks)
                {
                    //add the result from every activity function
                    outPut += item.Result;
                }

                parallellTasks.Clear();
            }

            return outPut;
        }

       //Activity Function:

        [FunctionName("ChunkCompute")]
        public static async Task<OutputModel> ChunkCompute([ActivityTrigger] string dataModelData, ILogger log)
        {
            return await Task.Run(() =>
            {
                var dataModels = inputDataSerializerService.DeserializeToDataModelArray(dataModelData);
                var result = new OutputModel();

                for (int i = 0; i < dataModels.Length; i++)
                {
                    try
                    {
                        var dataModelOutput = dataModels[i].ProcessData(); //which calls internal code (referenced dll) for process
                        dataModelOutput.NumberOfSuccessfulDataModels = 1;
                        result += dataModelOutput;
                    }
                    catch (Exception ex)
                    {
                        log.LogError(ex.Message);
                    }
                }
                return result;
            });
        }

Known workarounds

Retry the function (orchestration) execution

App Details

  • Durable Functions extension version: 2.2.2
  • Azure Functions runtime version: 3.0.9
  • Programming language used: C#

Screenshots

N/A

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

  • Timeframe issue observed: 2020-10-26T14:07:29.517Z
  • Function App name: N/A
  • Function name(s): N/A
  • Azure region: US South Central
  • Orchestration instance ID(s): a633fb05663a404290920cb4153309d7, (160b269892694f7d80eb0064fa12c8fe - multiple duplicates)
  • Azure storage account name: N/A

a633fb05663a404290920cb4153309d7 instance other details:
DurableFunctionsInstanceId - a633fb05663a404290920cb4153309d7
Invocationid - 30b268e7-0c78-477f-9047-a99493bf03fe
executionid - 8d870d42df7f4808a7f36b6516fb9fa1
Partitionkey - a633fb05663a404290920cb4153309d7

160b269892694f7d80eb0064fa12c8fe instance other details:
DurableFunctionsInstanceId - 160b269892694f7d80eb0064fa12c8fe
Invocationid - bcdbb631-67a5-45b4-824e-4ac1974b3fb6
executionid - 363de04206c24fddaeedf8a540141feb
Partitionkey - 160b269892694f7d80eb0064fa12c8fe

All 3 comments

@cvanama

Unexpected duplicates make me think that this could be a split-brain issue; however, I'm not an authority on this, so I could be wrong.

I notice you are using version 2.2.2 of the Durable Functions extension, which contains known split-brain issues (any version prior to 2.3.x contains the issue).

You could upgrade to 2.3.0 and set "useLegacyPartitionManagement": false, which puts you on the new partition management strategy that specifically aims to prevent split-brain. An example of the config setting is here.

Alternatively, you could upgrade to the latest version (2.3.1), where you don't need to set the above configuration because the new strategy is on by default. Choose whichever you are most comfortable with, but I would recommend 2.3.1.
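
For readers who stay on 2.3.0, the setting mentioned above goes in host.json. The fragment below is a sketch that assumes the Durable Functions 2.x host.json layout, where Azure Storage provider settings sit under extensions.durableTask.storageProvider; any other settings already in the file are omitted here.

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "useLegacyPartitionManagement": false
      }
    }
  }
}
```

According to the comment above, this entry is unnecessary on 2.3.1, where the new strategy is already the default.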

+1 to what @olitomlinson said. Please try upgrading to v2.3.1 and let us know if that resolves the issue (or not).

Thanks for the suggestion @olitomlinson @cgillum. We upgraded the version and did a couple of runs, and so far we don't see this issue recurring in our dev environment. If we see the issue again, I will create a new thread referencing this issue number.

Thanks for your timely support and help. I am closing this for now.
