Hi
I'm having issues with RewindAsync.
Running the orchestration works fine, but when it is triggered through RewindAsync it fails with "Non-Deterministic workflow detected: TaskScheduledEvent: 5 TaskScheduled DispatchSignalREvent".
DispatchSignalREvent is the next activity it would have triggered normally.
Would be great to get some input. I've been "banging my head" over this issue for a couple of days now.
That's interesting. I'd like to try reproducing your issue. Would you be willing to provide your orchestrator's code? This can be in an "anonymized" form: what I'm interested in are the types of activities you're scheduling (CallActivityAsync, CallSubOrchestratorAsync, WaitForExternalEvent, etc.), which ones are failing, and at what point you're trying to rewind.
Hi @kashimiz ! Sorry about the slow response.
Here is the code affected. RewindAsync() fails with "Non-Deterministic workflow detected: TaskScheduledEvent: 5 TaskScheduled DispatchSignalREvent" after "await context.CallActivityAsync(FN.UpdateDb, new UpdateDbRequest".
I've also discovered _another issue_. If we change DEFAULT_EXTERNAL_EVENT_TIMEOUT from 3 days to 7 days, it fails saying it has expired. Any clues?
private static readonly TimeSpan DEFAULT_EXTERNAL_EVENT_TIMEOUT = TimeSpan.FromDays(3);

[FunctionName("MainFlow")]
public static async Task Run([OrchestrationTrigger]DurableOrchestrationContextBase context, TraceWriter log)
{
    string applicationReference = null;
    try
    {
        var application = context.GetInput<App.Application>();
        application.IdNo = application.IdNo.GetNumbersOnly();
        applicationReference = application.Reference;

        var decisionResult = await context.CallActivityAsync<ValidationResult>("ValidateFunc", application);
        context.SetCustomStatus(decisionResult.ErrorMessages);

        var customer = Mapper.MapFrom(application, decisionResult);
        customer = await context.CallActivityAsync<Customer>("CreateCustomer", customer);
        application.CustomerId = customer.Id != Guid.Empty
            ? customer.Id.ToString()
            : throw new ArgumentNullException("CustomerId");

        var defaultProduct = await context.CallActivityAsync<ProductType>("GetProduct", application.RefId);
        var opportunity = new Opportunity(application.CustomerId, defaultProduct)
        {
            Name = $"{customer.FirstName} {customer.LastName}"
        };
        opportunity = await context.CallActivityAsync<Opportunity>("CreateApplication", opportunity);
        application.ApplicationId = opportunity.Id != Guid.Empty
            ? opportunity.Id.ToString()
            : throw new ArgumentNullException("ApplicationId");

        await context.CallActivityAsync(FN.UpdateDb, new UpdateDbRequest
        {
            OpportunityId = application.ApplicationId,
            DecisionResult = decisionResult
        });

        await context.CallActivityAsync(FN.DispatchSignalREvent,
            new SignalRMessage(SIGNALR_HUB, EVT_VALIDATION_SUCCEEDED, applicationReference, JsonConvert.SerializeObject(new { applicationId = application.ApplicationId })));

        var verifiedEvent = await context.WaitForExternalEvent<VerifiedEvent>(EVT_VERIFIED, DEFAULT_EXTERNAL_EVENT_TIMEOUT);
        if (verifiedEvent == null)
        {
            throw new Exception("Verification failed");
        }
        //.....................
        //.....................
    }
    catch(Exception ex)
    {
        log.Error($"({context.InstanceId}) FAILED!{Environment.NewLine}{ex.Message}", ex);
        if (!string.IsNullOrWhiteSpace(applicationReference))
        {
            await context.CallActivityAsync(FN.DispatchSignalREvent, new SignalRMessage(SIGNALR_HUB, EVT_ERROR, applicationReference, ex.Message));
        }
        throw;
    }
}
I have a theory; please correct me if I'm wrong.
We wrap the orchestration logic in a try/catch. If "FN.UpdateDb" throws, the catch block calls the activity "FN.DispatchSignalREvent" with "EVT_ERROR". If we then fix "FN.UpdateDb" and rewind using the RewindAsync method, the orchestrator proceeds to "FN.DispatchSignalREvent" with "EVT_VALIDATION_SUCCEEDED" and fails with "Non-Deterministic workflow detected", because the same activity already ran at that point in the history, but with a different input.
Hi,
Is there any update on this issue? We have the same problem.
Hi @rupakraj6. No response yet; we are also still waiting. Is there any news, @cgillum @kashimiz?
My apologies for the delay in getting back to you on this, @pvujic .
On RewindAsync / NonDeterministicWorkflowException
I was able to reproduce your issue and your hypothesis is mostly correct: because your orchestrator catches the exception thrown by FN.UpdateDb and schedules another activity function before re-throwing the exception, the rewind process is getting tangled up, causing the non-deterministic workflow exception you see.
However, this happens regardless of whether you call FN.DispatchSignalREvent with EVT_VALIDATION_SUCCEEDED after calling FN.UpdateDb; on the replay triggered after the rewind, the orchestrator never reaches this second FN.DispatchSignalREvent call and never replays FN.UpdateDb. This is because the rewind process's cleanup phase fails to scrub the history events of the first FN.DispatchSignalREvent (the one scheduled from the catch block), so they remain and confuse the Durable Task Framework.
In summary: right now, in order for the rewind process to work, the failed step in the orchestrator must be the last step executed before the orchestrator itself fails. This is due to a logical oversight in our implementation. I'm sorry this bug is impacting you, but thank you for discovering it and bringing it to our attention.
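Given that constraint, one possible way to keep an orchestration rewindable is to avoid scheduling anything from the orchestrator's catch block and to move the error notification into the activity itself. The following is only a minimal sketch under that assumption, not a confirmed fix: the names RewindFriendlyFlow, UpdateDbSafe, NotifySuccess, RunAndReportAsync, UpdateDbAsync and NotifyErrorAsync are hypothetical placeholders and do not come from the original code.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class RewindFriendlyFlow
{
    [FunctionName("RewindFriendlyFlow")]
    public static async Task Run(
        [OrchestrationTrigger] DurableOrchestrationContextBase context)
    {
        var applicationId = context.GetInput<string>();

        // No catch block that schedules further activities: if UpdateDbSafe
        // throws, it is the last step executed before the orchestration fails.
        await context.CallActivityAsync("UpdateDbSafe", applicationId);
        await context.CallActivityAsync("NotifySuccess", applicationId);
    }

    [FunctionName("UpdateDbSafe")]
    public static Task UpdateDbSafe(
        [ActivityTrigger] string applicationId, TraceWriter log)
    {
        // Error reporting happens inside the activity, so it never shows up
        // as a separate scheduled task in the orchestration history.
        return RunAndReportAsync(
            () => UpdateDbAsync(applicationId),
            ex =>
            {
                log.Error($"UpdateDb failed for {applicationId}: {ex.Message}", ex);
                return NotifyErrorAsync(applicationId, ex.Message);
            });
    }

    [FunctionName("NotifySuccess")]
    public static Task NotifySuccess([ActivityTrigger] string applicationId)
    {
        // Hypothetical placeholder for the success notification.
        return Task.CompletedTask;
    }

    // Runs the work, reports any failure, then rethrows so the activity
    // (and therefore the orchestration) is still marked as failed.
    public static async Task RunAndReportAsync(
        Func<Task> work, Func<Exception, Task> onError)
    {
        try
        {
            await work();
        }
        catch (Exception ex)
        {
            await onError(ex);
            throw;
        }
    }

    // Hypothetical placeholders for the real database and SignalR calls.
    private static Task UpdateDbAsync(string applicationId) => Task.CompletedTask;

    private static Task NotifyErrorAsync(string applicationId, string message) =>
        Task.CompletedTask;
}
```

The trade-off is that every activity that needs error reporting has to wrap its own body, but the orchestration history then ends with the failed activity, which is the condition described above.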
On WaitForExternalEvent Timeouts
Regarding your WaitForExternalEvent call failing when you set your timeout to 7 days, this is likely due to a limitation imposed by our use of Azure Storage to queue events. Because Azure Storage queues can only hold messages for 7 days, CreateTimer and WaitForExternalEvent must time out prior to 7 days. We document the limitation for timers here but it looks like our external event documentation is missing that disclaimer; I'll get that added.
We do have a feature request out to extend this limitation beyond 7 days. Mark Heath has come up with a workaround for the timer scenario you may be able to adapt to the WaitForExternalEvent case.
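As a rough illustration of how a timer-chaining approach might be adapted to external events, the sketch below waits for an event while racing it against a series of shorter timers, each well under the 7-day limit. It is an assumption-laden sketch, not the referenced workaround itself: LongExternalEventWait, WaitForExternalEventLong, NextFireAt and the 3-day MaxTimerSlice are hypothetical names and values chosen for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class LongExternalEventWait
{
    // Assumption: 3 days per timer keeps each timer well below the 7-day limit.
    private static readonly TimeSpan MaxTimerSlice = TimeSpan.FromDays(3);

    // Returns the next timer expiry: now + maxSlice, capped at the deadline.
    public static DateTime NextFireAt(DateTime now, DateTime deadline, TimeSpan maxSlice)
    {
        DateTime candidate = now.Add(maxSlice);
        return candidate < deadline ? candidate : deadline;
    }

    // Waits for the event until the overall timeout elapses.
    // Returns default(T) if no event arrived in time.
    public static async Task<T> WaitForExternalEventLong<T>(
        DurableOrchestrationContextBase context, string eventName, TimeSpan timeout)
    {
        // CurrentUtcDateTime is replay-safe, unlike DateTime.UtcNow.
        DateTime deadline = context.CurrentUtcDateTime.Add(timeout);

        // Subscribe once, without a timeout; the timers below enforce the deadline.
        Task<T> eventTask = context.WaitForExternalEvent<T>(eventName);

        while (context.CurrentUtcDateTime < deadline)
        {
            DateTime fireAt = NextFireAt(context.CurrentUtcDateTime, deadline, MaxTimerSlice);
            using (var cts = new CancellationTokenSource())
            {
                Task timerTask = context.CreateTimer(fireAt, cts.Token);
                Task winner = await Task.WhenAny(eventTask, timerTask);
                if (winner == eventTask)
                {
                    // Cancel the pending timer so the orchestration can complete.
                    cts.Cancel();
                    return await eventTask;
                }
            }
        }

        return default(T);
    }
}
```

In the orchestrator above, the call might then look like `await LongExternalEventWait.WaitForExternalEventLong<VerifiedEvent>(context, EVT_VERIFIED, TimeSpan.FromDays(10))`, with the existing null check handling the timeout case.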
Thank you for the response. It clarifies much.
Reactivating so that we don't forget to actually fix this.
I am able to rewind if I use a try/catch at the activity function level and throw the exception from there.
Can anyone advise whether this is good practice?