Poison queue messages should be impossible if DurableTask.AzureStorage is implemented correctly, but there is no way to 100% guarantee this. The Azure Storage orchestration service should therefore have some kind of poison message handling to avoid generating unlimited amounts of log churn, storage I/O, and potentially billing charges.
Related issue: https://github.com/Azure/azure-functions-durable-extension/issues/334
There are three different ways we could handle poison messages:
Another interesting scenario: if an activity function runs in the consumption plan and takes longer than 10 minutes to execute, this could result in an infinite crash loop. This is because the functions host will recycle to kill the long-running activity function. When the host restarts, it will try to re-execute that activity function and fail again (and this will continue indefinitely).
Wondering if the team has any sort of timeline on this issue? Is it on the radar for the next release at all? We have a situation where our functions are prone to infinite crash loop and its producing a lot of errors in our customers error logs.
No specific timelines to share yet, other than that it won't be available for the v1.7 release (which is next). More likely it will make it's way into a v2.0 release.
Because any message is just one of a sequence of internal and inter-related messages, it's not really sufficient for us to create a simple "poison queue" like we would for a regular queue-trigger function. Instead, I suspect we'll need to introduce a new "suspended" state for orchestrations and come up with an experience for detecting, pausing, and resuming suspended orchestrations. This will need to be a new concept which is supported by the Durable Task Framework.
Just out of interest, we implemented a workaround where we threw an exception from runaway function on a table storage lookup.
This at least stops the runaway function for running for days on end.
Most helpful comment
Another interesting scenario: if an activity function runs in the consumption plan and takes longer than 10 minutes to execute, this could result in an infinite crash loop. This is because the functions host will recycle to kill the long-running activity function. When the host restarts, it will try to re-execute that activity function and fail again (and this will continue indefinitely).