Azure-functions-durable-extension: Internal poison message handling

Created on 11 Jun 2018 · 4 Comments · Source: Azure/azure-functions-durable-extension

Poison queue messages should be impossible if DurableTask.AzureStorage is implemented correctly, but there is no way to 100% guarantee this. The Azure Storage orchestration service should therefore have some kind of poison message handling to avoid generating unlimited amounts of log churn, storage I/O, and potentially billing charges.

Design Proposal

There are three different ways we could handle poison messages:

None: The current behavior, which is to keep retrying indefinitely. This is problematic because it could add never-ending overhead to the application. It should remain the default for 1.x but not for 2.x.
Discard: Drop a message which appears to be a poison message - i.e. it has a DequeueCount of 100 or more (configurable). Depending on the type of message (e.g. an activity function return value message), this could result in an orchestration instance getting permanently stuck, which is potentially a data loss scenario.
Save: Save the poison message into storage for later processing. This option could have two sub-options: a) suspend the orchestration instance (yet another new feature) until the message is dealt with or b) allow processing other messages for the instance. A command could be exposed that attempts to dequeue all poison messages when the user is ready to try again. Orchestrations don't generally depend on ordering from the queue (they manage their own ordering), so it _should_ be okay to process other queue messages out of order. However, orchestrations could get stuck waiting for messages in the poison queue, and we would need to consider signaling this in the orchestration status.
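To make the three options concrete, here is a minimal sketch of how a dequeue-count check could pick between them. It is not the DurableTask.AzureStorage implementation: the names PoisonPolicy, QueueMessage, PoisonHandler, decide, and release_parked are hypothetical, and an in-memory list stands in for whatever storage the Save option would actually use. Only the DequeueCount threshold of 100 comes from the proposal above.

```python
from dataclasses import dataclass, field
from enum import Enum


class PoisonPolicy(Enum):
    """Hypothetical names for the three options in the proposal."""
    NONE = "none"        # keep retrying indefinitely (current behavior)
    DISCARD = "discard"  # drop the message once it looks poisoned
    SAVE = "save"        # park the message for later reprocessing


@dataclass
class QueueMessage:
    """Simplified stand-in for a queue message (assumed shape)."""
    message_id: str
    instance_id: str
    dequeue_count: int


@dataclass
class PoisonHandler:
    policy: PoisonPolicy
    max_dequeue_count: int = 100  # configurable threshold from the proposal
    parked: list = field(default_factory=list)  # stand-in for durable storage

    def is_poison(self, msg):
        # A message "appears to be poison" once it has been dequeued too often.
        return msg.dequeue_count >= self.max_dequeue_count

    def decide(self, msg):
        """Return 'process', 'discard', or 'park' for a dequeued message."""
        if self.policy is PoisonPolicy.NONE or not self.is_poison(msg):
            return "process"
        if self.policy is PoisonPolicy.DISCARD:
            return "discard"
        # SAVE: keep the message so the user can retry it later.
        self.parked.append(msg)
        return "park"

    def release_parked(self):
        """Hand back all parked messages when the user is ready to retry."""
        released, self.parked = self.parked, []
        return released
```

With the Save policy, a message whose dequeue count reaches the threshold is parked instead of processed, and release_parked plays the role of the "dequeue all poison messages" command. The sketch leaves out suspending the orchestration instance and reporting parked messages in the orchestration status.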

bug dtfx

cgillum

Most helpful comment

Another interesting scenario: if an activity function runs in the consumption plan and takes longer than 10 minutes to execute, this could result in an infinite crash loop. This is because the functions host will recycle to kill the long-running activity function. When the host restarts, it will try to re-execute that activity function and fail again (and this will continue indefinitely).

cgillum on 10 Oct 2018

👍2

All 4 comments

Wondering if the team has any sort of timeline on this issue? Is it on the radar for the next release at all? We have a situation where our functions are prone to an infinite crash loop, and it's producing a lot of errors in our customers' error logs.

gorillapower on 15 Nov 2018

No specific timelines to share yet, other than that it won't be available for the v1.7 release (which is next). More likely it will make its way into a v2.0 release.

Because any message is just one of a sequence of internal and inter-related messages, it's not really sufficient for us to create a simple "poison queue" like we would for a regular queue-trigger function. Instead, I suspect we'll need to introduce a new "suspended" state for orchestrations and come up with an experience for detecting, pausing, and resuming suspended orchestrations. This will need to be a new concept which is supported by the Durable Task Framework.

cgillum on 15 Nov 2018

👍1

Just out of interest, we implemented a workaround in which the runaway function throws an exception based on a Table Storage lookup.

After 10 minutes, the calling function inserts a record into Table Storage (the invocationId of the runaway function).
After restarting, the runaway function looks up its current invocationId in Table Storage and, if a record is found, throws an exception or completes.

This at least stops the runaway function from running for days on end.
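For anyone wanting to try something similar, the workaround could be sketched roughly as follows. This is an illustration and not our actual code: KillSwitchTable is an in-memory stand-in for a Table Storage table, and mark_as_runaway, run_activity, and RunawayDetected are hypothetical names. It assumes the invocationId stays the same when the host restarts and re-executes the function.

```python
import time


class KillSwitchTable:
    """In-memory stand-in for a Table Storage table keyed by invocation ID."""

    def __init__(self):
        self._rows = {}

    def insert(self, invocation_id):
        # Record when the invocation was flagged as a runaway.
        self._rows[invocation_id] = time.time()

    def exists(self, invocation_id):
        return invocation_id in self._rows


class RunawayDetected(Exception):
    """Raised by the activity when it finds itself flagged as a runaway."""


def mark_as_runaway(table, invocation_id, started_at, now, timeout_seconds=600):
    """Caller side: flag the invocation once it has run for 10 minutes."""
    if now - started_at >= timeout_seconds:
        table.insert(invocation_id)
        return True
    return False


def run_activity(table, invocation_id, work):
    """Activity side: after a (re)start, refuse to run if already flagged."""
    if table.exists(invocation_id):
        raise RunawayDetected(f"invocation {invocation_id} was flagged as runaway")
    return work()
```

The calling function would call mark_as_runaway on a timer, and the activity would go through run_activity at the top of its body, so a restarted runaway fails fast instead of looping.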

gorillapower on 4 Dec 2018
