We've been using Durable Functions to build human interaction workflows into our products, and it works wonderfully. We recently started paying more attention to disaster recovery and had our first successful practice run last week. We are now looking into recovery scenarios caused by human error.
Let's say we have a lot of in-flight orchestrations waiting for human approval and we accidentally delete some components of the Task Hub logical container, including the queues, the tables, and the blobs. Would it be possible to recover from that? I think the question really is how to protect against resource deletion, which we have been doing by enabling soft delete on the blob container and applying a resource lock. However, it seems that individual queues or tables can still be deleted.
If my question is out of place please feel free to redirect me to a better place to ask.
Thanks,
hey @anhhnguyen206
I think this is something that you need to solve at the Azure Resource level. A possible solution is to use RBAC in your environments, in order to minimise the footprint of users who are able to run delete operations against the queues.
https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad#azure-built-in-roles-for-blobs-and-queues
Unfortunately, I don't know of any "soft-delete" capability for Azure Storage queues or tables, but I will ask around to see if I can get any more concrete answers regarding that.
In this case, I would follow @olitomlinson's recommendation of trying to prevent these human errors as opposed to trying to recover once they have happened.
That being said, if your orchestrations are entirely idempotent, you could theoretically store each orchestration's inputs as a blob as the first activity in your orchestration, and only delete them as the last activity of your orchestration. Then, in a catastrophic human error event, you could iterate through all of these blobs and start a new orchestration instance with the same inputs.
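To make the idea above concrete, here is a minimal sketch in Python. It is not Durable Functions code: `InputStore` is a hypothetical in-memory stand-in for a blob container, and `start_new` is a hypothetical callback standing in for whatever your client uses to start an orchestration. It only illustrates the save-first, delete-last, replay-on-disaster flow, and it assumes the orchestrations are idempotent.

```python
import json
from typing import Callable, Dict, List, Tuple


class InputStore:
    """Hypothetical in-memory stand-in for a blob container of saved inputs."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def save(self, instance_id: str, payload: dict) -> None:
        # First activity of the orchestration: persist the inputs.
        self._blobs[instance_id] = json.dumps(payload)

    def delete(self, instance_id: str) -> None:
        # Last activity of the orchestration: the inputs are no longer needed.
        self._blobs.pop(instance_id, None)

    def items(self) -> List[Tuple[str, dict]]:
        return [(key, json.loads(value)) for key, value in self._blobs.items()]


def replay_all(store: InputStore, start_new: Callable[[str, dict], None]) -> List[str]:
    """Start a new orchestration for every input blob that still exists."""
    restarted = []
    for instance_id, payload in store.items():
        start_new(instance_id, payload)  # assumption: safe to re-run (idempotent)
        restarted.append(instance_id)
    return restarted
```

Any blob still present after the task hub is lost corresponds to an orchestration that had started but not finished, so those are exactly the instances to restart.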
@olitomlinson @ConnorMcMahon thanks for all the good suggestions.
I think I'm going with @olitomlinson's suggestion to reduce the number of people who have a high level of access to those critical storage resources.
@ConnorMcMahon, we're currently doing something similar: we take a snapshot of the workflow state every time it moves to a new state and save that snapshot to a blob, then purge it once the workflow completes. We can then restart the workflow and get it back to its current state by using the snapshot.
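For anyone finding this later, a rough sketch of that snapshot pattern, under stated assumptions: the state names in `STATES`, the `Snapshot` shape, and the in-memory `SnapshotStore` are all hypothetical placeholders for illustration, not our actual implementation or any Azure API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional

# Hypothetical linear workflow; real workflows may branch.
STATES = ["submitted", "awaiting_approval", "approved", "completed"]


@dataclass
class Snapshot:
    instance_id: str
    state: str
    data: dict


class SnapshotStore:
    """Hypothetical in-memory stand-in for a blob container of snapshots."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def write(self, snap: Snapshot) -> None:
        self._blobs[snap.instance_id] = json.dumps(asdict(snap))

    def read(self, instance_id: str) -> Optional[Snapshot]:
        raw = self._blobs.get(instance_id)
        return Snapshot(**json.loads(raw)) if raw is not None else None

    def purge(self, instance_id: str) -> None:
        self._blobs.pop(instance_id, None)


def advance(store: SnapshotStore, snap: Snapshot) -> Snapshot:
    """Move to the next state; snapshot it, or purge once the workflow completes."""
    next_state = STATES[STATES.index(snap.state) + 1]
    updated = Snapshot(snap.instance_id, next_state, snap.data)
    if next_state == "completed":
        store.purge(snap.instance_id)
    else:
        store.write(updated)
    return updated
```

After a loss, `store.read(instance_id)` returns the last recorded state, which is where a restarted workflow would resume.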