Azure-functions-durable-extension: History Retention

Created on 16 Jul 2017 · 26 Comments · Source: Azure/azure-functions-durable-extension

Summary

Orchestration history will be deleted some number of days (e.g. 30 days) after the orchestration completes, fails, or terminates. Once this data is deleted, it will no longer be possible to query the status of the purged instances. The number of days will be configurable at the task hub level and the cleanup will be done automatically by the runtime.

Motivation

Orchestrations save their execution history in table storage. This history will grow indefinitely over time and increasing storage costs will be incurred on the Azure subscription.

Design Notes

When an orchestrator function completes, fails, or is terminated, a control message will be queued with an invisibility timeout of the retention period. Once the cleanup message is received, the host will take care of deleting the table storage records for that orchestration instance.
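The scheduling idea in these design notes can be sketched with an in-memory model. This is only an illustration under stated assumptions: `CleanupMessage`, `scheduleCleanup`, and `processCleanup` are hypothetical names, and the array and `Map` stand in for the control queue and the history table.

```typescript
// Hypothetical in-memory model of the proposed design; none of these names
// come from the Durable Functions codebase.
interface CleanupMessage {
  instanceId: string;
  visibleAt: Date; // the message stays invisible until this time
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Called when an orchestration completes, fails, or is terminated: enqueue a
// cleanup message whose invisibility timeout equals the retention period.
function scheduleCleanup(
  queue: CleanupMessage[],
  instanceId: string,
  finishedAt: Date,
  retentionDays: number
): void {
  queue.push({
    instanceId: instanceId,
    visibleAt: new Date(finishedAt.getTime() + retentionDays * MS_PER_DAY),
  });
}

// Process every message that has become visible by deleting the matching
// history records. Returns the purged instance IDs.
function processCleanup(
  queue: CleanupMessage[],
  history: Map<string, string[]>,
  now: Date
): string[] {
  const due = queue.filter((m) => m.visibleAt.getTime() <= now.getTime());
  const notDue = queue.filter((m) => m.visibleAt.getTime() > now.getTime());
  for (const message of due) {
    history.delete(message.instanceId);
  }
  // Keep only the messages that are still invisible.
  queue.splice(0, queue.length, ...notDue);
  return due.map((m) => m.instanceId);
}

// Usage: two instances finish on different days with a 30-day retention.
const cleanupQueue: CleanupMessage[] = [];
const historyTable = new Map<string, string[]>([
  ["order-1", ["ExecutionStarted", "ExecutionCompleted"]],
  ["order-2", ["ExecutionStarted", "ExecutionCompleted"]],
]);
scheduleCleanup(cleanupQueue, "order-1", new Date("2017-07-01T00:00:00Z"), 30);
scheduleCleanup(cleanupQueue, "order-2", new Date("2017-07-20T00:00:00Z"), 30);
const purgedIds = processCleanup(
  cleanupQueue,
  historyTable,
  new Date("2017-08-05T00:00:00Z")
);
// purgedIds is ["order-1"]; "order-2" stays until 2017-08-19.
```

Because `visibleAt` is fixed when the message is enqueued, the sketch also shows why the retention period cannot be changed after cleanup has been scheduled.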

Drawbacks

  • There is no way to change the retention period of an orchestration instance after the cleanup has been scheduled.
  • There is currently no way to archive the execution history of orchestrations. If this is desired, a separate work item should track this.
  • Cleanup will incur I/O on the storage account. This may impact the performance of currently executing orchestrations.

Labels: dtfx, enhancement

All 26 comments

@cgillum What's the current status of this? And is there any cleanup logic in place today?
Real question: if I want a particular instance with a specified id to only run once and, once completed, never be able to run again, is it okay to rely on GetStatusAsync indefinitely to check for its existence? Or should I expect that the record of its previous run will disappear eventually?

Right now we're leaning towards not implementing this due to the downsides involved. It should be safe for you to rely on the data always being there. Even if we do eventually implement retention policies, it would most likely be an opt-in feature that doesn't impact existing apps.

Hi @cgillum
According to the drawbacks, this proposal is about the History table. However, we could apply it to the Instances table to remove finished instances after a certain time period. For example, we could set a value in host.json such as 1 day or 7 days, and Durable Functions would then asynchronously remove the older instance records. Or should I discuss this in a new issue?

Here are some ideas for the four drawbacks.

  1. Impact on existing apps.
    -> For the Instances table, this doesn't occur.
  2. There is no way to change the retention policy until you deploy a new one.
    -> I don't have any ideas here, so we could accept it. For the Instances table, it doesn't seem like a big deal.
  3. We might need a separate work item for cleanup.
    -> How about having Durable Functions create a simple timer-triggered function when the setting is present in host.json? That would be almost the same as a customer creating a custom function. It could then address 4 as well, and we wouldn't need a separate work item. (Just an idea.)
  4. Cleanup incurs I/O on the storage account.
    -> How about setting the start time for the cleanup?

It's an interesting idea, here are some thoughts:

  • If we remove data only from the Instances table, then we have an inconsistency between the two tables. That might make it harder to implement a full data retention feature later which cleans up both History and Instances tables if the data is inconsistent.
  • I think your comment about (3) is very interesting. Maybe we can create an API on DurableOrchestrationClient which allows cleaning up instance data. Then developers can use a timer trigger like you suggest to implement any cleanup policy they want (and we don't have to implement/design one for them). I like this idea because then the developer is in control and can make the best decision for their scenario.
  • Doing clean-up at start time might be tricky, especially since there could be many VMs running at the same time and maybe some VMs start up later than others. I also worry that this could negatively impact cold-start time, but maybe that's not a big issue since it only blocks background processing and doesn't have to necessarily impact HTTP triggers or webhooks.
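
The developer-controlled approach in the second bullet could look roughly like the sketch below: a timer-triggered function calls a cleanup API and applies whatever policy the app needs. The `CleanupClient` interface and its methods are hypothetical, since no such API existed at this point in the thread, and the in-memory client exists only to exercise the policy.

```typescript
// Hypothetical shapes for illustration; the real client API had not been
// designed at this point in the thread.
type RuntimeStatus = "Running" | "Completed" | "Failed" | "Terminated";

interface InstanceInfo {
  instanceId: string;
  runtimeStatus: RuntimeStatus;
  lastUpdatedTime: Date;
}

interface CleanupClient {
  listInstances(): InstanceInfo[];
  purgeInstance(instanceId: string): void;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Body of a timer-triggered function: purge every finished instance whose
// last update is older than the retention window. Running instances are
// never touched.
function runCleanupPolicy(
  client: CleanupClient,
  now: Date,
  retentionDays: number
): string[] {
  const cutoff = now.getTime() - retentionDays * DAY_MS;
  const purged: string[] = [];
  for (const instance of client.listInstances()) {
    const finished = instance.runtimeStatus !== "Running";
    if (finished && instance.lastUpdatedTime.getTime() < cutoff) {
      client.purgeInstance(instance.instanceId);
      purged.push(instance.instanceId);
    }
  }
  return purged;
}

// In-memory stand-in for the client, used only to exercise the policy.
class InMemoryCleanupClient implements CleanupClient {
  private instances: InstanceInfo[];

  constructor(instances: InstanceInfo[]) {
    this.instances = instances;
  }

  listInstances(): InstanceInfo[] {
    return this.instances.slice();
  }

  purgeInstance(instanceId: string): void {
    this.instances = this.instances.filter((i) => i.instanceId !== instanceId);
  }
}

// Usage: only the finished instance older than 30 days is purged.
const demoClient = new InMemoryCleanupClient([
  { instanceId: "old-done", runtimeStatus: "Completed", lastUpdatedTime: new Date("2018-01-01T00:00:00Z") },
  { instanceId: "old-running", runtimeStatus: "Running", lastUpdatedTime: new Date("2018-01-01T00:00:00Z") },
  { instanceId: "new-failed", runtimeStatus: "Failed", lastUpdatedTime: new Date("2018-02-25T00:00:00Z") },
]);
const purgedByPolicy = runCleanupPolicy(demoClient, new Date("2018-03-01T00:00:00Z"), 30);
// purgedByPolicy is ["old-done"]
```

Keeping the policy in application code means the retention window, the set of statuses to purge, and the schedule can all change with a normal deployment.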

@cgillum Then let's implement (3) with your suggestion. I don't clearly understand how to add the feature to DurableOrchestrationClient. Could you share that? If it's only a small change to DurableOrchestrationClient, all we need to do is add documentation and a sample for it. :)

Activity functions could return sensitive data (should I mention GDPR? ;) ) as intermediate output. This return value is stored in the History table even after the orchestration has completed.
Is there any way to clear this result?
A cleanup property on 'DurableOrchestrationClientBase' would be cool, so that it can be set and the cleanup is done once the orchestration completes.

I need to have this feature in GA; my large messages will blow up my storage account.

Hello @cgillum,

I wanted to confirm with you whether the current idea for implementing this issue is still to add a new API to DurableOrchestrationClient that is responsible for cleaning up the instance data. Developers could then use a timer trigger to implement any cleanup policy they want.

Thank you!

@cgillum How is retention applied to the data payloads passed between functions? My orchestrator fans out to multiple activity functions and aggregates the responses, which have the potential to grow rapidly. What is the default retention here, and what happens if the accumulated data grows significantly over time?

@gled4er yes. At a minimum, I think we want the ability to delete all data associated with just a single instance ID. For more convenience, we can also have an API which takes in some filter parameters, like we do for the new instance query API that @TsuyoshiUshio recently added here.
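
The two API shapes mentioned here (delete one instance by ID, or delete by filter parameters) can be sketched as follows. The field names in `PurgeFilter` are assumptions chosen to resemble a query filter and are not taken from the actual API; the `Map` stands in for the stored instance data.

```typescript
// Hypothetical filter shape; not the actual Durable Functions API.
type OrchestrationState = "Running" | "Completed" | "Failed" | "Terminated";

interface InstanceRecord {
  instanceId: string;
  createdTime: Date;
  state: OrchestrationState;
}

interface PurgeFilter {
  createdTimeFrom: Date;
  createdTimeTo?: Date;
  states?: OrchestrationState[];
}

// True when the record falls inside the time window and, if a status list
// is given, has one of the listed states.
function matchesFilter(record: InstanceRecord, filter: PurgeFilter): boolean {
  const created = record.createdTime.getTime();
  if (created < filter.createdTimeFrom.getTime()) {
    return false;
  }
  if (filter.createdTimeTo !== undefined && created > filter.createdTimeTo.getTime()) {
    return false;
  }
  if (filter.states !== undefined && filter.states.indexOf(record.state) === -1) {
    return false;
  }
  return true;
}

// Delete exactly one instance by ID. Returns true if something was removed.
function purgeById(store: Map<string, InstanceRecord>, instanceId: string): boolean {
  return store.delete(instanceId);
}

// Delete every instance matching the filter. Returns the number removed.
function purgeByFilter(store: Map<string, InstanceRecord>, filter: PurgeFilter): number {
  let removed = 0;
  for (const [id, record] of Array.from(store.entries())) {
    if (matchesFilter(record, filter)) {
      store.delete(id);
      removed++;
    }
  }
  return removed;
}

// Usage: purge finished instances created in January 2018, then one by ID.
const instanceStore = new Map<string, InstanceRecord>([
  ["a", { instanceId: "a", createdTime: new Date("2018-01-05T00:00:00Z"), state: "Completed" }],
  ["b", { instanceId: "b", createdTime: new Date("2018-01-10T00:00:00Z"), state: "Running" }],
  ["c", { instanceId: "c", createdTime: new Date("2018-03-01T00:00:00Z"), state: "Failed" }],
]);
const removedByFilter = purgeByFilter(instanceStore, {
  createdTimeFrom: new Date("2018-01-01T00:00:00Z"),
  createdTimeTo: new Date("2018-02-01T00:00:00Z"),
  states: ["Completed", "Failed", "Terminated"],
});
const removedC = purgeById(instanceStore, "c");
// removedByFilter is 1 (only "a"); removedC is true; "b" remains.
```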

@AbhishekTripathi All inter-function communication is associated with a particular orchestration instance ID. If a user deletes that instance, all the durable state related to that instance, including inter-function communication artifacts in storage, will get deleted. So if your orchestrator function fans out to 100K activity functions, and you then use the API to delete that instance, all the parameter data for those 100K activity functions will also be deleted. Does that answer your question?

Are we talking about an existing API or a proposed one? My question is about the current state of things.

@AbhishekTripathi I'm talking about the proposed API (since that's what this GitHub issue is tracking). We don't have any retention policy today. All data stays in the storage account permanently unless it is manually deleted. So yes, it is possible that this data could grow quite a bit over time depending on the load.

@cgillum would it be available by v2 GA?

@AbhishekTripathi Ideally yes. If not, it should come shortly afterwards.

The function I talked about is built on the v1 runtime. It goes without saying that the proposed enhancement should be compatible with v1 functions too.

@AbhishekTripathi yes, that's the plan.

Hello @cgillum,

Thank you for the clarification!

This functionality has been added in the above PRs. I'm happy to say that it will be made available in the next release (v1.7.0)!

I see the purgeInstanceHistory API is available in the durable client. Does it delete the history and the instance from both Azure Storage tables?

Since the name says "history", it's a bit confusing. Can you clarify whether it deletes from both tables?

Update - Verified: it deletes entries from both tables. I still feel the name is a bit confusing.
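
The behavior verified above, a purge removing an instance's rows from both tables, can be modeled in a few lines. This is a sketch and not the library's implementation: `purgeInstanceState` is a hypothetical name, and the key layout (instance ID as the partition key, empty row key in the Instances table) is an assumption for illustration.

```typescript
// Hypothetical in-memory rows. The partitionKey/rowKey layout below is an
// assumption for illustration, not a specification of the storage schema.
interface TableRow {
  partitionKey: string; // assumed to hold the orchestration instance ID
  rowKey: string;
  payload: string;
}

interface TaskHubTables {
  history: TableRow[]; // many rows per instance, one per history event
  instances: TableRow[]; // one summary row per instance
}

interface PurgeResult {
  historyRowsDeleted: number;
  instanceRowsDeleted: number;
}

// Remove every row that belongs to the instance from both tables, including
// any activity inputs and outputs stored in its history rows.
function purgeInstanceState(tables: TaskHubTables, instanceId: string): PurgeResult {
  const keep = (row: TableRow): boolean => row.partitionKey !== instanceId;
  const historyBefore = tables.history.length;
  const instancesBefore = tables.instances.length;
  tables.history = tables.history.filter(keep);
  tables.instances = tables.instances.filter(keep);
  return {
    historyRowsDeleted: historyBefore - tables.history.length,
    instanceRowsDeleted: instancesBefore - tables.instances.length,
  };
}

// Usage: purging "order-1" leaves "order-2" untouched in both tables.
const taskHub: TaskHubTables = {
  history: [
    { partitionKey: "order-1", rowKey: "0000", payload: "ExecutionStarted" },
    { partitionKey: "order-1", rowKey: "0001", payload: "TaskCompleted: activity result" },
    { partitionKey: "order-1", rowKey: "0002", payload: "ExecutionCompleted" },
    { partitionKey: "order-2", rowKey: "0000", payload: "ExecutionStarted" },
  ],
  instances: [
    { partitionKey: "order-1", rowKey: "", payload: "Completed" },
    { partitionKey: "order-2", rowKey: "", payload: "Running" },
  ],
};
const purgeResult = purgeInstanceState(taskHub, "order-1");
// purgeResult is { historyRowsDeleted: 3, instanceRowsDeleted: 1 }
```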

@cgillum I think we still need this to be configurable. I understand that we only have an API, but what we want is to not have to remember and maintain this cleanup process.

@MhAllan while I'm inclined to agree, have you considered using a Recurrence-triggered Logic App or Timer-triggered Azure Function to do this for you so you _don't_ have to remember?

@brandonh-msft, thank you for your reply.

Yes, these solutions will work. Unfortunately, they feel more like an on-premises hack than a cloud design, especially when diagnosing the "small, encapsulated microservice" depends on "another custom tool to make it perform well". Another negative aspect is that anyone new to Azure Functions will wonder, "Then when do these tables get truncated?" The answer is: they don't, you do it! That is a major turn-off in my opinion.

@brandonh-msft I have tried using a Recurrence-triggered Logic App to clean up both the History and Instances tables for a Durable orchestration. However, when attempting to clean the Instances table, that table appears to have an empty RowKey for all records, so the "Delete Entity" operation doesn't succeed, claiming a RowKey value is required for the delete operation. Is it expected that RowKey is null for the Instances table? If so, is it possible to use Logic Apps to clean the table?

I see the History table _does_ have RowKey set... are you seeing something different?

The Instances table, however, does not have a value in RowKey, as you say. I will see if I can create a workaround for this in lieu of DF being updated to set the RowKey to an actual value (cc @cgillum).

Update: I'm not able to see a way to make this happen. I'll reach out to the Logic Apps team for some insight here.

@brandonh-msft Yes, that's consistent with what I see as well: the History table does have a RowKey, and Logic App cleanup works fine for that table. It's just the Instances table that doesn't have a RowKey, so I can't get Logic Apps to work for cleaning that one. It would be great if you could suggest a workaround, or if DF could be updated to populate a RowKey so that Logic Apps can clean up that table. Thanks!
