Azure-functions-durable-extension: Support for Geo-redundant Application

Created on 27 Dec 2019 · 35 Comments · Source: Azure/azure-functions-durable-extension

We want to deploy our application in multiple regions to support failover in case of a region failure.

Approach
Deploy the application to multiple regions, use the same TaskHub across all regions, and make sure only one application is running at any given point in time. Turn off the Always On flag in the application so that the application won't pick up orchestrations as soon as the region recovers. Once a region fails, start the application in the backup region so that all the in-flight orchestrations will be picked up by the backup application.

Problem
The application is picking up orchestrations from TaskHub as soon as it starts even before any http requests are received with always on flag off.

Repro-steps
(All the below steps are done using Azure Portal)

  1. Turn off the Always On flag, start the application, and create an orchestration
  2. While the orchestration is running, stop the application, wait for a few minutes, and start the application
  3. Ideally, the application is not supposed to pick up in-flight orchestrations from the TaskHub until it receives an HTTP request, but I see that the application is picking up all orchestrations from the TaskHub

Proposed Solution
Feature to make sure that an orchestration can only be picked up by one instance of application even when multiple instances are running and using the same taskhub.

dtfx enhancement

All 35 comments

Much needed!

Also, as I have previously discussed on Twitter with @cgillum it would be advantageous to support an active-active scenario where multiple instances of the Durable Function App could operate on the same orchestrations.

This will likely be pushed into our next minor release, as we want to get 2.2.0 out in the next week.

I met with @amdeel today to discuss our options for this in more detail. Here are the notes for the proposal we came up with:

Problems to be solved

I think we can design a feature that covers all the problems mentioned below. These problems are ordered by priority:

  1. Geo-redundancy: Mission critical production apps should always support geo redundancy. If a data center goes down due to some emergency (live-site incident, power failure, natural disaster, etc.) then operators need their applications to "fail-over" to other, healthy regions. This means they have to be able to deploy multiple copies of their app to multiple regions. Right now, this can cause the problem described in (2) below.

  2. Accidental task hub conflicts: The current design of Durable Functions makes it very easy for someone to accidentally run two copies of their app at the same time in the same task hub. This could be two copies of the app running in Azure, or it could be one copy in Azure and another on a developer's laptop. When this happens, the two apps steal messages from each other in a way that is really hard to debug.

  3. Awareness of issues in the Durable Task Framework (DTFx): Today the DTFx logs that we emit are not visible to our users. This means our users have less information to use when troubleshooting issues and means that we have to do more work to find these logs when debugging customer issues. This includes issues like (2) above, which are primarily issues in the DTFx layer.

Proposal

Here is a design proposal for this work-item which can solve the above three problems:

  1. Update the AzureStorageOrchestrationServiceSettings with a new AppName property. By default, the Durable Functions extension will use the name of the function app. If running locally, the name of the local machine (i.e. Environment.MachineName) will be used.

  2. Every task hub has a blob named {hub-name}-leases/taskhub.json, which is created the first time the task hub worker starts. By default, we will now also maintain an "app lease" on this blob.

  3. The lease-id parameter used will be based on the AppName setting mentioned previously. Since Blob Storage requires lease-ids to be formatted as GUIDs, we simply hash the AppName value into a GUID. See here for a good example of how to do this.

  4. The orchestration service will not start unless it is able to acquire the "app lease". AzureStorageOrchestrationService.StartAsync is where the starting happens. Our app lease guard can happen here, before we attempt to start the partition manager. Note that it's not clear what the side effects of blocking StartAsync would be. For example, this could potentially cause the Functions host to fail to start and log a bunch of errors, which is not desirable. If this is a problem, then we may need to refactor some of our existing code to allow StartAsync to complete but still prevent the partition manager from starting until the app lease is acquired.

  5. The lease timeout and renewal intervals are configurable with reasonable defaults: settings.AppLease.LeaseAcquireInterval (5 minute default) controls how frequently we try to acquire the lease, settings.AppLease.LeaseRenewInterval (3 minute default) controls how frequently we try to renew the lease, and settings.AppLease.LeaseInterval (10 minute default) controls how long the lease lasts. By default, these delays are fairly long. However, every worker instance will be constantly trying to acquire or renew the same lease. As the number of worker instances increases, the frequency and volume of lease transactions also increases. We want defaults that have an acceptable failover time without adding too many billed transactions against the storage account.
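To make steps 3 to 5 concrete, here is a minimal Python sketch (not the extension's actual C# implementation). It hashes an app name into a GUID-formatted lease ID and models the app lease in memory. The choice of MD5, the `lease_id_for` and `AppLease` names, and the rule that a holder may re-acquire its own lease are all assumptions made for illustration; only the 10-minute lease duration comes from the proposal above.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Optional


def lease_id_for(app_name: str) -> str:
    """Hash an app name into a GUID-formatted lease ID.

    MD5 is an assumption here; it is convenient because its digest is
    exactly the 16 bytes a GUID needs. It is not used for security.
    """
    digest = hashlib.md5(app_name.encode("utf-8")).digest()
    return str(uuid.UUID(bytes=digest))


@dataclass
class AppLease:
    """In-memory stand-in for the lease on taskhub.json (times in seconds)."""

    lease_interval: float = 600.0  # 10-minute duration from the proposal
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, lease_id: str, now: float) -> bool:
        # The lease is free if nobody has held it or the last one ran out.
        free = self.holder is None or now >= self.expires_at
        # Assumption: workers of the same app share one lease ID, so a
        # matching ID may (re)acquire an active lease.
        if free or self.holder == lease_id:
            self.holder = lease_id
            self.expires_at = now + self.lease_interval
            return True
        return False

    def try_renew(self, lease_id: str, now: float) -> bool:
        # Only the current holder can renew, and only before expiry.
        if self.holder == lease_id and now < self.expires_at:
            self.expires_at = now + self.lease_interval
            return True
        return False
```

In this model, a secondary app that polls `try_acquire` keeps getting `False` while the primary renews on schedule, and only succeeds once the primary has stopped renewing and the lease duration has elapsed.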

Documentation

We'll need to provide documentation and guidance on what customers should expect and how they can properly use this. For example:

  • The app "primary" needs to be deployed and started first, before any "secondary" apps.
  • To fail-over to another app, the current lease-holder app needs to be stopped so that the lease can expire, and the other app can take it.

We can add a description of this behavior and these guidelines to this existing article.

Troubleshooting

There are also potentially things that could go wrong that we need to bring attention to. For example, suppose a customer has a second copy of their app (maybe a local copy) that uses the same task hub as one in production (or the other way around) and therefore doesn't start up. How will the developer/operator know about it?

To account for this, this proposal also recommends doing the work to enable pushing DTFx logs into the Azure Functions ILogger infrastructure. For example, the Azure Functions ILoggerFactory can be provided to the DTFx layer so that a separate log category can be created for DTFx logs. This can potentially be broken out into a separate work item with a more detailed design spec.

Lastly, we should consider having a way for users to turn off this new behavior in case it causes side effects in situations we did not consider.

@cgillum Great to see this being talked about!

Couple of initial thoughts

  1. Given a scenario where App A is primary and App B is secondary, would App A be the only instance that can utilise DurableClient for starting new orchestrations and raising events, or would both A and B be able to use DurableClient _at the same time_? I would anticipate both at the same time, but just wanted to check as this is an important distinction to be clear on.

  2. Am I correct in assuming that the only thing that would need to be configured from a developer perspective is ensuring both Apps A and B share the same DurableFunctionsStorage connection string to enable geo-redundancy? If so, great, simplicity is best here.

  3. Please try not to make host.json play an active part in the configuration of Geo-redundancy, particularly if host.json must diverge between app A and B. The reason is that, from an Infrastructure-As-Code perspective, it's best practice to not have any _messy_ steps in the I-a-C pipeline which transform configuration files etc. It would be incredibly painful to start customising my release pipeline with transformation steps/Tasks to ensure certain values are set in host.json for each App Service.

  4. Are there any considerations that need to be made when rolling out new user-code versions to A and B, particularly when making breaking changes to the user-code? If the apps go out-of-sync, it could be disastrous. In the context of a DevOps release, it might be worth considering making the Azure App Service Deploy task _aware_ of geo-redundancy, so DevOps can provide a good UX to prompt the operator that releases to both App Service Instances must be synchronised, thereby reducing the risk of different versions of code being deployed for long periods of time.

Hey @olitomlinson thanks for these good thoughts. My responses:

  1. I can't immediately think of any harm that could be caused by allowing both App A and App B to use the IDurableClient concurrently. In fact, it might even be beneficial to allow this. The important thing is that only one of them is actually processing orchestrations/entities at any given time.

  2. A developer/operator would need to use the same _storage account_ and _task hub name_ for each region. No new configuration should be needed.

  3. Makes sense. I don't anticipate that we'll need any host.json configuration other than the connection string name or the task hub name, both of which can be specified using app settings. I think that should be good to ensure that no weird transforms are needed for host.json.

  4. I hadn't really thought about this from a versioning perspective. Today, our versioning recommendation is to deploy different versions as new apps with new task hubs to avoid conflicts. I think that advice remains, even across geo-redundant deployments. So you'd basically have version A, Av2, B, and Bv2 which have v1 or v2 task hubs. In that model, I don't think you necessarily need synchronization. Let me know if I'm misunderstanding.

  1. Great. This would certainly help (possibly even fully solve) the multi-region event routing problems that we had to code around. Discussed here https://github.com/Azure/azure-functions-durable-extension/issues/970 . Regarding your comments about having the secondary app _not_ running orchestration code/entity code, I agree. In fact, I think if you can make the secondary app run as normally as possible and only suppress the orchestration/entity code-points, this would be best from a dev ex perspective. Putting myself in the shoes of a user who is first configuring geo-redundancy, I would appreciate as few caveats as possible, with a best outcome of "it works just like a normal Function App instance, but orchestration and entity functions _may_ be suppressed depending on their ability to take a lease at start-up."

  2. / 3. Perfect!

  4. Fair. Sounds like a non-issue.


New requirement

publish an Event Grid event when

  • An App instance _loses_ its app lease
  • An App instance _acquires_ an app lease

It's becoming an increasingly popular concept for Azure services to publish timely management events. They can be really helpful to users who want to perform operations when things happen. I personally would use these events to pump a notification into my team's Slack channel to let us know of a switch of regions, so we know which region is currently primary without having to dig through App Insights etc. If a lease-lost notification came into Slack, I'd probably be a little nervous if a lease-acquired notification didn't come in shortly after, which would warrant me investigating. In reality, I'd probably automate an alert in this situation too, but I wouldn't be able to do it without the hooks (Event Grid events).
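As a rough illustration of the alerting idea, the Python sketch below tracks lease events and flags the case where a lease-lost event is not followed by a lease-acquired event within a grace period. No such events exist at the time of this discussion, so the event type names, the `LeaseEvent` shape, and the `LeaseWatchdog` class are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_LOST = "AppLeaseLost"          # hypothetical event type name
LEASE_ACQUIRED = "AppLeaseAcquired"  # hypothetical event type name


@dataclass
class LeaseEvent:
    event_type: str
    app_name: str
    timestamp: float  # seconds


class LeaseWatchdog:
    """Tracks which app is primary and whether the lease has been orphaned."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self.current_primary: Optional[str] = None
        self.lost_at: Optional[float] = None

    def handle(self, event: LeaseEvent) -> str:
        """Update state and return the text to post to a chat channel."""
        if event.event_type == LEASE_ACQUIRED:
            self.current_primary = event.app_name
            self.lost_at = None
            return f"{event.app_name} is now primary"
        if event.event_type == LEASE_LOST:
            # Ignore stale "lost" events from an app that is no longer primary.
            if self.current_primary in (None, event.app_name):
                self.current_primary = None
                self.lost_at = event.timestamp
            return f"{event.app_name} lost the app lease"
        return f"ignored event type {event.event_type}"

    def should_alert(self, now: float) -> bool:
        """True when nobody has re-acquired the lease within the grace period."""
        return self.lost_at is not None and now - self.lost_at > self.grace_seconds
```

A scheduled check calling `should_alert` with the current time would cover the "lease lost but never re-acquired" case without anyone having to watch the channel.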

FYI Sean Feldman is maintaining an issues repo of azure services that could benefit from publishing specific management and context events.

I like the distinction that the secondary app could theoretically run activity code. For consumption app scenarios, it isn't necessarily crucial if the secondary app is just sitting there doing nothing, but for dedicated customers, if they are paying for the compute, they will want to use it for what it can be used for.

That does lead into the question of how consumption/premium scaling works on the secondary application. We would almost certainly want to put some limits on how much the secondary app can scale until it has the app lease.

Also, thanks for the link to that repository @olitomlinson. I will send the App Service-specific query to our control plane folks to see what it would take to support Event Grid for those scenarios.

I like the distinction that the secondary app could theoretically run activity code.

@ConnorMcMahon Just to remove ambiguity, can I read into this and assume you also think the following...

"I like the distinction that the secondary app could theoretically run activity code and other non-orchestration trigger functions" ?

Permitting everything but OrchestrationTrigger functions would be the best outcome for my use case. I have two isolated copies of my Durable Function App (each with an isolated TaskHub) running in prod, one in UK South and one in UK West. I use Front Door to load balance events across both of those Function Apps. I have to relay events from one region to the other if the orchestration ID is not found in the region that the request arrives in.

If we make the secondary always support all function trigger types (except orchestrationTriggers) I can delete one of my TaskHubs and have both function App instances targeting a single TaskHub, and then completely get rid of the code that relays/routes requests to the alternate region.


With regards to scaling and cost, purely from my perspective, I can't see this as being a problem. As a customer, I expect to get charged based on my plan type. However, I appreciate you will have other kinds of personas to cater for :)

In theory, a customer could have the primary running on Premium and the secondary on Consumption, but this aligns closer to an active-passive failover scenario, whereby the secondary is typically under-provisioned to save cost, as the secondary is not seen as essential to the business, but purely exists as a box-ticking exercise ("Yes, our product has DR"). Of course, customers who do want high throughput and low-latency cold starts from their secondary are free to provision a Premium plan as they wish.


You're welcome :)

@olitomlinson, yes that is how you can read that, with the caveat of also not including entities.

The reason I exclusively called out activities is because those are the only trigger types other than orchestrations and entities that we control from the Durable extension perspective.

The reason I was concerned about scaling is from a cost perspective. If both apps are running on premium, since both scale controllers would be following the same scaling logic (since they are pointed at the same resource), we are effectively giving the customers twice as many workers as we would give them if they are only on one app. When you add the fact that the secondary app is unable to process any orchestration/entity triggers, the customer may be paying for compute that they don't want to pay for.

Consumption is a similar problem, but then we would be the one eating the costs :).

@ConnorMcMahon

Ah yes, got you!

—

I’m sure I’m absolutely over-simplifying here, but when an app service is _secondary_, would it be possible to unregister(?) that thing(?) in the scale controller which would typically be observing the necessary metrics for scaling a DF app, allowing the secondary to scale at a different cadence to the primary?

Not meaning to cause offence if I have trivialised, what I assume, is a very complex domain :)

That is along the approach I was thinking. It just means we have to handle this case in our scaling infrastructure, which is on a relatively slow deployment cadence, so it may not be deployed by the time we support this feature on the extension side.

A few comments:

I like the distinction that the secondary app could theoretically run activity code.

While this could be nice as a way to leverage otherwise idle capacity, I would recommend against this because it might be counter-intuitive for users. There could also be negative side effects, such as accidentally executing code on non-production compute (like someone's laptop, which could violate certain compliance requirements) and could result in important activity function telemetry getting lost. For those reasons, I think we should also prevent activity functions from executing unless this lease is held. If folks want to do geo-distribution, then I think it should be done explicitly using API gateways and HTTP triggers, etc.

That does lead into the question of how consumption/premium scaling works on the secondary application. We would almost certainly want to put some limits on how much the secondary app can scale until it has the app lease.

In this case, the scale controller on the secondary region would vote to add instances because it sees the queue activity from the primary region, but the local VM allocator would deny those requests because it sees that there aren't actually any function executions associated with the app. I therefore think we don't need to have any special handling for this.

publish an Event Grid event when

  • An App instance loses its app lease
  • An App instance acquires an app lease

This is a good idea! We have been encouraged to publish more events into Event Grid, and I can see how these would be really useful. I suggest we create a separate issue to track the Event Grid integration, though.

@cgillum

For those reasons, I think we should also prevent activity functions from executing unless this lease is held.

You make a good point, and I agree that the default behavior should be to not allow activities to run on the secondary, for the important considerations you make.

However, if there are no other circumstances that would complicate or prevent allowing this behavior, can I ask that you make it an advanced configuration setting, possibly in host.json as "allowExternalActivities" : true? I'm pretty confident that, given my entire set-up is in Terraform, I'm highly unlikely to make a mistake and execute activities outside of their intended production environment.

On reflection, I can see that this is very much a nice-to-have in certain use-cases, but largely wouldn't be used by many people, and could cause more trouble than it's worth, so I understand if you don't prioritise this.

I still maintain that allowing the secondary app to function as normal for non-orchestration, non-entity and non-activity triggered functions, would be highly useful to allow usage of the DurableClient.

@cgillum

The lease timeout and renewal intervals are configurable with reasonable defaults: settings.AppLease.LeaseAcquireInterval (5 minute default) controls how frequently we try to acquire the lease, settings.AppLease.LeaseRenewInterval (3 minute default) controls how frequently we try to renew the lease, and settings.AppLease.LeaseInterval (10 minute default) controls how long the lease lasts. By default, these delays are fairly long. However, every worker instance will be constantly trying to acquire or renew the same lease. As the number of worker instances increases, the frequency and volume of lease transactions also increases. We want defaults that have an acceptable failover time without adding too many billed transactions against the storage account.

Would the LeaseInterval be the time at which the current primary doesn't get to renew its lock any further and the lease becomes up for grabs by whichever region makes the next request first? I'm interested in reducing unnecessary switching between the regions. For me personally, I'd be quite happy for the current primary to be given forever to renew, thereby limiting the leader-election to only scenarios where a fault has occurred which has prevented the primary renewing its lease.


This is a good idea! We have been encouraged to publish more events into Event Grid, and I can see how these would be really useful. I suggest we create a separate issue to track the Event Grid integration, though.

Great, as @ConnorMcMahon was saying, if this could become a first class-signal in Azure Monitor, I could build a rule which says if app-service-durable-lease < 1 for greater than X minutes, trigger an alert which would save me the effort of having to build something that manually tracks the two events and then publish my own signal etc.

For me personally, I'd be quite happy for the current primary to be given forever to renew, thereby limiting the leader-election to only scenarios where a fault has occurred which has prevented the primary renewing its lease.

Correct me if I'm misunderstanding, but if we made the LeaseInterval value infinite then it would be impossible to automatically fail over to the secondary (because the primary's lease never expires). As designed, failover will happen only if the primary fails to renew its lease within the 10 minute window. As long as the primary is healthy, the failover will never happen because it renews the lease every 3 minutes.
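To put rough numbers on that window, here is a small Python sketch of the failover timing under the proposal's defaults (10-minute lease, 3-minute renew, 5-minute acquire). The `takeover_delay` function is hypothetical and assumes a simplified model: the lease expires a full LeaseInterval after the primary's last successful renewal, and the secondary only notices on its own fixed polling schedule. The defaults that actually ship may differ.

```python
import math


def takeover_delay(
    since_last_renew: float,
    first_poll_after_failure: float,
    lease_interval: float = 10.0,
    acquire_interval: float = 5.0,
) -> float:
    """Minutes between the primary failing and the secondary taking the lease.

    since_last_renew: minutes since the primary's last successful renewal
        at the moment it fails (between 0 and the renew interval).
    first_poll_after_failure: minutes until the secondary's next acquire
        attempt, measured from the moment of failure.
    """
    remaining = lease_interval - since_last_renew  # lease time still left
    if first_poll_after_failure >= remaining:
        # The lease has already expired by the secondary's next attempt.
        return first_poll_after_failure
    # Otherwise the secondary keeps polling until the lease has run out.
    extra_polls = math.ceil((remaining - first_poll_after_failure) / acquire_interval)
    return first_poll_after_failure + extra_polls * acquire_interval


# Primary fails 3 minutes after renewing; the secondary's polls line up with expiry.
print(takeover_delay(since_last_renew=3.0, first_poll_after_failure=2.0))  # 7.0
# Primary fails right after renewing; the secondary narrowly misses the expiry.
print(takeover_delay(since_last_renew=0.0, first_poll_after_failure=4.5))  # 14.5
```

Under these assumptions the delay ranges from about 7 minutes (lease interval minus renew interval) up to just under 15 minutes (lease interval plus acquire interval), which is the trade-off between failover time and storage transactions described in the proposal.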

Sorry I'm likely misunderstanding the purpose of the LeaseInterval. My assumption was that the LeaseInterval was the maximum duration that an app could renew its lease for, before it became available for anyone to acquire again.

Either way, I'm positive what you've suggested will work just fine. I don't want to take up any more of your time on this so don't worry about explaining it to me, I'll pick it up in due course :)

@cgillum @ConnorMcMahon just curious, but is this still something that is going to be worked on?

@mpaul31 Yes. It's currently in the 2.3.0 milestone.

Hey @amdeel @cgillum

Is this still anticipated for 2.3.0?

It’s becoming harder and harder to convince senior stakeholders that DF is fit for enterprise workloads without geo-redundancy/active-failover operation.

Hey @olitomlinson, yep, I totally understand. We have internal teams within Azure who are also very eager for this for the same reasons. :) It's still part of our 2.3.0 plan and it's under active development.

@cgillum great thanks Chris! This is such good news for us!

Are we still developing along the lines of everything but orchestrationTriggers will work in the secondary at the same time as the primary, and the leasing is just to decide which App runs the orchestrations?

@olitomlinson No, I’m still wanting to block activities as well, per our previous discussion. I have a lot of concerns about allowing two independently managed apps being able to process the same data at the same time.

@cgillum would you at least consider it a configuration option to allow activities in the secondary?

Aside from your previous concerns about having local App Services picking up prod workloads, I don’t understand the concern of allowing multiple apps to consume one data source. Are you in a position to elaborate?

It’s a terribly common pattern to have multiple App Service instances integrating with a single storage account or any other persistence store, for that matter. I have many active-active configurations in place in this manner for other parts of my system.

I'm curious to understand what makes DF any different with regards to Activities such that you couldn’t have them being processed concurrently across two or more App Services.

@cgillum is there currently any documentation for this feature?

@cgillum Thanks! Just to make sure that I've understood, would I simply need to upgrade to 2.3.0 for this to work if I already have two or more apps sharing the same storage account? The linked article doesn't mention whether any configuration needs to be made on the application itself.

Am I right to assume that this change wouldn't affect non-durable triggers? Would the HostId still need to be set to prevent timer triggers from being triggered in both the active and passive instances?

If this change doesn't affect non-durable triggers, are there any plans to implement this across the core Azure Functions for timer and queue triggers etc?

@luthus No configuration changes are necessary to get the updated behavior once you've upgraded to v2.3.0. And you are correct that non-durable triggers are unaffected, and that you'll need to use the HostId trick as well to prevent duplicate timers.

I'm not aware of any plans to implement this across other trigger types. We did this specifically for Durable Functions not only to enable better high-availability support, but also because it was a big source of runtime problems for users when they don't properly isolate their apps. That said, if you think it would be valuable to have some setting for this that works for all function apps, it might be worth making a feature request for this in the Azure-Functions GitHub repo.

@cgillum Thanks for that info :)

With the new lease code, two things I'd like to check:

  1. Am I right in thinking that both functions need their WEBSITE_SITE_NAME set to be identical?

  2. Is there a way to force a function host to drop its lease? For example, if we have a Prod \ DR setup, and we patch Prod, the lease would shift over to DR. If we wanted to shift it back, am I right in thinking we'd currently have to stop the DR instance?

@bendursley,

  1. They do not need the same WEBSITE_SITE_NAME (in fact, I don't think that is possible in our hosted plan). All you need is for both function apps to share a storage account and task hub name, as the app lease lives in the task hub namespace.

  2. I think the only real way to shift it back would be to stop your disaster recovery app.

Thanks @ConnorMcMahon. I assume with 2, that would be a similar issue if we used staging slots as well?

Correct. This applies to any two apps that share a storage account + task hub combination (be that multiple slots for the same app, separate apps in different regions, or even apps in separate clouds).

@ConnorMcMahon \ @cgillum is there anything on the roadmap for Durable functions to allow us to push the app lease to another service?

@bendursley +1

@cgillum There should be a nicer way to migrate the app lease back to the primary, without down time.

This is important for me because I don't want to have to suffer the latency and egress/ingress cost of accessing a TaskHub from a different Azure Region.

Also, if the Primary compute region becomes healthy (the region where TaskHub storage and compute is co-located) then I would like it to fail back to the Primary automatically and not have to perform a manual action myself to make it happen.

@bendursley, I am not sure how we would do that in an extensible way, but would be happy to hear any proposals for that.

@olitomlinson, we were debating about having a more automatic primary/secondary recovery, but it would greatly complicate much of the logic. For instance, how do we determine which app is the primary? If it is via an app setting or a host.json configuration, how do we prevent multiple apps from thinking they are primary?

I think both of these improvements definitely have merit, but they need more discussion to be fleshed out. Let's create a separate issue for each, and I would love to hear your inputs and proposals on how we could make the georedundancy feature easier to use!

There should be a nicer way to migrate the app lease back to the primary, without down time.

IMO the danger of automatically failing back over to the primary is that we can't know if all other application dependencies have been similarly failed over. For example, if the Durable processing fails back over to the primary, what happens if the HTTP load balancer or some other dependency has not yet failed over? I feel that it's safer to let the app owner make the call about when it's safe to fail back over to the primary because they ultimately understand the full end-to-end architecture of the system and its dependencies.

BTW, I agree a new GitHub issue would be a good place to track proposals related to this so that we don't lose track of it. :)
