Off the back of #1128, it would be nice if we could force an application to lose its lease and return to its previous state.
This would allow us to support staging slots / DR better without having to turn off a function instance to ensure it hasn't got a lock.
IMO the danger of automatically failing back over to the primary is that we can't know if all other application dependencies have been similarly failed over. For example, if the Durable processing fails back over to the primary, what happens if the HTTP load balancer or some other dependency has not yet failed over? I feel that it's safer to let the app owner make the call about when it's safe to fail back over to the primary because they ultimately understand the full end-to-end architecture of the system and its dependencies.
@cgillum Not quite sure what you mean by "some other dependency has not yet failed over". Can you provide an example?
In my use case, we are going to use a Traffic Manager in front of both the Primary and Secondary Function Apps in an active-failover configuration, so by default all traffic goes to the Primary as long as its probe gets a satisfactory response from the healthcheck endpoint. If the Primary probe failed (possibly due to a compute outage), HTTP requests would eventually route to the Secondary. The calls to DurableClient (like StartNewAsync/WaitForEventAsync/RaiseEventAsync) would queue up in the TaskHub, and eventually the Secondary would take ownership of the TaskHub and begin orchestration execution.
Now suppose the Primary Function App came back online: the Traffic Manager probe would eventually route requests back to the Primary Function App. But my orchestrations would continue to process on the Secondary indefinitely, incurring latency and additional ingress/egress if the Secondary is out of region.
Ideally, once the Secondary Function App has taken the lease, it should be willing to give up that lease if the Primary were to come along and make a claim for ownership.
This means that the Primary and Secondary would need to be explicitly told which is the Primary and Secondary. An App Setting configuration would be best here as that can be easily varied across environments (unlike host.json) using any kind of Infrastructure-As-Code.
"DURABLE_FUNCTIONS_PRIMARY" : true <- for the primary
"DURABLE_FUNCTIONS_PRIMARY" : false <- for the secondary(ies)
No App Setting configured? No problem, the default behavior wouldn't provide any auto-recovery to the primary and keep the behavior as-is today.
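A minimal sketch of how an app could read that setting, assuming the proposed `DURABLE_FUNCTIONS_PRIMARY` name. The `LeaseRole` enum and `LeaseRoleSetting` class are hypothetical names used only for illustration; nothing in the Durable Functions extension reads this setting today. A missing or unparseable value maps to `Unspecified`, which corresponds to the as-is behavior described above.

```csharp
using System;

// Hypothetical tri-state: true => Primary, false => Secondary, missing/invalid => Unspecified.
public enum LeaseRole { Unspecified, Primary, Secondary }

public static class LeaseRoleSetting
{
    // Setting name taken from the proposal in this thread; an assumption, not an existing setting.
    public const string SettingName = "DURABLE_FUNCTIONS_PRIMARY";

    // Environment variables are strings, so parse rather than cast.
    public static LeaseRole Parse(string rawValue)
    {
        if (bool.TryParse(rawValue, out var isPrimary))
        {
            return isPrimary ? LeaseRole.Primary : LeaseRole.Secondary;
        }

        // Null, empty, or anything that is not "true"/"false": keep today's behavior.
        return LeaseRole.Unspecified;
    }

    public static LeaseRole FromEnvironment() =>
        Parse(Environment.GetEnvironmentVariable(SettingName));
}
```

Keeping `Unspecified` distinct from `Secondary` means an app that never sets the value is not treated as a secondary that must yield.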
Alternatively, if you are suggesting there needs to be some form of signal to the primary to attempt to take back control, you could provide an API on DurableClient which can be consumed in the user's healthcheck endpoint.
[FunctionName("healthcheck")]
public async Task<IActionResult> RunAsync(
    [HttpTrigger(AuthorizationLevel.Function, "get", "post")] HttpRequestMessage req,
    [DurableClient] IDurableOrchestrationClient client)
{
    // 1. do other dependency health checks first
    ...
    // 2. finally try to reclaim lease
    // Environment variables are strings, so parse rather than cast; a missing setting yields false.
    bool.TryParse(Environment.GetEnvironmentVariable("DURABLE_FUNCTIONS_PRIMARY"), out var isPrimary);
    var appId = // don't know how you get this;
    if (isPrimary) // we're in the primary environment because of app setting "DURABLE_FUNCTIONS_PRIMARY" : true
    {
        // HasAppLease and AllowAppToStealLeaseAtNextOpportunity are the proposed APIs; they don't exist today.
        if (!client.HasAppLease(appId))
        {
            // we don't have the lease, but this app is healthy and ready to serve traffic once we get the lease.
            client.AllowAppToStealLeaseAtNextOpportunity(appId);
            // we've asked for the lease back, but we don't have it yet. We might have it at the next health check interval.
            return new BadRequestObjectResult("Waiting to reclaim the lease"); // unhealthy
        }
        else
            return new OkResult(); // yay, we've got the lease back on the primary, go healthy
    }
    // not the primary: nothing to reclaim, so report the result of the checks in step 1
    return new OkResult();
}
I quite like the idea of the app setting option as this also covers apps that aren't using Traffic Manager and HTTP triggers to start the orchestrations, for example, timer or blob triggers.
@cgillum any thoughts on my comment?
@olitomlinson I haven't had a chance to think about it too deeply yet, but between the two options I'm likely to lean towards an API-based solution rather than an app settings-based solution.
@cgillum I think an API-based solution would make sense, as it means we can initiate the swap when we are ready to.
I think you need an App Setting/environment variable anyway, otherwise how can each app instance identify whether or not it is the desired primary?
@cgillum
After some testing and pondering, I really need a way to bias the lease towards a desired primary environment.
This is because the rest of my system is running in UK South, so I want to keep the orchestrations running in South wherever possible (to minimise cross-regional costs and latency), unless of course South becomes unavailable, in which case UK West can steal the lease while South is down.
But when South becomes available again, and is in a position to start attempting to claim the lease, then West should give it up at the first opportunity.
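To make the rule concrete, here is a rough sketch of the decision each app would make, written as pure functions. `LeaseBias` and every parameter name are hypothetical, and how the primary signals that it wants the lease back (an app setting, an API call, a marker in storage) is deliberately left open; this is not how the extension behaves today.

```csharp
public static class LeaseBias
{
    // A secondary that holds the lease gives it up as soon as a healthy primary asks for it.
    // The primary never yields to a secondary just because the secondary asks.
    public static bool ShouldReleaseLease(
        bool thisAppIsPrimary,
        bool thisAppHoldsLease,
        bool primaryIsRequestingLease)
    {
        return thisAppHoldsLease && !thisAppIsPrimary && primaryIsRequestingLease;
    }

    // The primary always tries to get the lease when it does not hold it.
    // A secondary only tries when the primary is not claiming it (for example, the primary is down).
    public static bool ShouldTryAcquireLease(
        bool thisAppIsPrimary,
        bool thisAppHoldsLease,
        bool primaryIsRequestingLease)
    {
        if (thisAppHoldsLease)
        {
            return false;
        }

        return thisAppIsPrimary || !primaryIsRequestingLease;
    }
}
```

With South as the primary: while South is down, West sees no claim and acquires the lease; once South is back and claiming, `ShouldReleaseLease` becomes true for West, and West stops competing for the lease afterwards.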
I'd also like to reiterate that I don't always use traffic managers, as some of my durable Apps are mostly queue-driven back-end services, so any mechanism for switching the lease between regions should ideally be decoupled from traffic manager probing.
Thanks 😊