Microsoft-authentication-library-for-dotnet: Support regional ESTS (ests-r)

Created on 25 Mar 2021  路  12Comments  路  Source: AzureAD/microsoft-authentication-library-for-dotnet

Problem

AAD is adding support for regional STS (ESTSR). Currently only S2S (client_credentials) flow and available via opt-in (1st party only for now). Although a DNS-only solution is preferred, this is not practical. The client SDK needs to help route the traffic to the regional authority.

Goals

  1. Provide an API that allows traffic to be routed to a regional authority (P0)
  2. Help developers detect the region (P1)
  3. Existing telemetry needs to work. We must detect if user region != detected region and report it via telemetry.

Proposal

// if app knows the region, use it as MSAL cannot reliably detect it
string region = Config.Region ?? ConfidentialClientApplication.AttemptRegionDiscovery;

var cca = ConfidentialClientApplicationBuilder(client_id)
                  .WithAuthority(AzureCloud.PublicCloud)
                  .WithRegion(region)  
                  .Build();

Behavior

  • if region = null, ignore
  • if cloud is PUBLIC cloud, use {region}.microsoft.com
  • otherwise, use {region}.environment
enhancement ConfidentialClient WebApiCallsWebApi WebAppCallingWebApis

All 12 comments

What about, for the detection:

string region = await ConfidentialClientApplication.AutoDetectRegion();

To be able to send telemetry on if we auto detect correctly or not, the region returned by the detector could even contain a character (an exclamation mark at the end?), to signify that the region was auto detected. And of course, .WithRegion would remember it was auto-detected, and trim the extra character and use the region

  1. For the detection, do you mean a static method on ConfidentialClientApplication or an instance method? We'd probably need to pass in a logger as well, so it'll be more like:
string region = await ConfidentialClientApplication.AutoDetectRegion(LogCallback);
  1. For the telemetry, I like your idea, but I think we should go all in on it:
  • if env variable detection, return "uswest1_msal_autodetect_variable"
  • if imds detection succesfull, return "uswest1_msal_autodetect_imds"

The alternative would be to have a static variable that captures this.

for the cca options, something like public string? AzureRegion {get; set;} would work i think. would be nice to have the default be the AutoDetectRegion.

For the detection, what if this comes to pca? do we want something that is more generalized? if not, i think the above looks good.

@jennyf19 - yes, we discussed about adding it to public client app but since it's unlikely to happen anytime soon (i.e. years), decided not to polute the api and keep in at confidential level only. It's easy to add to PCA later.

Thanks for looking at the API @jennyf19, I agree on the Options part.

After I wrote out thoughts below, I guess I'm leaning to a solution that's similar to what we have now. I guess I'm not a huge fan of having to call TryGetRegionAsync before creating PCA. I'd rather have it simply encapsulated inside WithRegion.


1 Doesn't using WithRegion already imply autodetection? (So, AutoDetectRegion is unnecessary?)

  • If WithRegion() (null / empty string region parameter) is used, we detect region, and use it.

    • If region cannot be autodetected, we log error and fall back to global.

  • If WithRegion(userRegion) is specified, we still detect region; send both regions in telemetry, and use the user-provided region for requests.

    • If region cannot be autodetected, we use the user-specified one.

2 In the simple scenario code snippet, is calling a RegionDetector.TryGetRegionAsync a prerequisite to calling WithRegion(region)?

  • I don't think it should be. Users should just call WithRegion() and it will be auto detected for them.
  • I think RegionDetector.TryGetRegionAsync should still be public, so that end users can get the region detected by MSAL for their own purposes.
  • WithRegion will internally call RegionDetector.TryGetRegionAsync which will save the detected region into some static variable like we do now.

3 Provide a mechanism to fallback to the non-regional authority in case the regional one is down (P2)
We would never know if regional one is down, right? Gateway will just forward the request to global and not tell us.

Thanks @pmaytak for the feedback.

After playing around with the idea for a while, I realized that it's very inconvenient to introduce new top-level options in MSAL, because those objects need logging, HTTP stack etc. all of which are configurable in our Application objects. So having a top-level option would be like:

RegionDiscovery.GetRegionAsynz(LogCallback logger, IHttpClientFactory httpClientFactory);

This is too complicated, so went back to WithRegion(region) and WithRegion("auto")

From an outsiders perspective how does failover work w.r.t. a regional outage?

@henrik-me is probably the best to answer this.

From an outsiders perspective how does failover work w.r.t. a regional outage?

@jabbera : Not sure exactly what you mean. Can you please provide more context to the question?

Region means the traffic will stay in the region, thus impacts from other regions or even global will not impact a service running in a another region.

@henrik-me the scenario described assumes that Azure AD and the region will always fail together. Is it possible for Azure AD in EastUS2 (EU2) to have an outage but the region itself would still be up? In that case don't want to be glued to only EU2 for my azure ad auth. I don't know how the Azure AD network is designed, but if {region}.microsoft.com had a GTM with primary\hot standby (in central or something) in front of it my hypothetical scenario may not be an issue.

@jabbera, thanks for the added details to your question. Networking issues including DNS issues are not covered with this change, for DNS and network handling we allow you to pass in a HttpClientFactory where you can do what is right for your service to mitigate those issues (e.g. small outages to DNS would typically not be felt as DNS is cached on the node and only refreshed at certain intervals, you can decide which interval would be the right for your scenario).

Two other perspectives that regional and MSAL in general helps with for the scenario you are describing is:

  1. MSAL has tokens in the cache and depending on the outage period and the lifetime of the token there will be less of a need to call AAD for a token, and thus mitigates if say just AAD in the region is down.
  2. AAD regional also have lots of failure modes covered, including a gateway in front of it which can re-direct traffic as needed.

Hope this helped bring some perspectives to the question you raised. Please also feel free to share what you believe should happen and how it can be done.

Was this page helpful?
0 / 5 - 0 ratings