Azure-sdk-for-js: @azure/identity does not work with AAD Pod Identity on first request

Created on 28 Jan 2021 · 12 Comments · Source: Azure/azure-sdk-for-js

  • @azure/identity
  • 1.2.2
  • Azure Kubernetes Service (AKS)
  • [ x] Node.js

    • 14.15.0

Describe the bug
We have a series of Node.js microservices running in AKS. To avoid credential storage in applications, we are using AAD Pod Identity within our cluster for connecting to Azure resources such as Azure Postgres Server and Azure Service Bus.

The first request for a token always fails: AAD Pod Identity takes too long to make the identity available, and the Identity library has no way to wait for it, so the application throws an error. On the second request it works fine.

This is problematic because the first attempt after every deployment always fails.

To Reproduce
Steps to reproduce the behavior:

Taking Service Bus as an example if we do the following simplified version of our code:

const credentials = new DefaultAzureCredential() // also tried going straight for ManagedIdentityCredential() too
const client = new ServiceBusClient(myServiceBusInstance, credentials)
const sender = client.createSender(myQueue)
await sender.sendMessage(myMessage)

The last line will throw the following error on the first attempt, but will work second attempt.

Error: EnvironmentCredential is unavailable. Environment variables are not fully configured.
Error: ManagedIdentityCredential - No MSI credential available
Error: Azure CLI could not be found.  Please visit https://aka.ms/azure-cli for installation instructions and then, once installed, authenticate to your Azure account using 'az login'.
Error: Visual Studio Code credential requires the optional dependency 'keytar' to work correctly
    at DefaultAzureCredential.<anonymous> (/home/node/node_modules/@azure/identity/dist/index.js:285:29)
    at Generator.throw (<anonymous>)
    at rejected (/home/node/node_modules/@azure/identity/node_modules/tslib/tslib.js:115:69) {
  errors: [
    CredentialUnavailable [Error]: EnvironmentCredential is unavailable. Environment variables are not fully configured.
        at EnvironmentCredential.<anonymous> (/home/node/node_modules/@azure/identity/dist/index.js:896:27)
        at Generator.next (<anonymous>)
        at /home/node/node_modules/@azure/identity/node_modules/tslib/tslib.js:117:75
        at new Promise (<anonymous>)
        at Object.__awaiter (/home/node/node_modules/@azure/identity/node_modules/tslib/tslib.js:113:16)
        at EnvironmentCredential.getToken (/home/node/node_modules/@azure/identity/dist/index.js:862:22)
        at DefaultAzureCredential.<anonymous> (/home/node/node_modules/@azure/identity/dist/index.js:272:52)
        at Generator.next (<anonymous>)
        at /home/node/node_modules/@azure/identity/node_modules/tslib/tslib.js:117:75
        at new Promise (<anonymous>),
    CredentialUnavailable [Error]: ManagedIdentityCredential - No MSI credential available
        at ManagedIdentityCredential.<anonymous> (/home/node/node_modules/@azure/identity/dist/index.js:1221:19)
        at Generator.next (<anonymous>)
        at fulfilled (/home/node/node_modules/@azure/identity/node_modules/tslib/tslib.js:114:62)
        at processTicksAndRejections (internal/process/task_queues.js:93:5),
    CredentialUnavailable [Error]: Azure CLI could not be found.  Please visit https://aka.ms/azure-cli for installation instructions and then, once installed, authenticate to your Azure account using 'az login'.
        at /home/node/node_modules/@azure/identity/dist/index.js:1403:43,
    CredentialUnavailable [Error]: Visual Studio Code credential requires the optional dependency 'keytar' to work correctly
        at VisualStudioCodeCredential.<anonymous> (/home/node/node_modules/@azure/identity/dist/index.js:1604:23)
        at Generator.next (<anonymous>)
        at fulfilled (/home/node/node_modules/@azure/identity/node_modules/tslib/tslib.js:114:62)
  ]
}

Expected behavior
I'd expect the client should wait for the credential or should have the option to retry if unavailable.

Azure.Identity Client Service Bus customer-reported needs-team-attention question

All 12 comments

@johnwatson484

Hello, John! Thank you for submitting this issue.

I would like to understand this more. When you say the second time it works fine, how are you triggering this second request?

Our Identity package doesn't yet consider the possibility of credentials becoming available eventually. I'll bring this issue to my team to see what other ideas they have. In the meantime, would it be possible to wait on your side until the environment is ready for authentication?

I'll keep an eye on this issue to answer you as soon as possible, and in any case I will be back with more information after I sync up with my team.

@johnwatson484

I'm back with good news!

Turns out that we've encountered a similar issue before. What we do to solve this is to wait for an init container to authenticate before launching the application. Here's an example:

      initContainers:
      - name: wait-for-imds  # this container exits successfully when the IMDS endpoint returns 200 when asked for a
        image: busybox:1.31  # Key Vault token, guaranteeing IMDS is configured and ready before the test runs
        command: ['sh', '-c', 'wget "http://169.254.169.254/metadata/identity/oauth2/token?resource=https://vault.azure.net&api-version=2018-02-01" --header "Metadata: true" -S --spider -T 6']

How does that sound? If that works, please let us know.

(@chlowell was key to discover how we managed to solve this issue during our tests)
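
For services where adding an init container is not convenient, a similar readiness gate could be run in Node.js before the application starts. The following is only a sketch under stated assumptions: the helper names probeImds and waitUntilReady, the attempt count, and the delay are all hypothetical, and the URL is simply the one from the init container above. It has not been validated against AAD Pod Identity.

```javascript
const http = require('http')

// Same URL the init container above requests with wget.
const IMDS_TOKEN_URL = 'http://169.254.169.254/metadata/identity/oauth2/token' +
  '?resource=https://vault.azure.net&api-version=2018-02-01'

// Resolves true if the endpoint answers 200 within timeoutMs, false otherwise.
function probeImds (url = IMDS_TOKEN_URL, timeoutMs = 6000) {
  return new Promise(resolve => {
    const req = http.get(url, { headers: { Metadata: 'true' }, timeout: timeoutMs }, res => {
      res.resume() // discard the body; only the status code matters here
      resolve(res.statusCode === 200)
    })
    req.on('timeout', () => { req.destroy(); resolve(false) })
    req.on('error', () => resolve(false))
  })
}

// Calls probe() until it returns true or the attempts run out.
// (attempts and delayMs are illustrative defaults, not recommended values.)
async function waitUntilReady (probe, attempts = 10, delayMs = 2000) {
  for (let i = 0; i < attempts; i++) {
    if (await probe()) return true
    if (i < attempts - 1) await new Promise(resolve => setTimeout(resolve, delayMs))
  }
  return false
}
```

An application might call `await waitUntilReady(() => probeImds())` once at startup and refuse to start if it returns false.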

Hi @sadasant, thanks for taking the time to look at this for us.

I'll answer your original question first. The second request is just running the same code above again. In our exact context it's a Node.js web application that submits user data using that code to an Azure Service Bus queue.

Thanks for sharing the initContainer approach. What we ended up doing as a workaround is to wrap the above code in a retry pattern similar to the simplified version below.

const credentials = new DefaultAzureCredential()
const client = new ServiceBusClient(myServiceBusInstance, credentials)
const sender = client.createSender(myQueue)

// Sends via the shared sender; wrapped in a function so retry() can call it again
async function send (message, options) {
  await sender.sendMessages(message, options)
}

await retry(() => send(message, options), retries, retryWaitInMs, exponentialRetry)

async function retry (fn, retriesLeft = 5, interval = 1000, exponential = false) {
  try {
    const val = await fn()
    return val
  } catch (err) {
    if (retriesLeft) {
      await new Promise(resolve => setTimeout(resolve, interval))
      return retry(fn, retriesLeft - 1, exponential ? interval * 2 : interval, exponential)
    } else {
      throw err
    }
  }
}

With the approach above, if the first attempt fails it works on the retry, so we're not risking the stability of our microservices.

But ideally the SDKs should handle this internally so users don't need to apply this pattern on every sending service.
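
One possible refinement of the retry helper is to retry only errors that look like the credential failure, so unrelated send errors surface immediately. This is a sketch, not the code used in production: retryIf and looksLikeCredentialFailure are hypothetical names, and the predicate assumes only what the log in the issue description shows, namely that the thrown error carries an `errors` array.

```javascript
// Variant of the retry helper that retries only when shouldRetry(err) is true.
async function retryIf (fn, shouldRetry, retriesLeft = 5, interval = 1000, exponential = false) {
  try {
    return await fn()
  } catch (err) {
    // Give up when retries are exhausted or the error is not one we want to retry.
    if (retriesLeft <= 0 || !shouldRetry(err)) throw err
    await new Promise(resolve => setTimeout(resolve, interval))
    return retryIf(fn, shouldRetry, retriesLeft - 1, exponential ? interval * 2 : interval, exponential)
  }
}

// Hypothetical predicate: treats any error carrying an `errors` array (the shape
// of the aggregate error logged earlier in this issue) as a credential failure.
const looksLikeCredentialFailure = err => Array.isArray(err && err.errors)
```

It would be called as `await retryIf(() => send(message, options), looksLikeCredentialFailure)`.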

I think this problem is caused by a combination of the @azure/service-bus and @azure/identity SDKs in a cluster using AAD Pod Identity.

One of our team has spent some time investigating where the issue lies in this combination, and I think it's worth sharing his findings here to help consider improvements that could be made to the SDKs to avoid the need for workarounds.

The following is all thanks to @paulsimonandrews so I take no credit for the hard work!

#

Taking this simple script as an example.

const { DefaultAzureCredential } = require('@azure/identity')
const { ServiceBusClient } = require("@azure/service-bus")
async function example () {
  const cred = new DefaultAzureCredential()
  const serviceBusClient = new ServiceBusClient('sndffcinfsb1001.servicebus.windows.net', cred)
  try {
    console.log('Asking for token ...')
    const token = await cred.getToken('https://servicebus.azure.net/.default')
    console.log(token)
  } catch (e) {
    console.log('Failed to get token')
    console.log(e)
  }
  process.exit()
}
example()

As we know this will fail. Interestingly if you run it repeatedly it will fail each time, which I didn’t expect. I expected that it would work second time. It doesn’t, but more on that later.

The problem seems to be related to this line -> https://github.com/Azure/azure-sdk-for-js/blob/a697af0cde28f97d75021bbc908d49de08ce12a4/sdk/identity/identity/src/credentials/managedIdentityCredential/imdsMsi.ts#L78

As mentioned above, identity exposes the timeout via configuration you can pass to the getToken function. This isn’t really documented, so I looked at the code and discovered that you can change the getToken call above to:

const token = await cred.getToken('https://servicebus.azure.net/.default', {requestOptions: {timeout: 1000}})

This sets the timeout to 1000 ms rather than the default of 500 ms, and it successfully gets a token the first time.

The documentation for the @azure/identity getToken function implies that options such as this should be set by the consuming library, in this case @azure/service-bus.

I looked at the Service Bus code, which calls getToken with the resource (https://servicebus.azure.net/.default) but passes no options object, so it will always default to the timeout of 500 ms. If Service Bus allowed you to set that config, it would help.

I found the call to getToken in the Service Bus code in my node_modules, inserted the options object there, and lo and behold, it worked.
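
An alternative to patching node_modules might be a thin wrapper that supplies the timeout whenever the caller passes none. This is a minimal sketch under assumptions: withDefaultTimeout is a hypothetical helper, it relies only on the requestOptions.timeout option used in the getToken call earlier in this comment, and it has not been tested against @azure/service-bus.

```javascript
// Wraps any credential-like object (anything with getToken(scopes, options))
// and applies a default requestOptions.timeout when the caller passes none.
function withDefaultTimeout (credential, timeoutMs) {
  return {
    getToken (scopes, options = {}) {
      // A timeout supplied by the caller overrides the default.
      const requestOptions = { timeout: timeoutMs, ...options.requestOptions }
      return credential.getToken(scopes, { ...options, requestOptions })
    }
  }
}
```

The wrapped object could then be passed wherever the original credential was used, for example `new ServiceBusClient(myServiceBusInstance, withDefaultTimeout(new DefaultAzureCredential(), 1000))`.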

Now, there is a counterargument to this being an issue entirely in the Service Bus SDK.

If you look at the link above for where the timeout is applied in the @azure/identity code, I think it is just pinging the endpoint to see if there is a response within a certain amount of time and determining availability from that. This happens in an isAvailable function. If that fails, the identity library doesn’t even try to request an access token via its getToken function.

My assumption is that if you bypass the isAvailable call it would all just work. For some reason, on the AKS cluster the first call to isAvailable must fail, but a second successive call to isAvailable succeeds. It then tries getToken, which also succeeds (via a Promise). It’s a tricky one: without the isAvailable check, if the Managed Identity isn’t available, the Promise in getToken may never resolve.

So there is something about the way aad-pod-identity works that makes the isAvailable call slower than it probably is in a “normal” usage of managed identity. So you could argue that the isAvailable logic (with its short default timeout) doesn’t work “out of the box” with aad-pod-identity.
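
To make the mechanism being discussed concrete, here is a toy model of a timeout-bounded availability probe whose success is remembered. It is not the library's implementation: makeProbedCredential, withTimeout, ping and fetchToken are all hypothetical names, and the only behavior borrowed from this thread is the 500 ms default and the idea that a successful probe is not repeated.

```javascript
// Rejects if the given promise does not settle within ms milliseconds.
function withTimeout (promise, ms) {
  let timer
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// Toy credential: ping() stands in for the availability check,
// fetchToken() for the real token request.
function makeProbedCredential (ping, fetchToken, pingTimeoutMs = 500) {
  let available = false // a successful probe is remembered; a failed one is not
  return {
    async getToken (scopes) {
      if (!available) {
        try {
          await withTimeout(ping(), pingTimeoutMs)
          available = true
        } catch (err) {
          // The token request is never attempted when the probe fails.
          throw new Error('credential unavailable: ' + err.message)
        }
      }
      return fetchToken(scopes)
    }
  }
}
```

In this model, an endpoint that answers the first ping slower than pingTimeoutMs makes the first getToken fail even though a token request would have succeeded, which is the symptom described above.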

I’d argue the default timeout in @azure/identity should be longer, and service-bus (and any other consuming lib) should expose the configuration that identity provides.

I mentioned above that if I run the script laid out above twice in quick succession, both will fail to get a token. This surprised me. If you update the script to have two calls to getToken in the same script, the first call fails, the second succeeds. Run the script again straight after, and the first call fails and the second succeeds:

const { DefaultAzureCredential } = require('@azure/identity')
async function example () {
  const cred = new DefaultAzureCredential()
  try {
    console.log('Asking for token ...')
    const token = await cred.getToken('https://servicebus.azure.net/.default')
    console.log(token)
  } catch (e) {
    console.log('Failed to get token')
    console.log(e)
    console.log('Asking for token again ...')
    const token = await cred.getToken('https://servicebus.azure.net/.default')
    console.log(token)
  }
  process.exit()
}
example()

I even wrapped the second call in a setTimeout and waited for a couple of minutes, and it still succeeds. I then tried making a second call for a token using a new instance of DefaultAzureCredential (const cred2 = new DefaultAzureCredential()), and that succeeds too.

So, my conclusion is that there is some state in the library that is changing after the first call to isAvailable to make the second call work, even if it is minutes later.

It can’t just be that the managed identity isn’t associated to the pod in time.

So a separate init container that checked for the existence of the managed identity wouldn’t help here (I assume).

I am now slightly confused, but it is fair to say that the first call to get an identity token will always fail when the microservice comes up, but subsequent calls will succeed. My main question now is: are there any circumstances under which isAvailable would run again and potentially fail? (I'm not sure if I mentioned it above, but once isAvailable succeeds, that information is cached and subsequent calls to getToken seem not to run isAvailable. Is that always the case?)

#

I hope this is helpful in understanding the issue and how it may be solved so others don't have to overcome these issues.

@johnwatson484 , @paulsimonandrews thank you both for the great feedback!

There's a lot to unpack here, so give me a bit of time to write something back. This has been already extremely useful feedback in any case! Thank you for the time being. I'll provide a more extensive answer in a bit.

Here's what I can gather from what we've discussed so far:

[...] identity exposes the timeout via configuration you can pass to getToken function. Obviously this isn’t really documented [...]

Fun fact! Although we don't generally recommend calling getToken directly, this is a completely valid use case. Your code is indeed a good solution, I believe!

[...] the service bus code, which calls getToken with the resource (https://servicebus.azure.net/.default) but pass no options object, hence it will always default to the timeout of 500. If service bus allowed you set that config it would help.

For the service-bus updates, I've synced with @HarshaNalluru and he already mentioned several great ideas that can improve this experience. He mentioned he'll chime in next week.

My assumption is that if you by-pass the isAvailable call it would all just work. [...]

We set isAvailable to acknowledge the availability of the credential only the first time, since we try to avoid having to do this discovery every time a new authentication is necessary. This is by design, but I'm sending your feedback to the team to see how we can improve the experience.

I’d argue the default timeout in @azure/identity should be longer [...]

This is more great feedback! I'm passing this through to my team. I'll report a summary of their responses when I get them.


While we process your feedback, please consider both that the approach you've found to work is valid, and that the initContainers approach is available in case you'd prefer it.

Once again, thank you for your feedback! We'll reply back once we have more information from our team.

I've been getting answers from my team regarding implementing some form of retry mechanism. The issue there is that we have no way of knowing when the resource will be available. This uncertainty might result in a worse experience since some users might keep waiting for something to happen when their authentication resource might never respond.

Regarding default timeouts, other languages have even shorter timeouts than we do. The concern here is that increasing the timeout will slow down all of our attempts to confirm or deny that an authentication method is available, and DefaultAzureCredential is arguably already bulky enough.

I believe the good call here is to ensure service-bus passes this timeout through, so it can be configured.

We will get more information from the service-bus team next week.

I've been getting this information from: @chlowell .
Charles, I'm tagging you in case you want to add or present the information in a different way.

I'm also tagging @schaabs in case he wants to chime in.

Thanks for the detailed comments @johnwatson484.
I'll look into the parts that need to be added to service-bus.

@johnwatson484 , @paulsimonandrews

Another important point, the isAvailable cache happens once per instance of the ManagedIdentityCredential, so you should be able to move the credential creation into the scope of the method where you call sendMessage, so that isAvailable is fresh each time. I'd be curious if that happens to help too. It would be even more interesting if that wouldn't help. Up to you both to try.
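
The suggestion above could look roughly like the sketch below. The function name sendWithFreshCredential is hypothetical, and the credential and client are created through injected factories so the sketch runs without the SDKs; in an application the factories might be `() => new DefaultAzureCredential()` and `credential => new ServiceBusClient(myServiceBusInstance, credential)`. Whether this actually helps with AAD Pod Identity is the open question posed in the comment.

```javascript
// Creates a new credential (and client) for every send, so no availability
// result cached on an earlier credential instance is reused.
async function sendWithFreshCredential (createCredential, createClient, queueName, message) {
  const client = createClient(createCredential())
  try {
    await client.createSender(queueName).sendMessages(message)
  } finally {
    // Close the per-call client so connections do not accumulate.
    await client.close()
  }
}
```

Creating a client per message trades connection reuse for a fresh credential, so this is more of a diagnostic experiment than a pattern for high-throughput senders.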

In order to pass the requestOptions through service-bus to getToken, I started this PR https://github.com/Azure/azure-sdk-for-js/pull/13539, and that led to more discussions with the team.
We eventually landed on the model where the credential constructor will be adding the new requestOptions to the existing options bag, through which the timeout at getToken for that credential can be configured.

More info here https://github.com/Azure/azure-sdk-for-js/issues/13940, @sadasant will be working on @azure/identity for this.

Thank you @HarshaNalluru

@johnwatson484 The solution we recommend is what we shared above, to add this to your architecture (where we use it):

      initContainers:
      - name: wait-for-imds  # this container exits successfully when the IMDS endpoint returns 200 when asked for a
        image: busybox:1.31  # Key Vault token, guaranteeing IMDS is configured and ready before the test runs
        command: ['sh', '-c', 'wget "http://169.254.169.254/metadata/identity/oauth2/token?resource=https://vault.azure.net&api-version=2018-02-01" --header "Metadata: true" -S --spider -T 6']

Adding retrying mechanisms in our libraries for authentication is something we're trying to avoid. We will be coordinating with our team to document this recommendation more visibly to our users in general. I've made an issue to follow up on the documentation side: https://github.com/Azure/azure-sdk-for-js/issues/13948

Once again, thank you for your time making this issue. Please let us know if we can help with anything else! Take care.
