Azure-cli: az commands can trigger 429 / "too many requests" failures and provide no recourse for recovery.

Created on 17 Jan 2020 · 16 comments · Source: Azure/azure-cli

Describe the bug

Running az commands can generate 429 "too many requests" exceptions from Azure (possibly related to az aks? or possibly all commands -- I've definitely seen this at random from Azure before). It seems this happens with long running commands after they have already executed and az is polling for a result from Azure.

Ideally, when this happens, az should just back off exponentially (i.e. increase the wait and try again). (Sometimes the 429 response even includes a Retry-After header that tells you exactly how long to wait!)

IMO, the REAL issue is that you get back a failure message and the command aborts with no results -- even if the operation was successful (e.g. you can't even just rerun the command at that point). Basically, the command shouldn't report a permanent error unless it has actually failed. If the operation is still running and might succeed, and you merely failed to poll for the result, you should back off and retry.
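
For illustration, here is a rough sketch (not actual az/msrestazure code) of the kind of polling loop I'd expect: honor Retry-After when present, otherwise back off exponentially. The function name, attempt count, and delay caps are made up for the example:

    import random
    import time

    import requests

    def poll_with_backoff(status_url, max_attempts=8, base_delay=2.0, max_delay=120.0):
        """Poll an operation-status URL, retrying 429s instead of failing outright."""
        session = requests.Session()
        for attempt in range(max_attempts):
            response = session.get(status_url)
            if response.status_code != 429:
                return response  # success, or a genuine (non-throttling) failure
            # Prefer the server's hint when present (delta-seconds form assumed);
            # otherwise fall back to exponential back-off with a little jitter.
            retry_after = response.headers.get("Retry-After")
            if retry_after is not None:
                delay = float(retry_after)
            else:
                delay = min(max_delay, base_delay * (2 ** attempt)) + random.uniform(0, 1)
            time.sleep(delay)
        raise RuntimeError("Still throttled after {} attempts".format(max_attempts))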

Command Name
az aks nodepool add --resource-group MyResourceGroup --cluster-name MyClusterName --os-type Windows --node-vm-size "Standard_B2s" --name window --node-count 2 --kubernetes-version 1.13.12 --min-count 2 --max-count 6 --enable-cluster-autoscaler

Errors:

WARNING: The behavior of this command has been altered by the following extension: aks-preview
ERROR: Deployment failed. Correlation ID: de22582b-9a0c-462b-b15a-7fd3d85d07e2. VMSSAgentPoolReconciler retry failed: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed"
Message="The server rejected the request because too many requests have been received for this subscription."
Details=[{"code":"TooManyRequests","message":"{\"operationGroup\":\"HighCostGetVMScaleSet30Min\",\"startTime\":\"2020-01-17T17:29:36.1768987+00:00\",\"endTime\":\"2020-01-17T17:44:36.1768987+00:00\",\"allowedRequestCount\":1329,\"measuredRequestCount\":1419}","target":"HighCostGetVMScaleSet30Min"}]
InnerError={"internalErrorCode":"TooManyRequestsReceived"}

To Reproduce:

Steps to reproduce the behavior.

  • Run a long-running command that continually polls Azure for a result while your subscription is under heavy load (possibly from other such commands running in parallel?), until an HTTP response with a status of 429 ("Too many requests") is returned by the Azure API being called.

Expected Behavior

  • Az.exe shouldn't fail when the initial command turns out to be successful -- because it leaves the user in an unrecoverable state (e.g. the initial command appears to have failed, there are no output results, and re-running the command also fails, because now the resource exists! -- so not only do you fail to handle the 429 yourself, you prevent the user from handling it too!).

  • Specifically, calls to Azure made by Az.exe that return a 429 status should have transient-fault handling baked in -- as specified by the MSDN best practices for cloud applications: "All applications that communicate with remote services and resources must be sensitive to transient faults."

Environment Summary

Windows-10-10.0.18362-SP0
Python 3.6.6
Shell: cmd.exe

azure-cli 2.0.80

Extensions:
aks-preview 0.4.27
application-insights 0.1.1

Additional Context


Labels: AKS, Core, Service Attention, customer-reported

Most helpful comment

FYI -- for anyone experiencing this issue: https://github.com/Azure/AKS/issues/1413

All 16 comments

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/aks-pm

Hey, I wanted to bump this with some more information -- this is still occurring, and my team and I suspect the az commands themselves are what's generating too much traffic; I suspect they are polling too often trying to get the result of the operation they performed.

Is there a setting or a part of the code I can look at to see where the polling for the result happens? I'd like to be able to track this issue down, as it is kind of making AKS completely unusable.

Azure CLI internally uses msrestazure and msrest for long running operation polling.

Judging from the error, the throttling is triggered by HighCostGetVMScaleSet30Min. I checked the msrestazure code, but it doesn't seem to handle the Retry-After header on a throttling (429) error from the Compute Resource Provider.

https://github.com/Azure/msrestazure-for-python/blob/0a07586816351e3b7abd60a55a8b736e007fb7a8/msrestazure/polling/arm_polling.py#L128-L140

    def _raise_if_bad_http_status_and_method(self, response):
        """Check response status code is valid for a Put or Patch
        request. Must be 200, 201, 202, or 204.

        :raises: BadStatus if invalid status.
        """
        code = response.status_code
        if code in {200, 202} or \
           (code == 201 and self.method in {'PUT', 'PATCH'}) or \
           (code == 204 and self.method in {'DELETE', 'POST'}):
            return
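        # Note: a throttled 429 is not in the accepted set above, so it falls
        # through to the BadStatus below instead of being retried here.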
        raise BadStatus(
            "Invalid return status for {!r} operation".format(self.method))

@lmazuel, could you please help confirm?

At the same time, I am reaching out to the ARM and AKS teams to check the root cause of the throttling.

This 429 throttling is a known issue with AKS calling VMSS, which we are investigating as part of the service issue.

A support request is in flight, SR: 120012323000897.

I am going to close this issue as it is not CLI-related and we have an on-call team investigating.

@jluk -- regardless of the service request, I believe the command-line tool is also not handling 429 responses appropriately, according to MSDN (which I linked above). And it still leaves the user with no recourse, which is unfortunate.

Obviously it would be great if the service issue were solved, but regardless, I believe the tool may be behaving inappropriately (per the above), and I would recommend you leave this bug open and in the backlog (if 429 were handled appropriately, it could help mitigate issues like this in the future).

By the way, regarding the SR, is there any ETA on fixing this issue? -- I have my own SR open, but it's been 6 days and I've yet to hear of any progress on the issue.

Since CLI core is already tagging an owner to investigate, I will reopen and defer to them to handle issue closure.

The VMSS throttling is still under investigation; we've made multiple improvements in the cloud provider for the latest 1.15 patch versions and the latest CA add-on versions. There is an additional set of work we are engaging in; for ETAs, please look to azure/aks for an issue summary in the coming days.

Thanks @jluk, much appreciated! 🙂

Similar here:

VMSSAgentPoolReconciler retry failed: autorest/azure: Service returned an error. Status=429 Code=\"OperationNotAllowed\" 
Message=\"The server rejected the request because too many requests have been received for this subscription.\" 
Details=[{\"code\":\"TooManyRequests\",\"message\":\"{\\\"operationGroup\\\":\\\"HighCostGetVMScaleSet3Min\\\",\\\"startTime\\\":\\\"2020-01-27T17:20:31.4086198+00:00\\\",\\\"endTime\\\":\\\"2020-01-27T17:22:31.4086198+00:00\\\",
\\\"allowedRequestCount\\\":200,\\\"measuredRequestCount\\\":472}\",\"target\":\"HighCostGetVMScaleSet3Min\"},
{\"code\":\"TooManyRequests\",\"message\":\\{\\\"operationGroup\\\":\\\"HighCostGetVMScaleSet30Min\\\",\\\"startTime\\\":\\\"2020-01-27T17:20:31.4086198+00:00\\\",\\\"endTime\\\":\\\"2020-01-27T17:45:00+00:00\\\",\\\"allowedRequestCount\\\":1002,\\\"measuredRequestCount\\\":3030}\",\"target\":\"HighCostGetVMScaleSet30Min\"}] 
InnerError={\"internalErrorCode\":\"TooManyRequestsReceived\"}

The only difference is that two errors are thrown, one for HighCostGetVMScaleSet30Min and one for HighCostGetVMScaleSet3Min. I was running Terraform to add a node pool.

Found the issue in the Azure health center; will try disabling the autoscaler for now.

Also of note -- the allowedRequestCount is crazy low today -- 200 for you -- and 270 for me (before it was ~600, then ~1300, now ~200). -- Not sure why the allowed request count is going so crazy.

EDIT: My bad: there are multiple allowedRequestCounts in the response -- looks like our values are similar. Never mind.

FYI -- for anyone experiencing this issue: https://github.com/Azure/AKS/issues/1413

I raised a support ticket with Microsoft and got the recommendation below, which worked for now. Microsoft is working on a fix in a later version of AKS.

  • Remove the autoscaling feature from the AKS clusters in the subscription and wait about 2 hours for the API to lower the throttle.
  • Keep resource limits on all deployments and make sure they don't exceed 80% of the total available resources.

As discussed with the author of msrest:

429 is honored as part of the regular requests/urllib3 integration:
https://github.com/urllib3/urllib3/blob/728d9244665ef5b03103cb74d7b409ebe4f23b43/src/urllib3/util/retry.py#L154

It will be honored only if there is a Retry-After header, though, as per the HTTP spec. You will not find this code in the ARM implementation, because it's not ARM-specific (ARM polling contains only code that is specific to ARM's interpretation of HTTP).
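
For reference, a minimal sketch of how that requests/urllib3 retry policy is typically wired up (Retry, HTTPAdapter, and Session are standard APIs; the specific counts and status list here are illustrative, not what msrest actually configures):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # 429 is in urllib3's RETRY_AFTER_STATUS_CODES, so a Retry-After header is
    # honored by default when present; backoff_factor adds an exponentially
    # growing sleep between attempts when the header is absent.
    retry_policy = Retry(total=5, status_forcelist=[429, 503], backoff_factor=1.0)

    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry_policy))
    # Requests made through this session now retry throttled responses before
    # the caller ever sees an error.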

If your 429s had a Retry-After header, the retries have likely been exhausted before you actually see the error. Receiving a 429 exception doesn't mean you didn't retry; it means you're out of retries.

If you reach a 429 because of an ARM quota limit and you have exhausted all your retries and Retry-After waits, there is not much the SDK can do except fail. Think about the possible implications if we tried to do something more: if you reach your hourly quota, what would you expect the SDK to do then -- wait 30 minutes until the quota is back? Especially if there is parallel usage of the subscription, there is no guarantee that you get the quota back at the end of the calculated time. And then, do you wait even more? https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling#error-code

We will drive AKS team to fix the polling issue https://github.com/Azure/AKS/issues/1413 on the service side.

Hey, thanks for responding and for following up with the AKS team. Apologies for the long wait; we finally engaged with Azure support, and this fell off my radar.

  1. FYI: It looks like the original issue was not that the service was reporting 429 back to az.exe -- it was that the AKS load balancer was encountering 429 issues on its own, and that was being reported to the az.exe tool. It's odd, though, that the deployment might have succeeded, with the load balancer in a good state, yet the command would still fail with that error; it leaves a scripter in a very awkward state (see item 3 below).

    But, with that in mind, it may be that the tool is handling these requests appropriately and that this bug is moot. (I'm unable to confirm either way. But if y'all believe 429 is handled appropriately by az.exe, then by all means, feel free to re-close this bug.) 🙂

  2. FYI: per RFC 6585, Retry-After is optional (as indicated by the word MAY in all uppercase). Don't get me wrong; it's really nice when it's present, and I'm of the opinion that if you write a service that returns 429 you really should include it, but from an "HTTP spec" perspective it's 100% optional. A 429 doesn't mean you are out of retries; it means you are being throttled: you've made too many requests, and you should retry later. The response can define when that "later" is via Retry-After, but it's not required to do so.

    Generally, when "later" is not defined, in a long-running service environment we use exponential back-off (as documented in the MSDN guidance I previously linked), because an extra 30 minutes (or even 3 hours) is no big deal. The usual strategy we use with Azure is: if a Retry-After is present, honor it; if not, fall back to exponential back-off. (This was even recommended to us by Azure engineers in the past, and it works well in my experience -- we ran a service that managed and provisioned several hundred ARM VM environments a day with that strategy.)

  3. A good fall-back strategy, if you need to "give up" after a certain amount of time or number of retries, is to make the commands idempotent -- so that a user can retry the command later on their own and resume where they left off. This was the other part of the issue we encountered: a command failed, and even waiting a day and re-running it would still fail, because the command was not idempotent. (Perhaps adding a --force or other switch could fix that in this case; a rough sketch of working around it from a script is below.)
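
For illustration, a rough sketch of the kind of idempotent wrapper a script can use in the meantime (the resource names are the ones from my original command; this assumes az aks nodepool show exits non-zero when the pool doesn't exist):

    import subprocess

    # On Windows/cmd.exe, "az" resolves to az.cmd, so shell=True (or "az.cmd") may be needed.
    def nodepool_exists(resource_group, cluster, name):
        """Best-effort existence check via 'az aks nodepool show'."""
        result = subprocess.run(
            ["az", "aks", "nodepool", "show",
             "--resource-group", resource_group,
             "--cluster-name", cluster,
             "--name", name],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    # Only attempt the add if the pool is not already there, so re-running the
    # script after a throttled/failed poll doesn't error out on an existing resource.
    if not nodepool_exists("MyResourceGroup", "MyClusterName", "window"):
        subprocess.run(
            ["az", "aks", "nodepool", "add",
             "--resource-group", "MyResourceGroup",
             "--cluster-name", "MyClusterName",
             "--name", "window", "--node-count", "2"],
            check=True,
        )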

As for item 2, I think it is hard-coded in msrestazure to only retry 4 times. @lmazuel, do you think it is possible to expose this config so that users can control the behavior? Maybe via environment variables.

As for item 3, indeed it is good practice to make az commands idempotent. Commands like az storage account create do follow this pattern. @jluk, is it possible for your team to make this command idempotent as requested?

I have a similar problem happening with az storage list. The command will fail with a 429 message and does not appear to respect the throttling:

ValidationError: ("{\"error\":{\"code\":\"ResourceCollectionRequestsThrottled\",\"message\":\"Operation 'Microsoft.Storage/storageAccounts/read' failed as server encountered too many requests. Please try after '17' seconds. Tracking Id is '78db6dda-b7da-45cf-b763-202cdedf94f4'.\"}}")

This cascades down and breaks items that depend on the Azure CLI such as Terraform.

While I believe the source of the problem is the actual backend (the list request is only 4 actual requests), the Azure CLI should have a way to handle this gracefully.

Edit: Removed the Terraform note. They're using the REST / Go SDK so it appears to be the same issue with simply listing the accounts but not directly related.
