Azure-docs: AzureDisk-backed PersistentVolumes cause moved Pods to take a very long time to start up

Created on 18 May 2020 · 9 comments · Source: MicrosoftDocs/azure-docs

I used the instructions here to give the Pods in my Kubernetes app Persistent Volumes backed by Azure Disks that I provisioned separately. All worked exceptionally well until I enabled the Cluster Autoscaler. Now, when I remove enough Pods from my AKS cluster that the Cluster Autoscaler decides to remove a Node from the VM Scale Set, Pods on that node that need to be rescheduled have to wait for the node to shut down before it releases their volumes. Until then the pod reports: "Multi-Attach error for volume "<volume name>": Volume is already exclusively attached to one node and can't be attached to another". Only once the Node shuts down can the volume finally be remounted elsewhere. Please mention this in the docs and, if possible, provide a link to a workaround. Thanks.
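In case it helps anyone reproduce it, here is roughly how the symptom shows up for me; the pod and node names below are placeholders:

```shell
# The rescheduled Pod sits in ContainerCreating with a FailedAttachVolume event
kubectl describe pod my-app-0
#   Warning  FailedAttachVolume  ...  Multi-Attach error for volume "pvc-..."

# Meanwhile the disk is still reported as attached to the node being removed
kubectl get node aks-nodepool1-12345678-vmss000002 \
  -o jsonpath='{.status.volumesAttached}{"\n"}{.status.volumesInUse}{"\n"}'
```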



Labels: Pri2, container-service/svc, cxp, product-question, triaged

Most helpful comment

Unfortunately, the current slowness is at the Azure Compute (CRP) level. The main issues are:

  • Disk attach/detach latency is high
  • Cluster scale-up is slow

    • Parallelization of disk attach/detach on VMSS/VMAS is currently limited to 3

The CRP team is working on this; the current target date is around October this year.

Also, there is a new VHD disk feature based on Azure Files that can attach/detach a disk in under a second. Consider it as an option if disk attach/detach time is a real concern: https://github.com/kubernetes-sigs/azurefile-csi-driver/tree/master/deploy/example/disk

All 9 comments

@emacdona Thanks for the question! We are investigating and will update you shortly.

@emacdona, could you share the version of the AKS cluster you are seeing the error on?

Did you mean Kubernetes version? If so:
kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T23:41:24Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.10", GitCommit:"059c666b8d0cce7219d2958e6ecc3198072de9bc", GitTreeState:"clean", BuildDate:"2020-04-03T15:17:29Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

If not, how do I find the version of AKS that I'm using?

That's good enough. v1.15.10 is the Kubernetes version you selected when you installed AKS.
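For reference, the cluster's server-side Kubernetes version can also be read with the Azure CLI; the resource group and cluster names here are placeholders:

```shell
az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query kubernetesVersion -o tsv
```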

How many disks are getting detached when the node scales down?

https://docs.microsoft.com/en-us/azure/aks/troubleshooting#large-number-of-azure-disks-causes-slow-attachdetach
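As a rough way to count them, you can look at what the node that is about to be removed reports as attached; the node name here is a placeholder:

```shell
# Counts the volumes the node currently reports as attached
kubectl get node aks-nodepool1-12345678-vmss000002 \
  -o jsonpath='{range .status.volumesAttached[*]}{.name}{"\n"}{end}' | wc -l
```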

Here's a list of known issues, with the versions they were fixed in, as well as still-open issues.

https://github.com/andyzhangx/demo/blob/master/issues/azuredisk-issues.md#25-multi-attach-error

Note that on 1.15.10 there are still a few issues that were only fixed after this release.

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.15.md#other-bug-cleanup-or-flake

  • Fix: add remediation in azure disk attach/detach (#88444, @andyzhangx) [SIG Cloud Provider]
  • Fix: get azure disk lun timeout issue (#88158, @andyzhangx) [SIG Cloud Provider and Storage]
  • Add delays between goroutines for vm instance update (#88094, @aramase) [SIG Cloud Provider]

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.15.md#bug-or-regression

For volumes that allow attaches across multiple nodes, attach and detach operations across different nodes are now executed in parallel. (#89241, @verult) [SIG Apps, Node and Storage]

How many disks are getting detached when the node scales down?

https://docs.microsoft.com/en-us/azure/aks/troubleshooting#large-number-of-azure-disks-causes-slow-attachdetach

So, I'm running 25 StatefulSets, each with one Replica, and each with a PVC bound to a PV backed by an Azure Disk. The NodePool starts with 2 nodes, and the Cluster Autoscaler is configured with a minimum of 2 nodes and a maximum of 5. Just giving some context for the next statement...

When I deploy the app, it scales up to 4 nodes, so I'm guessing it places either 6 or 7 Pods per Node. That's under the 10-volume threshold in the issue mentioned above.
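For concreteness, each of those 25 pairs looks roughly like the following. This is only a sketch of the static-provisioning pattern from the doc; the names, size, and diskURI are placeholders for my real values.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: app-00-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-premium
  azureDisk:
    # Pre-provisioned managed disk, referenced statically per the doc in the original post
    kind: Managed
    diskName: app-00-disk
    diskURI: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/disks/app-00-disk
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-00-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-premium
  volumeName: app-00-pv
  resources:
    requests:
      storage: 10Gi
```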

What's REALLY interesting about this is that it appears the Autoscaler is trying to work around the issue. My first attempt at fixing the problem was switching to Azure File Shares, and when I did that, the Autoscaler took a VERY long time to decide to go from 2 nodes to 3. In the meantime, the app was unusable because the Pods kept failing to come up and were continuously restarting. That's another issue for another time; I only mention it because the Autoscaler seems to behave differently for AzureDisks vs. AzureFiles.

Here's a list of known issues, with the versions they were fixed in, as well as still-open issues.

https://github.com/andyzhangx/demo/blob/master/issues/azuredisk-issues.md#25-multi-attach-error

This looks promising, but the workaround seems to apply only to Rolling Updates, and I'm experiencing the problem during an Autoscaler scale-down event. I'll research further and see if it still applies, though.

Thanks!
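(If I'm reading it right, the workaround in question is the usual one of switching a Deployment's update strategy to Recreate, so the old Pod releases its disk before the replacement tries to attach it. The fragment below is hypothetical; the linked page has the authoritative details.)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical name
spec:
  replicas: 1
  strategy:
    type: Recreate        # delete the old Pod before creating the new one,
                          # so the Azure Disk is detached before re-attach
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx    # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: app-00-pvc
```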

Actually, the more I look at that workaround, the more it looks like it will prevent the actual error message from happening but do nothing to shorten the amount of time required to move a pod from one node to another.

Sorry, I'm in a state where I can't test that theory at the moment. When I do get a chance to test it, I'll report back :-) (Probably tomorrow morning).

Unfortunately, the current slowness is at the Azure Compute (CRP) level. The main issues are:

  • Disk attach/detach latency is high
  • Cluster scale-up is slow

    • Parallelization of disk attach/detach on VMSS/VMAS is currently limited to 3

The CRP team is working on this; the current target date is around October this year.

Also, there is a new VHD disk feature based on Azure Files that can attach/detach a disk in under a second. Consider it as an option if disk attach/detach time is a real concern: https://github.com/kubernetes-sigs/azurefile-csi-driver/tree/master/deploy/example/disk
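For anyone curious what that option looks like, here is a rough, unverified sketch of a StorageClass for the Azure File CSI driver's VHD mode; the linked example has the exact provisioner name and parameters to use.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-vhd
provisioner: file.csi.azure.com   # Azure File CSI driver
parameters:
  skuName: Premium_LRS
  fsType: ext4                    # assumption: specifying an fsType is what selects
                                  # the VHD-on-Azure-Files mode; see the linked example
reclaimPolicy: Delete
```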

Thanks for the info! I'll check out the vhd disk feature.

Thanks @andyzhangx

@emacdona Thanks for bringing this to our attention. We will now close this issue. If there are further questions regarding this matter, please tag me in a comment. I will reopen it and we will gladly continue the discussion.
