Kubernetes: If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion

Created on 23 Jun 2018 · 116 comments · Source: kubernetes/kubernetes

/kind bug

What you expected to happen:
When a pod using an attached volume is deleted (gracefully) but kubelet in the corresponding node is down, the AttachDetachController should assume the node is unrecoverable after a timeout (currently 6min) and forcefully detach the volume.

What happened:
The volume is never detached.

How to reproduce it (as minimally and precisely as possible):

  • Create a pod with a volume using any of the attachable plugins (I used GCE PD to test).
  • Stop kubelet inside the node where the pod is scheduled.
  • Delete the pod.
  • Wait for 6min+
  • Check to see if the volume is still attached.

Anything else we need to know?:
This doesn't happen if the pod is force-deleted.

It's likely due to the last condition checked in this line. Once kubelet is down, the container status is no longer reported correctly. Inside the attach/detach controller, this function is called by the pod update informer handler, which sets whether the volume should be attached in the desired state of the world. On pod force deletion, the pod object is deleted immediately, and this triggers the pod delete informer handler, which doesn't call this function.
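For illustration, here is a rough sketch (in Go, not the actual Kubernetes source) of the kind of termination check described above; the function name is made up. The point is that a gracefully deleted pod only counts as terminated once no reported container status is Running, and with the kubelet down those statuses are never updated.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Approximation of the check discussed above, not the exact attach/detach
// controller code. With the kubelet down, ContainerStatuses are never updated
// and keep reporting Running, so a gracefully deleted pod never looks
// terminated and its volume stays in the desired state of the world.
func looksTerminated(pod *v1.Pod) bool {
	if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
		return true
	}
	if pod.DeletionTimestamp == nil {
		return false
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Running != nil {
			// A stale "Running" status from an unreachable kubelet blocks this path.
			return false
		}
	}
	return true
}
```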

/sig storage
/cc @saad-ali @gnufied @jingxu97 @NickrenREN
/assign

kind/bug lifecycle/frozen priority/important-soon sig/storage

Most helpful comment

@msau42 I can only speak for my own experience here, but I've seen behavior like this when the underlying hardware for a VM running a node fails. The cluster admin never has the opportunity to cordon/drain the node first in that scenario.

This prevents the volume from being reattached to another node even if the cloud provider knows that the instance is powered down (and therefore that the volume definitely isn't in active use on that node) as to my knowledge there isn't a node taint that lets us signal this status to k8s.

It's less of an issue if you have node autoscaling available (as the volume can be attached to another node as soon as the failed node is deleted), but when that's not an option, as far as I know any pods with attached PVs can't be rescheduled until the node comes back online or is removed from the cluster entirely. (It's been a while since I've run into this myself and looked into it though, if that's no longer the case I'd be glad to be wrong about this!)

All 116 comments

[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process

@verult


Issue Labels

  • sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.



Maybe I hit a similar case, but not quite the same.

The general sequence is as below:

  • Configure volumes.kubernetes.io/controller-managed-attach-detach so that the attach/detach controller is in charge of both attach and detach operations;
  • Create a statefulset with volumes provisioned by glusterfs;
  • Use iptables to simulate a master-worker network partition. As a result, the kubelet on the worker node can't talk to the apiserver;
  • The NodeLifecycleController subsequently detects the unreachable node and evicts the pods on it;
  • The evicted pods are recreated and scheduled to another available node with their previous volumes.

The attach/detach controller works as expected. I used version 1.9.3.

@orainxiong is the terminationGracePeriod set inside the statefulset?

terminationGracePeriod is 0 if I remember correctly. I guess it might be the root cause.

@saad-ali can we have an update whether someone is working on this issue?

I can take a look at this

/assign

@saad-ali, @verult and I scheduled a meeting to talk about this issue on Monday, Aug 27, 11am (New York, USA time).

Hi @saad-ali, @verult and @gnufied.

I have a concern about how to fix this issue.
Regarding the implementation of https://github.com/kubernetes/kubernetes/pull/67847 / https://github.com/kubernetes/kubernetes/pull/67419, I am not sure the PR design is safe enough.
The solution proposed in this PR is to leverage DeletionGracePeriodSeconds instead of using podStatus.ContainerStatuses (HERE).

It's good not to rely on podStatus.ContainerStatuses, because in a crashed-worker scenario the kubelet is not responsive and the ContainerStatuses will remain Running forever (which prevents detaching the pod's volumes). So adding DeletionGracePeriodSeconds as another trigger to force detach the PVCs of a non-responsive pod is a good idea.

BUT I think we should be very careful about force detaching PVCs, because in a split-brain scenario (the worker node didn't really crash, it was just a network issue) the pod may still be running and writing I/O to the PVC, and detaching the PVC may lead to data corruption.
So maybe it is too brutal to trust DeletionGracePeriodSeconds (AFAIK the customer-facing side of this timeout is TerminationGracePeriodSeconds, which defaults to 30 seconds if you didn't specify it in the pod spec). With this PR's implementation, there could be applications that get a force detach during a split brain and cannot handle such a detach.

I would like to propose a safer implementation that, on one hand, detaches volumes for pods in the crashed-node scenario, but on the other hand is also safe enough for the split-brain scenario.
The proposal is to add a new annotation to the pod spec (e.g. AllowForceDetachAfterGracePeriod, defaulting to false). Only if the customer sets it to true in the pod spec
will k8s force detach the PVCs of that pod; if it is false, then even after DeletionGracePeriodSeconds k8s will not force detach - which keeps the existing behavior - so the customer will have to delete the pod manually (in a safe, manual way to prevent data loss).
By adding this safety annotation we give the customer the ability to decide whether the pod can recover from a detach operation. It would also make it feel safe to cherry-pick this fix to older k8s versions, because it does not impact any existing behavior (only pods with this annotation get the new behavior).

Thoughts?
Let me know if it's overkill, or whether it's safer for the user to avoid potential data loss.
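To make the proposal concrete, a minimal sketch of what such an opt-in gate could look like; the annotation key and helper name are hypothetical illustrations of the idea above, not an existing Kubernetes API.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Hypothetical annotation key for the opt-in proposed above; it is not part
// of any existing Kubernetes API.
const allowForceDetachAnnotation = "volumes.kubernetes.io/allow-force-detach-after-grace-period"

// allowsForceDetach would gate the force-detach-after-DeletionGracePeriodSeconds
// behavior so that only pods which explicitly opted in are affected; all other
// pods keep the existing behavior of never being force detached.
func allowsForceDetach(pod *v1.Pod) bool {
	return pod != nil && pod.Annotations[allowForceDetachAnnotation] == "true"
}
```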

@gnufied and I discussed this offline. Please correct me if I'm wrong, Hemant, but your argument was that setting the annotation to true is equivalent to setting TerminationGracePeriod very high, which also "disables" this detach behavior by delaying it indefinitely.

On the other hand, because the default TerminationGracePeriod is 30s, we can't safely cherry-pick this PR back to older versions.

One option is to use an annotation + DeletionGracePeriod for all versions prior to 1.12, and only DeletionGracePeriod for 1.12 and later.

One of the other things we could do is - only detach volumes from a node when it is known that it has been shutdown. So basically make use of newly introduced taint to detach the volume. This does mean that volumes never get detached from unresponsive or crashed kubelet nodes but that is probably fine. If user really wants to force the detach he can always do that via:

~> kubectl delete pod <foo> --force=true --grace-period=0

This will mean, we will have to roll this PR into https://github.com/kubernetes/kubernetes/pull/67977 and we obviously can't backport.

Older versions of AWS, OpenStack and vSphere are not affected by this bug, btw - because in those cloud providers a node is deleted when it is shut down. It is only recently that this behaviour was changed in those cloud providers so that the node object is no longer deleted; hence pods remain in "Unknown" phase and the volume never gets detached.

This bug affects versions as far back as 1.9 (or even before that), but the cloud provider behaviour is a recent change. Another problem is that using node shutdown to trigger volume detach has been feature-gated as an alpha feature.

I do not like the annotation option that much, tbh. It is hard to document and easy to miss, and since it is an API change it also has to go through alpha and beta phases. Agreed that https://github.com/kubernetes/kubernetes/pull/67977 is currently feature gated too, but I do not see a strong reason why that should be feature gated. cc @yastij

@gnufied - some production clusters rely on quick reboots (i.e. reboot and still have the disks attached). We might want to give them some room for the transition.

@yastij shouldn't the pod eviction timeout (5 minutes by default) plus the expiry window after a node stops sending heartbeats and before it is considered lost be enough? There is also the grace period, obviously.

Another reason I do not particularly like the annotation is that detaching volumes from shutdown nodes should be a default behavior and should not require user intervention. If we can differentiate between shutdown nodes and unresponsive kubelets, then we should use that information for detaching volumes, and it should be the default rather than something the user has to configure.

I agree with @gnufied on annotations.

the plan is to promote this to beta in 1.13. From what I've seen, the eviction timeout impacts all pods, which might be tricky: for aggressive clusters this doesn't work, as the pod timeout and node grace period are set lower than usual. This window (1 release) ensures that everything goes smoothly.

Good arguments on both sides. On the one hand, we have to be very careful not to cause data corruption. On the other hand, we don't want to get into a state where the user's data is stuck and requires manual intervention to be accessible from new workloads.

I agree with @gnufied, we should avoid adding another annotation -- whenever there is a question like this, exposing the option to the end user is the easiest way out, but not necessarily the best. We should try to be smart and do the right thing automatically on behalf of the user.

The right thing in this case is tricky. If a node is not coming back, we should move the disk even if it is not safely unmounted. But how do we differentiate that from a split brain? A timer-based algorithm is pretty crude, but it's the best approximation we currently have. And I'm not aware of any way to detect a split brain vs. a dead node, so I think we need to weigh the likelihood of the different scenarios happening, the worst case for each option, and how willing we are to tolerate that.

I'll leave it up to you guys to weigh those tradeoffs.

@gnufied actually in OpenStack nodes are not deleted anymore: https://github.com/kubernetes/kubernetes/pull/59931. It basically means that in OpenStack, if you shut down an instance which had a pod with a PVC, it cannot fail over to another instance.

However, #67977 will fix this issue

I just talked with @yastij; we should apply the feature gate only to the detach delay. Then volumes can be detached in OpenStack again: if the feature gate is turned on, the delay is something like 1 minute, and if it is not turned on, the detach delay is something like 8-10 minutes.

You are correct that in AWS this problem does not exist, but if #59930 is approved (as it should be, because the behaviour of other clouds, and the spec, is that nodes should not be deleted), then AWS will get this problem as well if #67977 is not approved.

Some notes I took during the conference meeting today (@verult please add your notes as well):

Do we have any notion of the timeframe on resolving this issue? Code freeze is next week, at which point only critical/urgent bugs will be permitted into the milestone.

Thank you! :)

@shay-berman PR #67977 should still be on hold because without actual handling of pods from shutdown nodes the PR #67977 has no effect whatsoever. The volumes still will not be detached from shutdown nodes.

cc @msau42 @jingxu97

@gnufied yes it does? Let's say we have 2 nodes: node1 and node2. Then we have a deployment called foobar which has a PVC in it. foobar is now running on node1, and then node1 is shut down.

The result is that the foobar pod on node1 goes into Unknown state and the volume is detached via the volume dangling error. The foobar pod starts on node2 and the volume is attached (this now takes a little over 1 minute). What happens when node1 comes back? It just terminates the foobar pod on node1 and does not touch the volumes anymore.

I just tested it :)

@zetaab, what happens if it wasn't shut down?
I.e., the container using the PVC is still working and writing to the PVC,
communication with the kubelet has an issue for 1 hour, and then it comes back.

@zetaab the dangling error should just add a volume to the actual state of the world to handle the case of missed pod statuses. In itself it should never cause volume detaches. If it does, we probably have a bug. I have tested a similar scenario on AWS with the latest patch (https://github.com/kubernetes/kubernetes/pull/66835) that prevents the node object from being deleted, and volumes are NOT detached.

If on OpenStack volumes are being detached from shutdown nodes with pod objects still around, then I would call it magic/a bug. :-)

@itamarh like I commented on that PR, it does not do anything if there is a connectivity issue (no taint). Kubernetes itself starts eviction (I have a 1 minute eviction time), so after 1 minute it will try to create the pod on another node. However, because the volume is not detached, it cannot start.

I do not know whether it should force detach the drive in case of a connectivity issue or not. The behaviour before the cloud provider changes was that it would wait 6 minutes before force detaching.

@gnufied it's a little bit weird that pods are then evicted to another node? Should some pod controller check "does this have volumes on a node which is stuck"?

Let's consider the following scenario: I have a pod with a PVC located on node1; that node crashes and cannot start anymore. How should Kubernetes behave?

Possible scenarios:

  • do nothing: wait until an admin comes online (the admin can, for instance, detach the volume manually, and then the pod will start on another node; or they can just delete the broken machine, and the volume is detached. Automation can also delete machines?)
  • detach the drive after 6 minutes: I think this was the behaviour before the cloud provider updates?
  • ?

Just to add to what @shay-berman wrote, the complete discussion is as follows:

  1. Both https://github.com/kubernetes/kubernetes/pull/67977 and https://github.com/kubernetes/kubernetes/pull/67847 should be on hold for now, because, as we discovered, #67977 will have no effect if the pod is not removed from the DSWP (desired state of the world).

  2. The reason we chose to put https://github.com/kubernetes/kubernetes/pull/67847 on hold is that any timing value we choose for force detaching a volume from a pod in "Unknown" phase is fundamentally unsafe. This was the recommendation from @derekwaynecarr and @liggitt too.

  3. We also discussed the possibility of using the "Shutdown" taint for removing pods from the DSWP. The underlying idea is: if a node has the "Shutdown" taint, then all of its pods that are either fully terminated or have a DeletionTimestamp set could be removed from the DSWP and hence trigger detach. The problem with doing this in the attach/detach controller is that the attach/detach controller should use the pod's status as the single source of truth for performing detaches. It was pointed out to us that StatefulSet- and ReplicationController-backed pods may need to be special-cased, and hence if volumes MUST be detached from a node that is shut down, that decision should be based on the pod's status and should come from the node/pod lifecycle controller. For example, in the future the node lifecycle controller may mark containers of a pod on a shutdown/powered-off node as terminated/not running, and hence the attach/detach controller will automatically detach volumes from it. @derekwaynecarr and @liggitt had some valid points against special-casing this in the attach/detach controller.

@shay-berman the PR we were going to keep was this one that already merged, which introduces the node shutdown taint: https://github.com/kubernetes/kubernetes/pull/59323

I sent my meeting notes to SIG-storage. Let me know if you have trouble accessing (my Slack is cxing). I believe you need to be a member of SIG Storage to access.

https://groups.google.com/forum/#!forum/kubernetes-sig-storage

@jingxu97 @msau42 and @random-liu from sig-node discussed offline about potentially having a higher level controller to force delete a pod. One of the possible issues is that there could be a race condition if the node comes back up soon after.

Imagine we have a StatefulSet pod that has a non-attachable volume, and suppose the shutdown node comes back up right before the pod is force deleted. Kubelet on the node sees the pod is still scheduled on it and starts the containers. Then the pod is force deleted, and a new pod gets started on a different node. Now there's a chance the same containers are running on both nodes.

For intentional shutdown of a node for maintenance, where highly responsive transfer of workloads to another node is desired, there's no substitute for cordoning/draining workloads from the node prior to shutdown.

Is the desire here to handle intentionally shutting down a node without terminating it, and without cordoning/draining it, while still quickly moving workloads and volumes to another node?

@liggitt - Yes, it may also cover shutting down a node when it may need to be migrated and re-deployed.

For an intentional shutdown of a node, how come a node drain cannot be done?

@msau42 - Indeed it should be. The feature covers both anyway.

My understanding is this is more of a problem for unintentional shutdown, to make available volumes previously attached to the node without causing data corruption.

It is a long discussion. I think we have several issues covered here.

When the kubelet is not responsive, the states of the following objects are as follows:
Node: the node condition is Unknown.
Pod: pod eviction is attempted after (5) minutes. Eviction normally will not succeed because the kubelet will not respond and the pod object will remain in the API server. This causes a StatefulSet to fail to start a new pod.
Volume: the attach/detach controller will NOT detach the volume because the pod is considered not terminated (container status is still Running). This prevents the volume from being attached to a different node.

I think these are the correct states to have when the node condition is really unknown. But in the case of a shutdown, we should be able to do something to improve the situation.

Now we have the node shutdown taint available. When the cloud node controller detects that the node is in a shut-down state, it taints the node with the shutdown condition. PR https://github.com/kubernetes/kubernetes/pull/67977 is trying to use this taint to fix the volume detach problem.
After discussing with @yastij, we can make the following work:

  1. The node controller updates node status to remove VolumesInUse after the node is tainted as shut down (it could also mark containers as not running).
  2. The attach/detach controller will then detach the volume, since it is no longer in use (VolumesInUse removed from node status by step 1), without problems (it can pass the safe-to-detach check now).

I think https://github.com/kubernetes/kubernetes/pull/67977 can be considered as a bug fix instead of a feature. It does use node shut down taint introduced recently (it is not feature gated). It will not solve the problem for pod not being deleted from apiserver, but it can detach volume correctly. So could we push this PR first and then come up with some solution for pod deletion issue during node shut down? Thanks!

I think #67977 can be considered as a bug fix instead of a feature. It does use node shut down taint introduced recently (it is not feature gated). It will not solve the problem for pod not being deleted from apiserver, but it can detach volume correctly. So could we push this PR first and then come up with some solution for pod deletion issue during node shut down?

Making specific taints inform a "force detach" controller instead of scheduler/eviction logic doesn't seem safe. Coordinating on the existence of the pod object is what I expected.

Is there any precedent for using taints in this way?

The shutdown taint informing the container running condition makes more sense but is still racy

If the Pod object is not deleted, StatefulSet will not be able to recreate a replacement pod.

The node controller updates node status to remove VolumesInUse after the node is tainted as shut down (it could also mark containers as not running).

Is that before or after the volumes are actually released? Because a kubelet failure after the taint is added and before the volumes are released would still be problematic.

I have far more experience with traditional HA systems than I do with k8s, but a core old-world principle is that you just don't know what you don't know. You can use heuristics and make assumptions, but some of the time they will be wrong and depending on the workload that might be problematic.

tl;dr - I would advocate for solving this with an integrated fencing approach rather than heuristics

We've been investigating how to best integrate the concept of fencing into k8s for some time, due to exactly the type of scenario described here. Fencing isn't always a popular concept, but I like to think of fencing as a way to turn a question "Is X going to cause corruption or data loss?" into an answer "No", while also enabling automated recovery within a bounded timeframe.

Our current intention is to piggy back on elements like node-problem-detector and the in-progress Cluster/Machine APIs to create a more featured version of https://github.com/kubernetes-sigs/cluster-api/blob/master/tools/repair/util/repair.go

Since this has become topical, I am including a more detailed description (fixed link)* of our intentions so that the community can provide feedback while we update the Node fence design proposal.

[*] In our context, this topic is also linked with concerns about ensuring the design is feasible for bare metal. This is mostly orthogonal to the concept of and need for fencing as well as the recovery logic that we are looking to create.

Coordinating on the existence of the pod object is what I expected.

+1 leave the a/d controller dumb -- and react to pod state.

tl;dr - I would advocate for solving this with an integrated fencing approach rather than heuristics

Seems like the right approach. Improve the logic for determining the state of the node and update pod state accordingly, and the a/d controller (and other controllers) can react accordingly.

The question is, how soon will this bug get fixed if one PR after another is being held/rejected so often? This bug is quite critical, so IMHO it deserves to be fixed ASAP, even if the solution is not perfect.

I'd also like to suggest, that probably most network/distributed/replicated volume drivers should be able to independently determine if a volume is safe to detach from the given node, because they "know" which clients (nodes) are still connected to the storage server/cluster. So if a node is down (or just disconnected), some volume drivers could provide that information to k8s. Please correct me if I'm wrong.

@adampl @yastij @nikopen as an interim workaround, is it possible to write an external controller that watches for this node shutdown taint and force deletes the pods on that node? It's becoming clear that reacting specifically to this node shutdown taint in the attach/detach controller is racy and not going to help statefulset pods, so it is not the solution we want to continue pursuing.

@msau42 and if it is an external resource, do you mean something like a pod running in Kubernetes which watches node taints and reacts accordingly? How can we instruct everybody to install it? Or a new controller inside Kubernetes?

Yes, it can be run as a pod in kubernetes. The distro/provider would need to provide instructions on how to run it.

It would be a similar issue if we alpha gated a workaround in-tree. You would need to instruct users to enable the feature gate.
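For anyone exploring this route, a minimal sketch of what such an external workaround controller could look like, assuming client-go and the cloud node controller's shutdown taint key; the exact taint key and the polling interval are assumptions, and a real controller would use informers and more careful safety checks.

```go
// Sketch of the external workaround: watch nodes, and when a node carries the
// shutdown taint, force delete the pods scheduled on it so their volumes can
// be detached and the pods rescheduled. Illustration only.
package main

import (
	"context"
	"log"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Assumed taint key added by the cloud node controller on shutdown.
const shutdownTaintKey = "node.cloudprovider.kubernetes.io/shutdown"

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	for {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			log.Printf("listing nodes: %v", err)
		} else {
			for _, node := range nodes.Items {
				if hasTaint(&node, shutdownTaintKey) {
					forceDeletePodsOnNode(ctx, client, node.Name)
				}
			}
		}
		time.Sleep(30 * time.Second) // crude poll instead of an informer, for brevity
	}
}

func hasTaint(node *v1.Node, key string) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

func forceDeletePodsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		log.Printf("listing pods on %s: %v", nodeName, err)
		return
	}
	grace := int64(0)
	for _, pod := range pods.Items {
		// Equivalent to `kubectl delete pod --force --grace-period=0`.
		err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace,
		})
		if err != nil {
			log.Printf("force deleting %s/%s: %v", pod.Namespace, pod.Name, err)
		}
	}
}
```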

@msau42 @liggitt @saad-ali - Do we want to fix this for 1.12 ? If yes we can file an exception for this and start considering a taint based eviction design.

Thanks for all the discussion. The fix I proposed in this comment https://github.com/kubernetes/kubernetes/issues/65392#issuecomment-417509512 is NOT a force detach actually.
When the node is tainted as "shut down", we are sure that the mounts are gone and the containers are no longer running. It is safe for the node controller to mark container status and remove the VolumesInUse information from node status (note the VolumesInUse field is used to indicate which volumes are mounting/mounted, so that it avoids the race condition between master and kubelet of detaching while a volume is still mounted).
Now the attach/detach controller does not need any change at all. When the pod is evicted by the node controller, since the container status is no longer running, the attach/detach controller will first check whether it is safe to detach and then try to detach, because the node status shows that the volume is no longer mounted.
So I think this fix is safe.

The proposal @msau42 had can also help fix the problem, but it might cause a race condition during force delete: if the shut-down node happens to restart at that moment, detach might happen while the volume is mounting. Please let me know if you have any questions about the fix or any concerns. Thanks!
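A rough sketch of what step 1 of this fix (clearing VolumesInUse once the shutdown taint is present) could look like; the helper names are made up and the taint key is an assumption, so treat this as an illustration rather than the actual node controller change.

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Assumed taint key set by the cloud node controller on shutdown.
const shutdownTaintKey = "node.cloudprovider.kubernetes.io/shutdown"

// clearVolumesInUseIfShutdown illustrates the proposed step: once a node
// carries the shutdown taint, clear its VolumesInUse so the attach/detach
// controller's safe-to-detach check can pass. Not actual Kubernetes code.
func clearVolumesInUseIfShutdown(ctx context.Context, client kubernetes.Interface, node *v1.Node) error {
	if !hasShutdownTaint(node) || len(node.Status.VolumesInUse) == 0 {
		return nil
	}
	node = node.DeepCopy()
	node.Status.VolumesInUse = nil
	_, err := client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}

func hasShutdownTaint(node *v1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == shutdownTaintKey {
			return true
		}
	}
	return false
}
```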

When the node is tainted as "shut down", we are sure that the mounts are gone and the containers are no longer running. It is safe for the node controller to mark container status and remove the VolumesInUse information from node status

This may be problematic for StatefulSets. Let's say a pod backed by a StatefulSet is running on node X. The node is shut down and is tainted. The pod's containers are marked as "NotRunning" and the volume gets detached from node X. Now when node X does come back, the pod backed by the StatefulSet will no longer be able to run on it. AFAICT there will be no new events related to the pod, and hence the attach/detach controller will not reattach the volume.

@jingxu97 @dchen1107 and I brainstormed designs for a long term solution. To fully fix this issue and to be able to address possible race conditions of the node coming back up at any time, we'll need to carefully think through how to handle synchronization between node controller, attach/detach controller, kubelet, and possibly other higher level controllers like StatefulSet to avoid detaching volumes while they are actually still mounted. I don't think there will be time to address this in 1.12 even with an exception.

As a short term workaround, if you really need this behavior, we are suggesting the solution of:
1) Making the force detach timeout in A/D controller configurable. Somebody needs to create a PR for this for code freeze.
2) Implementing an external controller that force deletes Pods on a node that is shutdown.

Just beware that both of these workarounds can increase the risk of detaching volumes when they are still mounted.

Just to be clear, "making force detach configurable" will do nothing to change the logic of volume detachment. It will only cause a faster detach when the pod is already deleted.

@msau42 now that nodes are not deleted from the cluster anymore (a cloud provider behaviour change in most clouds), configuring that detach timeout does not matter. Pods cannot move anymore. The behaviour is this: https://github.com/kubernetes/kubernetes/pull/67977#issuecomment-417233035

If I remember correctly, the old behaviour before the cloud provider change was that after the force detach timeout it detached the drive and the pod was able to move.

So without that external controller we might see some bug tickets about this.

The previous behavior is that when a Node object gets deleted, the garbage collector will force delete pods on that node. Now that the Node object is not deleted anymore, there is nothing to force delete pods. (The node controller will evict pods after 5 minutes of kubelet unresponsiveness, but it is not a force delete). So this external controller workaround is to mimic that similar behavior as if the Node object is deleted, however, with the increased risk that the Node may actually come back.

I take back my recommendation to make force detach configurable and shorten it, because it's going to impact our detach-while-still-mounted safety mechanism even in normal pod termination scenarios.

So with just the controller workaround, the volume will get force detached after 6 minutes of the pod being force deleted.

So it seems that:

Shutdown taint + InstanceShutdownByProviderID gives the ability to differentiate between Shutdown and Unresponsive kubelets by properly identifying shutdowns, verifying a 'safe to delete/detach' status.

However, 'ISBPID' is currently implemented only for AWS and OpenStack, with the rest likely not making it for 1.12 (?).
And the above cannot cover bare-metal at all.

As for its functionality itself, in its finalized form it can be as simple and graceful as initiating a cordon/drain in conjunction with the cloud provider. I think it's best for the shutdown taint to remain as simple as possible in itself, exposing a solid status and letting other actors act on it.

@msau42 I'm thinking of the following behaviour for the 'temporary' external controller:

Node shutdown case (when taint is implemented and working):

  1. force delete all pods after a short timeout (e.g. 5 sec)

Unresponsive kubelet case:

  1. Catch event of unresponsive kubelet
  2. Confirm it stays unresponsive several times over X minutes (e.g. poll status every 20 sec for 4 mins)
  3. If it's still down after max timeout, force delete all pods associated with it.

The above can be relatively safe. If there's something more that can make it safer it would be good to add it. Can start working on it soonish.

A custom extra failsafe could be a sidecar pod, for pods with volumes and/or StatefulSets, that checks at intervals whether the main pod is active / reading and writing from the volume. If not, it can talk to the controller and verify that it's safe to delete & detach. Any suggestions welcome

For 1.13 and beyond, it would be great to form a team to deliver a solid solution over volume detaching on dropped nodes for the longer term, by handling synchronization/utilizing fencing/other.

@nikopen

The above can be relatively safe

Sure, if the node has lost power; but there are also scenarios (which look identical from within the system) where the pod could be alive and writing to the attached volume, e.g.:

  • Network outage or misconfiguration, unless kubelet is using the same path as the volume AND writes require end-to-end confirmation
  • Wedged kubelet

A custom extra failsafe could be a sidecar pod, for pods with volumes and/or StatefulSets, that checks at intervals whether the main pod is active / reading and writing from the volume. If not, it can talk to the controller and verify that it's safe to delete & detach. Any suggestions welcome

If the controller is local, then you'll be able to reach it during a network outage, but presumably it won't have any additional knowledge to verify that it's safe to delete & detach.

If the controller isn't local, then it will be able to do the verification, but you won't be able to reach it from inside the node while the network is wonky.

Implementing an external controller that force deletes Pods on a node that is shutdown.

@msau42 @jingxu97 IIUC, shutdown here means the node is tainted? What about the volumes on bare-metal nodes which are not managed by cloud providers?

@NickrenREN - This has to be done by some external controller.

So we need a detailed and complete proposal for this.
Any design or proposal link for this? @yastij

@beekhof if 'wedged' kubelet means unresponsive kubelet, I already gave an answer to that in https://github.com/kubernetes/kubernetes/issues/65392#issuecomment-417880760 - if kubelet stays unresponsive then pods should migrate elsewhere anyway.

re: network outage, generally, if that's happening then there are bigger problems to attend to.
But if split brain is the problem case, then it should be solved.

Ideally, the storage driver itself is able to provide the 'safe to detach' information.
A list of cloudproviders and storage drivers that are able to do this would be quite helpful.

In setups where that's not possible yet, a local sidecar can handle it as additional functionality on top of the main external controller: if the local sidecar is not able to speak to the controller + the kubelet is dead + the problem pod is inactive, then force delete / pod self-destruct. The verification parts of the 'inactivity' checking might include some requirements / interfaces to talk to, i.e. at least alpine for pods with volumes. This solution can be greatly improved.

When most cloud providers and storage drivers are able to give the 'safe to detach' status, safe detaching logic should be easier to be integrated into k8s/CP code itself.

@NickrenREN @msau42 As I see it, there is much confusion about the shutdown taint, and as per https://github.com/kubernetes/kubernetes/issues/58635#issuecomment-397771924,
InstanceShutdownByProviderID is not yet implemented in most cloud providers; will the taint really make it into 1.12?

An external controller can work for unavailable kubelets regardless of the shutdown taint status (which would make things easier but still needs time and work to be properly implemented IIUC).

@nikopen https://github.com/kubernetes/kubernetes/pull/67977#issuecomment-416836342 it does exist nowadays in quite a few.

azure https://github.com/kubernetes/kubernetes/pull/68033

GCE is missing, but the other big players do have it.

if kubelet stays unresponsive then pods should migrate elsewhere anyway.

@nikopen but how would you arrange that or confirm it took place without being able to talk to the kubelet?

In most cases k8s can spin up replacement Pods regardless, but StatefulSets and Pods with persistent volumes must wait until either kubelet comes back or something somehow confirms that it is safe. Otherwise there is a very real risk of corruption.

@beekhof If we know that a node is shut down (tainted), then it is always safe to detach, though the kubelet is not responding. The unresponsive kubelet (while the node is not tainted as shut down) is a different case that needs a more sophisticated solution. IMHO these two cases should be fixed separately, so the first is not blocked by the second.

Unresponsive kubelet is already handled by the node controller, which evicts pods (not force delete) after 5 minutes.

I don't think node shutdown can be 100% guaranteed to be safe to detach a volume, which is why I think force deleting a pod after a short timeout is unsafe. Node shutdown is only a temporary state. What if the node is just rebooting and comes back up quickly?

@msau42 @beekhof @verult

This issue needs to be of priority/critical-urgent to remain in the 1.12 milestone. Can you please verify and triage?

cc @childsb @saad-ali

Unresponsive kubelet is already handled by the node controller, which evicts pods (not force delete) after 5 minutes.
@msau42 if the kubelet is not responsive, how can pods be evicted from the node so that they don't continue writing to the storage and corrupt it if it is used by another pod on another node?

I don't think node shutdown can be 100% guaranteed to be safe to detach a volume, which is why I think force deleting a pod after a short timeout is unsafe. Node shutdown is only a temporary state. What if the node is just rebooting and comes back up quickly?
A node shutdown shouldn't release the PV/PVC. A node known to have been shut down should allow force deleting the pod, releasing the storage?

@msau42 differentiate between shutdown and reboot on the cloudprovider level, i.e. go the cloudprovider way.

Give the shutdown taint only when a VM has been issued a clear 'shutdown' signal
or
Poll the VM's status for 1m. It shouldn't be in 'Shutdown state' for more than that when rebooting.

A controller observing the shutdown taint then issues a node drain, triggering the attach/detach controller as well. This could be in the node controller.

The shutdown taint feature by definition polls external logic, e.g. cloud provider, to work. It only works wherever 'InstanceShutdownByProviderID' is implemented - atm the shutdown taint works only on AWS and OpenStack.

It might be quite hard to conceive a safe catch-all way for both bare metal and any cloudproviders alike.

@guineveresaenger this is not planned to be resolved in 1.12

@nikopen what I mean is that you will always have a window where the node can come back up from a shutdown state. Not just a reboot, but the user can also power it back on. So whatever timeout value we choose won't be race free.

To clarify, here is a sequence that demonstrates the race with force deleting the pod:

  1. VM gets shutdown
  2. Shutdown taint gets added to node.
  3. Controller sees the shutdown taint on the node.
  4. VM comes back online and starts running pods again
  5. Controller force deletes pods on nodes
  6. AD controller detaches volumes while container still has them mounted.

@msau42 would you mind removing it from the milestone? I can do it but don't really want to go over anyone's head. :)

I think it should be captured in release notes of some sort. I am not sure what the best way of doing this is, but before this release the default behaviour on OpenStack, vSphere and AWS was to detach volumes from shutdown nodes; after this release that will no longer be the case. Detach in GCE PD never worked, but was masked by GCE's deployment strategy. Those who use kops to deploy on AWS may also not see this bug, because kops deploys everything in ASGs and a shutdown node is automatically terminated and replaced.

@guineveresaenger i don't have milestone permissions

let's see if I do... ;)
/milestone clear

@guineveresaenger: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

let's see if I do... ;)
/milestone clear

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

nope, I don't.

cc @saad-ali, @childsb, @tpepper - could you please remove from milestone?

@msau42 Please correct me if I'm mistaken, but the node controller evicting Pods in response to an unresponsive kubelet doesn't actually result in the Pods terminating and/or their Volumes becoming safe to mount elsewhere until after the kubelet (or its replacement) comes back. Right? So I would disagree that that use case is handled.

@adampl No objection to treating both separately, although it sounds like the result will still be racy. So maybe it's an interim approach only.

@guineveresaenger, @gnufied Agree that this shouldn't hold up a release. IMHO this is an effort that will take serious time and effort to resolve completely.

To clarify, here is a sequence that demonstrates the race with force deleting the pod:
VM gets shutdown
Shutdown taint gets added to node.
Controller sees the shutdown taint on the node.
VM comes back online and starts running pods again
Controller force deletes pods on nodes
AD controller detaches volumes while container still has them mounted.

@msau42 to avoid the race, the controller should not remove the shutdown taint until after the force delete, and the node should not start pods while it has the shutdown taint?

@beekhof correct, the pod eviction is not a force delete of a pod, so A/D controller will not try to detach because the Pod object won't be updated. I think in either case, unresponsive kubelet or node shutdown, you need to force delete pods. The difference is that in an unresponsive kubelet case, we really can't tell if it's safe to detach, and in a node shutdown, you can have a better chance of guessing right.

@itamarh this requires a change to the kubelet to look for the shutdown taint. It also assumes that the controller adding/removing the taint is the same one that is force deleting pods. There may also be timing issues between Pod and Node API object syncing that need to be investigated.

If this issue is to be removed from 1.12, then please at least un-hold #67977 and add there the logic to prevent the race condition when the pods are restarted after shutdown.

BTW, in case of RWO volumes of types (drivers) that don't support RWX mode, the double mount would always be rejected, so the race would be harmless, right? I guess the problem lies only in types that do support RWX, as they don't reject unintentional double mounts of RWO volumes (unless they are somehow aware of the desired mode too, which I doubt).

/milestone clear

@adampl the AccessModes (RWO and RWX) will actually not affect the mount operations. They were originally designed for PVC/PV matching, and are then used for volume attaching (not for volumeMounts).
We need to clean up the AccessModes stuff too; I sent out an issue before (https://github.com/kubernetes/kubernetes/issues/67936) and will start working on it soon.

@msau42 @jingxu97 - One of the solutions would be to differentiate between a shutdown and a reboot. IIRC cloud providers differentiate reboots from shutdowns and trigger the eviction logic only if it is a real shutdown. This could solve the issue of the node coming back. Then we could mark the pod as stale using PodConditions and mount volumes only for non-stale pods.

Indeed, as I mentioned above.

i.e.: clear pods before actually shutting down, in a graceful way, ensuring pods have been deleted.

@nikopen For planned shutdown the graceful way is kubectl drain.

The problem is with an unresponsive node or kubelet. Even draining can fail in the middle of the process if the node/kubelet unexpectedly crashes or loses its connection, so the cluster must be able to isolate the failed node (AKA fencing). IMHO it should be like:

  1. cordon the node
  2. initiate shutdown (if it's running)
  3. check shutdown
  4. force delete all pods

But then there must be a way to do this on bare-metal nodes too.
So we need an API for external services managing the power status of bare-metal nodes.
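For reference, a small client-go sketch of the cordon step from the sequence above; the shutdown and power-status verification steps would need an out-of-band power-management API that Kubernetes does not define today, so they are left out.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordonNode marks the node unschedulable, the same effect as `kubectl cordon <name>`.
// This covers only the first step of the fencing sequence sketched above; shutting
// the machine down and verifying it is powered off has to happen out of band.
func cordonNode(ctx context.Context, client kubernetes.Interface, name string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Spec.Unschedulable {
		return nil // already cordoned
	}
	node.Spec.Unschedulable = true
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```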

Agree with @adampl

A fencing mechanism is needed here. I talked with @jingxu97, @rootfs and @msau42 offline; we need to put it outside Kubernetes, since it is hard to completely fix the race conditions across Kubernetes components and the potential change may break Kubernetes a lot.

To clarify, we care not only about node shutdown, but also about the Unknown condition, since it may cause split brain and double writing. And we care not only about cloud provider node instances, but also about bare metal nodes.

The implementation can start out simple, and we have some database customers here who need this too.
We will discuss and prototype this, and they can provide a large database instance environment to verify the implementation. Will send out the proposal when finished.

/cc @jingxu97 @msau42 @orainxiong

/cc @ddysher

@NickrenREN I think fencing can be done inside of Kubernetes; I previously posted a link to an informal design.

One of the items on my TODO list is to incorporate the updated design into the existing proposal but I would welcome anyone interested in collaborating to get in touch or suggest changes before then.

/cc @jingxu97 @msau42 @orainxiong @adampl @ddysher

Hi @jingxu97, I understood from @saad-ali that you are going to handle this PR for the 1.13 release.
I would also like to draw your attention to the complementary issues https://github.com/kubernetes/kubernetes/issues/68086 and https://github.com/kubernetes/kubernetes/issues/69168, which are related to the same subject.

Hey @jingxu97 @saad-ali @msau42 ,

as I'm keeping track of issues for the release team of 1.13,

is there an estimate whether this and #68086, #69168 will make it into v1.13?

hi @nikopen, do we have any decision on the solution and release plan for this?
Thanks.

I believe @jingxu97 has a proposal she put together for addressing this. It requires changes on the node side, so ideally we need sig-node to pick it up.

/milestone v1.14

@jingxu97 - can we sync around this proposal? Any KEP out there yet?

@jingxu97, some other folks and I are talking about the node fencing implementation offline; will send the proposal out when ready.

+1 on fencing. If it's split into the two categories of physical & external, we can have nodes submit their STONITH details on admission; otherwise, the cloud provider (or a STONITH provider?) can provide resolution to requests for fencing.

Hi @jingxu97 @saad-ali , code freeze for 1.14 is coming in 10 days, if this won't be ready in the next 2-3 weeks feel free to move it to the v1.15 milestone.

@nikopen - the enhancement kubernetes/enhancements#719 was moved from 1.14, this issue should be also moved.

ok!

/milestone v1.15

Hi @yastij, how is this issue going to be solved in 1.15? By having the new fencing mechanism, or by just manually tagging the taint for eviction?

Thanks

@shay-berman - we will rely on observing the taint and adding some changes to the kubelet to make it safe to detach volumes from nodes.

Thanks for the update.
So after the fix, will the user have to do something manually in order to recover the stateful pods that are stuck on the crashed worker node,
or will it be an automatic decision by k8s?

In cloud provider mode, there will be no need for the user to operate manually. For on-premise setups, users will need to either:

  • provide an external component that implements this logic, or
  • perform the tainting/untainting operation manually.

Hi! We are starting the code freeze for 1.15 tomorrow EOD. Just checking in to see if this issue is still planned for the 1.15 cycle?

@timmycarr - this is not landing in 1.15

/milestone clear

to be revived when new proposal is out

This issue has been punted multiple times. In reality, it hurts many users. Pods being stuck in unavailable state for 6+ mins is not a sustainable solution for keeping the production workloads highly available. I understand that the final solution most likely won't be simple, and would require more design discussions. But do you think a smaller fix by making ReconcilerMaxWaitForUnmountDuration configurable is doable till the final resolution is in place? It would help users a lot.

cc @shay-berman @liggitt

@alena1108 thanks for sharing your concern. Just want to clarify your scenario: do your pods belong to a StatefulSet, a Deployment, or something else? Are your pods being deleted after being stuck unavailable for a few minutes? Thanks!

@jingxu97 I've seen it happening for Deployments on AWS (EBS volumes) and vSphere setups for our users. When the node is shut down and becomes unavailable, the volumes linger on the host for 6 minutes, and the new pod can't be started on a healthy node till the old volume is detached.

StatefulSets can be affected by this behavior as well. We have documented the details here.

@alena1108 can you describe the scenarios where node is shutdown without doing a cordon and drain first?

@msau42 when a node gets shut down / goes down on the provider side for various reasons - a very basic use case, where the admin had no intent of removing it from k8s and going through the proper cordon/drain process.

Can you be more specific about what kinds of "various reasons" there are? It's not obvious to me and I want to understand this basic use case.

@msau42 I can only speak for my own experience here, but I've seen behavior like this when the underlying hardware for a VM running a node fails. The cluster admin never has the opportunity to cordon/drain the node first in that scenario.

This prevents the volume from being reattached to another node even if the cloud provider knows that the instance is powered down (and therefore that the volume definitely isn't in active use on that node) as to my knowledge there isn't a node taint that lets us signal this status to k8s.

It's less of an issue if you have node autoscaling available (as the volume can be attached to another node as soon as the failed node is deleted), but when that's not an option, as far as I know any pods with attached PVs can't be rescheduled until the node comes back online or is removed from the cluster entirely. (It's been a while since I've run into this myself and looked into it though, if that's no longer the case I'd be glad to be wrong about this!)

There's ongoing work on a new solution to this problem: https://github.com/kubernetes/enhancements/pull/1116

However, I'm still unsure how it will work with non-cloud clusters.

@msau42 Machines sometimes fail...

@adampl thanks for the link. Resolving the issue one way or another, even if limited to cloud provider nodes only, would be great. Do you know what k8s release it's targeted for? cc @smarterclayton

@alena1108 - hopefully 1.16

@NickrenREN Hi, any info about the proposal for node fencing on cloud and BM?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale
/lifecycle frozen
