Origin: Ceph : Unable to mount volumes for pod ... : rbd: image ext4-0002 is locked by other nodes

Created on 14 Mar 2016 · 21 comments · Source: openshift/origin

Version

oc v1.1.3
kubernetes v1.2.0-origin

Description

All pods using a ceph persistent storage are in state :
Unable to mount volumes for pod ... : rbd: image ext4-0002 is locked by other nodes

Steps To Reproduce

  1. All pods were running correctly on all nodes
  2. A node crashed during the weekend (don't know why yet)

Current Result

All pods using ceph persistent volumes are in this state.
On the ceph cluster, each rbd has one exclusive lock

Expected Result

Unlock the rbd before mounting it on another node

component/storage lifecycle/rotten priority/P2


All 21 comments

The crashed node probably has a lock on the rbd image. At this time, you have to unlock the image manually if you think no other pod is using it. We'll have a better way to resolve it when we have attach/detach controllers from @swagiaal

@rootfs Does https://github.com/kubernetes/kubernetes/pull/26351 solve this issue?

@spinolacastro not completely. There is some follow-up work around the new attach/detach controller refactor; after that we'll make rbd support these controllers.

I am running kubernetes 1.3.3 and still see this issue. If a compute node goes down, containers that use a ceph volume get stuck in the ContainerCreating state, while all other containers migrate properly to the second compute node without any issue. Manually clearing the lock on the ceph cluster resolves it.

@sbezverk Same thing here on 1.3.3.

For other people encountering this problem, the workaround is to first list the locks:

rbd lock list image_name
There is 1 exclusive lock on this image.
Locker      ID                                                         Address                  
client.1234 kubelet_lock_magic_node-nodename 10.1.2.3:0/1164393387 

And then:

rbd lock remove image_name "kubelet_lock_magic_node-nodename" client.1234

A more dangerous route would be to disable locking via rbd_default_features in the [client] section of ceph.conf, or on a per-image basis:

rbd feature disable image_name exclusive-lock
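
The manual workaround above can also be scripted. Here is a minimal sketch, assuming the `rbd lock list` output format shown above; the image name in the example call is a placeholder, and you should adapt the kubelet lock-id pattern to your environment:

```shell
#!/bin/sh
# Sketch: remove a stale kubelet lock from an rbd image.
# Assumes the "Locker  ID  Address" output format of `rbd lock list`.

remove_stale_lock() {
    image="$1"
    # Skip the two header lines; columns are: locker, id, address.
    rbd lock list "$image" | awk 'NR > 2 { print $2, $1; exit }' | \
    while read -r lock_id locker; do
        case "$lock_id" in
            kubelet_lock_magic_*)
                echo "removing lock $lock_id held by $locker on $image"
                rbd lock remove "$image" "$lock_id" "$locker"
                ;;
        esac
    done
}

# Example (image name is a placeholder):
# remove_stale_lock ext4-0002
```

Only lock IDs matching the kubelet prefix are touched, so a lock legitimately held by something else is left alone.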

@sbezverk @elsonrodriguez I've made a script to remove the locks from a given node.

Take a look:
https://gist.github.com/spinolacastro/03a31455a90665b85ce6175cba6862d6

It must be run on your ceph admin node:

Given the lock:

[azureuser@jump cephcluster]$ rbd lock list 19943
There is 1 exclusive lock on this image.
Locker                ID                                               Address
client.2583741 kubelet_lock_magic_nodebr6 10.0.2.10:0/1041159

To release the locks from all images on nodebr5:

./client.py ceph.conf ceph.client.admin.keyring nodebr5
Image 19943 locked by client.2583741 kubelet_lock_magic_nodebr6 10.0.2.10:0/1041159, releasing...
Image 20618 locked by client.2592895 kubelet_lock_magic_nodebr6 10.0.2.10:0/1118475, releasing...
Image 22406 locked by client.2600403 kubelet_lock_magic_nodebr6 10.0.2.10:0/1069391, releasing...

Let me know if it worked.

@rootfs Should this issue be opened in the kubernetes repo as well?

Hi all,

I have read through the code of the rbd volume plugin. It executes an rbd lock command to take a lock on the image and releases it when unmounting the image. The problem starts if it fails to do so on shutdown / reboot.

I solved the issue for me by doing two things:

  • First, I made sure that every service required to access the ceph cluster is started before the kubelet, and stopped only after the kubelet has finished.
  • Second, I drain the node on shutdown and then give the kubelet time to actually unmount everything. On startup of the node I do the reverse by uncordoning it.
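
The drain-on-shutdown / uncordon-on-boot routine described above might be wired up roughly like this. This is a sketch, not the commenter's actual setup: the node name and the grace period are assumptions, and the hooks would be invoked from whatever service manager you use:

```shell
#!/bin/sh
# Sketch of the shutdown/startup hooks described above.
# NODE and the sleep duration are placeholder assumptions.
NODE="$(hostname)"

pre_shutdown() {
    # Evict pods so the kubelet can unmount volumes and release rbd locks
    # while it can still reach the ceph cluster.
    kubectl drain "$NODE" --force --ignore-daemonsets
    # Give the kubelet time to finish unmounting before services stop.
    sleep 30
}

post_boot() {
    # Make the node schedulable again once it is back up.
    kubectl uncordon "$NODE"
}

case "$1" in
    stop)  pre_shutdown ;;
    start) post_boot ;;
esac
```

The ordering matters: the drain must complete, and the ceph-facing services must still be running, before the kubelet stops.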

I don't think this is actually an implementation issue of the rbd volume plugin, because I think it is the right thing to acquire the lock and release it. If you don't, you risk having the rbd mounted twice and ending up with corrupted data if two processes write to it.

Hi

There is a solution discussed on the ceph mailing list. When you start the pod and the rbd volume is locked, there are two possible reasons. One is that you are trying to mount the volume twice, which can lead to disaster (multiple pods writing to one block-level device is a really bad idea). The other is that the locker crashed. On a bare metal deployment this can be common: a hardware failure will lead to a crash and the rbd volume left locked. If there is no mechanism to deal with this situation, you lose HA, which is also bad (now a single hardware failure can lead to a service outage even though you have plenty of perfectly working machines that could take the load).

The solution proposed on the ceph mailing list ( http://www.spinics.net/lists/ceph-devel/msg08894.html ) was to break the lock and blacklist the client that held it. This way it is possible to have a consistent working state all the time.
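
Translated into commands, the break-and-blacklist approach might look like the sketch below. The argument values in the example are the placeholder lock details from the earlier rbd lock list output, not real ones, and you should verify the exact blacklist syntax against your ceph version:

```shell
#!/bin/sh
# Sketch of the break-the-lock-and-blacklist approach from the
# ceph-devel thread. All argument values are placeholders.

break_and_blacklist() {
    image="$1"; lock_id="$2"; locker="$3"; addr="$4"
    # Break the stale lock held by the dead client ...
    rbd lock remove "$image" "$lock_id" "$locker"
    # ... and blacklist its address so it cannot write if it comes back.
    ceph osd blacklist add "$addr"
}

# Example with the placeholder lock from the listing above:
# break_and_blacklist ext4-0002 kubelet_lock_magic_node-nodename \
#     client.1234 10.1.2.3:0/1164393387
```

The blacklist step is what makes the takeover safe: without it, a half-dead node could come back and keep writing to an image that is now mounted elsewhere.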

There is still one problematic situation though: rbd is mounted on the host and then mapped into the container. We can still have multiple containers on a single host accessing a single volume, and that can't be solved by rbd locking.

In the case of the Kubernetes rbd volume plugin, the locking is done explicitly by the plugin and is not something ceph does implicitly. It is done in the rbdLock function here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/rbd/rbd_util.go#L98

This function is also responsible for removing the lock. If this is not done properly on kubelet shutdown / pod failure, then the next node that calls the lock function gets the error message this thread is about, generated here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/rbd/rbd_util.go#L269

What I did to solve this was, as I said, draining the node on shutdown / reboot, giving the kubelet some time to unlock the volume, and making sure that it can still reach the ceph cluster at that point in time.

Right now I am running a test script that, every hour, finds the node on which a pod with an rbd attached is running and reboots it. It is currently in its eleventh iteration and the pod has moved 11 times to another node without problems.
I intend to keep it running for a few days to gain confidence, but for now it seems to be stable.

@cornelius-keller that assumes the node reboots correctly - what happens when the node fails? Or starts to misbehave in such a way that no clean drain of the node is possible?

@pavels you are right, the current code of the rbd plugin assumes that unmounting / unlocking happens and has no means to recover when it doesn't. It would be great to have this recovery code in the plugin somehow. For me it was important to understand that this is not a ceph / rbd related issue: it is the plugin creating the lock when mounting read-write and not recovering when the lock was not removed correctly at shutdown.

For me it was important to understand the reason for the problem and to find a solution for the most frequent cases. In my case I had the problem even during regular reboots, before draining the node etc.
Understanding that already solves the regular reboot / maintenance scenario, which is a big part of the problem.

The disaster / node-failure problem is not solved by my approach yet.
I intend to protect against it by, for example, monitoring that alerts me if a pod stays in the "ContainerCreating" state for a certain amount of time. I was also thinking about a script that removes all locks still held by this host before it starts the kubelet. Yet this still will not cover the case where the node does not come up again at all ...
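
The pre-start cleanup idea mentioned above could be sketched like this. The pool name and the kubelet lock-id pattern are assumptions, and the commenter did not share an actual script:

```shell
#!/bin/sh
# Sketch: before starting the kubelet, drop any rbd locks this host
# still holds from a previous unclean shutdown.
# POOL and the lock-id naming convention are assumptions.
POOL="rbd"
MYLOCK="kubelet_lock_magic_$(hostname)"

release_stale_locks() {
    rbd ls "$POOL" | while read -r image; do
        # Columns after the two header lines are: locker, id, address.
        rbd lock list "$POOL/$image" | \
            awk -v id="$MYLOCK" 'NR > 2 && $2 == id { print $1 }' | \
        while read -r locker; do
            echo "releasing $MYLOCK on $POOL/$image (held by $locker)"
            rbd lock remove "$POOL/$image" "$MYLOCK" "$locker"
        done
    done
}
```

Run from a pre-start hook of the kubelet service, this only removes locks whose ID matches this host's own kubelet lock name, so locks held by other nodes are never touched.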

@cornelius-keller I actually implemented a patch for the docker rbd plugin to do lock takeover and tested it; it worked well ( https://github.com/porcupie/rbd-docker-plugin/pull/9 )

https://github.com/kubernetes/kubernetes would probably be a better place to take this forward, as this is not openshift specific either.

@pavels Great stuff, looking forward to seeing this solved inside the plugin. Still, I think that caring about a proper shutdown is not a bad idea, because it also takes Kubernetes some time to determine that a node is dead and its pods should be scheduled elsewhere. With draining this happens much faster.

@rootfs you mentioned the "attach/detach controller refactor" back in June. Are there any updates/issues/pull requests related to this?

I feel like the priority level of this issue should be escalated. Ceph essentially becomes broken in production for containers once you account for hardware failures.

Yes, this is being worked on and will hopefully land in kubernetes 1.5. There are certain interface issues we have to resolve before attach/detach is available for rbd.

The attach/detach interface splits the attaching/detaching of a volume away from the node: the master is responsible for device attach and detach, while the node handles device mount and unmount. In the rbd case, the lock and unlock will move to the controller. This should solve the locking issue when a node is unavailable.

@kubernetes/sig-storage

rbd lock list tenx-pool/qinzhao.CID-ca4135da3326.srv-gitlab-config
2017-06-06 20:10:52.594971 7fb54be29700 0 -- :/9849940 >> 10.39.0.115:6789/0 pipe(0x2fb5550 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x2fae8e0).fault

no locker
ceph -v
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
