Origin: openshift node becomes dysfunctional because of devicemapper issues with docker

Created on 12 Jun 2017 · 19 comments · Source: openshift/origin

openshift node becomes dysfunctional because of devicemapper issues. Happens on a running cluster, where the node was previously functioning well.

See issue: https://github.com/moby/moby/issues/23089

Version

openshift v3.6.94
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Steps To Reproduce
  1. On a large cluster (100+ nodes), create several thousand pods at once (4000)
  2. See that some nodes have trouble creating pods
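The reproduction load can be generated with something like the following sketch. The project names, pod names, and pause image are illustrative assumptions, not taken from the original report:

```shell
#!/bin/sh
# Hypothetical load generator: create ~4000 pause pods (40 projects x 100 pods)
# in quick succession against a large cluster.
for i in $(seq 1 40); do
    oc new-project "load-test-$i" >/dev/null
    for j in $(seq 1 100); do
        oc run "pause-$i-$j" --image=gcr.io/google_containers/pause-amd64:3.0 \
            --restart=Never -n "load-test-$i" &
    done
    wait    # let one project's batch land before starting the next
done
```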
Current Result

Node not ready

Origin node gives this error in the log:

 19246 kuberuntime_gc.go:123] Failed to remove container "f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32": rpc error: code = 2 desc = failed to remove container "f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32: failed to remove device d6a003f88219386bd95462cfc7e78aecde7cd605fb358ff17fd2e1fa2731fa90:devicemapper: Can not set cookie: dm_task_set_cookie failed"}
Expected Result

Node should be ready and pods should run

Additional Information

Docker daemon restart gives this error:

[openshift@b10-h06-r620 ~]$ sudo systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
[openshift@b10-h06-r620 ~]$ journalctl -flu docker
-- Logs begin at Sun 2017-06-11 22:41:55 UTC. --
Jun 12 17:59:59 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service failed.
Jun 12 18:31:14 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Starting Docker Application Container Engine...
Jun 12 18:31:14 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:14.688724739Z" level=info msg="libcontainerd: new containerd process, pid: 14832"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.708582348Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.715165962Z" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.715998695Z" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Failed to start Docker Application Container Engine.
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Unit docker.service entered failed state.
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service failed.
component/containers kind/bug lifecycle/rotten priority/P2

Most helpful comment

Two workarounds so far (both may be needed):

  • The semaphore leak quickly exhausts the system's default limit of 128 semaphore arrays. Raise the default to 8192:
printf 'kernel.sem = 250\t32000\t32\t8192\n' | sudo tee /etc/sysctl.d/99-kernelsem.conf
sudo sysctl --system

This keeps the issue at bay in future.

  • And if you do hit the issue anyway, clean up the system:
sudo systemctl stop origin-node
sudo systemctl stop docker
sudo dmsetup remove_all
sudo dmsetup -y udevcomplete_all
sudo systemctl start docker
sudo systemctl start origin-node

All 19 comments

@rhvgoyal FYI -- I know you are active on the upstream issue

Related devicemapper issue: https://bugzilla.redhat.com/show_bug.cgi?id=1461370#c6

Pod creation failed with error:
Error response from daemon: {"message":"devmapper: Error activating devmapper device for '1b87bd5a693a2b9fef9008e122b07237d63ed69d913111063c85adf99c5c912e-init': devicemapper: Can't set cookie dm_task_set_cookie failed"}

Two workarounds so far (both may be needed):

  • The semaphore leak quickly exhausts the system's default limit of 128 semaphore arrays. Raise the default to 8192:
printf 'kernel.sem = 250\t32000\t32\t8192\n' | sudo tee /etc/sysctl.d/99-kernelsem.conf
sudo sysctl --system

This keeps the issue at bay in future.

  • And if you do hit the issue anyway, clean up the system:
sudo systemctl stop origin-node
sudo systemctl stop docker
sudo dmsetup remove_all
sudo dmsetup -y udevcomplete_all
sudo systemctl start docker
sudo systemctl start origin-node
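For context on the first workaround: the four `kernel.sem` fields are SEMMSL (max semaphores per array), SEMMNS (system-wide semaphore total), SEMOPM (max operations per `semop` call), and SEMMNI (max number of semaphore arrays); it is SEMMNI, default 128, that the leaked udev cookies exhaust. The new limits can be confirmed with:

```shell
# Show the live SysV semaphore limits; the fields are
# SEMMSL  SEMMNS  SEMOPM  SEMMNI (the cookie leak exhausts SEMMNI).
cat /proc/sys/kernel/sem
ipcs -ls    # same limits, human-readable
```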

I updated original issue. We suspect that a recent commit in docker caused this cookie/semaphore leak issue.

@nhorman is looking into writing a patch for it now.

Check if following PR fixes the issue.

https://github.com/moby/moby/pull/33732

Hi, I'm trying to help someone who is hitting a similar issue with Docker 1.12.6; the references to the Docker issue and PR all seem to point at recent changes, but you wouldn't be using such a recent Docker with Kubernetes, would you?

sudo printf 'kernel.sem = 250\t32000\t32\t8192\n' > /etc/sysctl.d/99-kernelsem.conf
sudo sysctl --system

This works for me, since it permanently keeps the semaphore limit at 8192. One can issue the command "ipcs -su" to see how many semaphores are in use, but how can I know which process is using these semaphores? Can anyone help me? Thank you.
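On the question above: `ipcs` cannot name the owning process directly, but `ipcs -s -i <semid>` shows the PID of the last process that operated on each semaphore array, which can then be looked up under `/proc`. A rough sketch, assuming the util-linux `ipcs` output format (for leaked devicemapper cookies the creating process has usually already exited):

```shell
#!/bin/sh
# For each SysV semaphore array, print its id plus the command name of the
# last process that touched it (often already gone for leaked dm cookies).
ipcs -s | awk '$2 ~ /^[0-9]+$/ {print $2}' | while read -r semid; do
    # In `ipcs -s -i` output, the per-semaphore table's last column is the PID.
    pid=$(ipcs -s -i "$semid" | awk 'NF && $1 ~ /^[0-9]+$/ {print $NF; exit}')
    comm=$(cat "/proc/$pid/comm" 2>/dev/null || echo "<exited>")
    echo "semid=$semid lastpid=$pid comm=$comm"
done
```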

We have backported the fix for this in projectatomic/docker as well, so please take the latest docker build from your source and it should have the fix.

https://github.com/projectatomic/docker/pull/256
https://github.com/projectatomic/docker/pull/255
https://github.com/projectatomic/docker/pull/254

We are using docker-1.12.6-28.git1398f24.el7.centos.x86_64.rpm and we are seeing the following issues after a docker version upgrade.

Jun 29 18:49:13 openshift-master-01 systemd[1]: Starting Docker Application Container Engine...
Jun 29 18:49:13 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:13.924855449Z" level=info msg="libcontainerd: new containerd process, pid: 124068"
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946117449Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use --storage-opt dm.thinpooldev or use man docker to re...hinpooldev section."
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946590822Z" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set co...k_set_cookie failed"
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946758304Z" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie ...k_set_cookie failed"
Jun 29 18:49:14 openshift-master-01 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jun 29 18:49:14 openshift-master-01 systemd[1]: Failed to start Docker Application Container Engine.
Jun 29 18:49:14 openshift-master-01 systemd[1]: Unit docker.service entered failed state.
Jun 29 18:49:14 openshift-master-01 systemd[1]: docker.service failed.

It looks like this is fixed in the upcoming docker-1.12.6-32.git88a4867.el7 version

An alternative appears to be downgrade/remain at docker-1.12.6-16.el7 according to the Bugzilla report.

Edit:
Now available in Centos mirror:
http://mirror.centos.org/centos/7/extras/x86_64/Packages/docker-1.12.6-32.git88a4867.el7.centos.x86_64.rpm

I'm afraid the issue is not fixed in the docker version mentioned above:

rpm -qa|grep docker
docker-client-1.12.6-32.git88a4867.el7.centos.x86_64
docker-common-1.12.6-32.git88a4867.el7.centos.x86_64
docker-1.12.6-32.git88a4867.el7.centos.x86_64
origin-docker-excluder-1.4.1-1.el7.noarch

and

openshift version
openshift v1.4.1
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

as I still got hit by it.

It is actually likely PR #33376 that you are after:
https://github.com/moby/moby/pull/33376

@nhorman do you know if that's back-ported to an existing 1.12.6 version?

@xThomo not off the top of my head, no, but it should be pretty easy to go and check; it's a very small patch.

cheers @nhorman !

@xThomo, @nhorman That should have been available since build docker-2:1.12.6-29.git97ba2c0

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

I don't see this issue anymore, plus the workaround steps unblock me anyway. Closing this issue. Please reopen if needed.
