Origin: Pods Stuck "Terminating"

Created on 6 Sep 2017 · 13 comments · Source: openshift/origin

Version

oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth

Server https://console.outtherelabs.com:443
openshift v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7

Steps To Reproduce

Unknown, here is our setup:

  1. Set up a cluster using openshift-ansible and CentOS on AWS.
  2. Run a dozen deploy configs, stateful sets and daemon sets, etc. across a few namespaces.
  3. Attach dynamic PVCs to a bunch of the applications (results in EBS volumes).
Current Result

Most of the pods are working fine, but some pods get stuck in the Terminating state when deleted and never finish terminating. Node logs have entries like this one:

Sep  5 19:17:22 ip-10-0-1-184 origin-node: E0905 19:17:22.043257  112306 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/182285ee-9267-11e7-b7be-06415eb17bbf-default-token-f18hx\" (\"182285ee-9267-11e7-b7be-06415eb17bbf\")" failed. No retries permitted until 2017-09-05 19:17:22.543230782 +0000 UTC (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/182285ee-9267-11e7-b7be-06415eb17bbf-default-token-f18hx" (volume.spec.Name: "default-token-f18hx") pod "182285ee-9267-11e7-b7be-06415eb17bbf" (UID: "182285ee-9267-11e7-b7be-06415eb17bbf") with: remove /var/lib/origin/openshift.local.volumes/pods/182285ee-9267-11e7-b7be-06415eb17bbf/volumes/kubernetes.io~secret/default-token-f18hx: device or resource busy
Expected Result

Pods terminate properly.

Additional Information

That path is not mounted (running mount does not list it) and running fuser -v on that directory does not show anything. Trying to rmdir it fails with a similar error:

sudo rmdir /var/lib/origin/openshift.local.volumes/pods/182285ee-9267-11e7-b7be-06415eb17bbf/volumes/kubernetes.io~secret/default-token-f18hx
rmdir: failed to remove ‘/var/lib/origin/openshift.local.volumes/pods/182285ee-9267-11e7-b7be-06415eb17bbf/volumes/kubernetes.io~secret/default-token-f18hx’: Device or resource busy
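One diagnostic (a sketch, not something suggested in this thread) is to check whether the mount has leaked into another process's mount namespace, which produces exactly this "busy but not mounted" symptom: scan every process's mountinfo for the stuck path. The PROC_ROOT variable is an addition of mine so the function can be exercised against a fixture.

```shell
# scan_mountinfo: print the PID of every process whose mount namespace
# still holds a given path. A leaked mount like this is one cause of
# "rmdir: Device or resource busy" when `mount` shows nothing.
# PROC_ROOT defaults to /proc; it is overridable purely for testing.
scan_mountinfo() {
    stuck_path=$1
    proc_root=${PROC_ROOT:-/proc}
    for mi in "$proc_root"/[0-9]*/mountinfo; do
        [ -f "$mi" ] || continue
        if grep -q "$stuck_path" "$mi" 2>/dev/null; then
            dir=${mi%/mountinfo}
            echo "${dir##*/}"   # PID holding a copy of the mount
        fi
    done
}

# The path from the error message above:
scan_mountinfo /var/lib/origin/openshift.local.volumes/pods/182285ee-9267-11e7-b7be-06415eb17bbf/volumes/kubernetes.io~secret/default-token-f18hx
```

If this prints a PID, that process (often a container that bind-mounted / or /var/lib/docker) is what keeps the secret volume pinned.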

Most helpful comment

Same here. Restarting docker seems to "resolve" the issue.

All 13 comments

Same here. Restarting docker seems to "resolve" the issue.

Same here. On RHEL Atomic Host, restarting docker clears up pods stuck in terminating.

Subhendu


I'm also getting logs like this:

Unable to mount volumes for pod "podname-89a37690-0cltf_namespace(a6d1d6cb-9307-11e7-b7be-06415eb17bbf)": timeout expired waiting for volumes to attach/mount for pod "podname"/"podname-89a37690-0cltf". list of unattached/unmounted volumes=[default-token-lplrt]

but the pods eventually come up. I don't know if that is related.

OK I think I figured out what was causing it.

We had set up Prometheus with https://github.com/prometheus/node_exporter

I saw this Docker troubleshooting page https://docs.docker.com/engine/admin/troubleshooting_volume_errors/ which mentions containers using statfs causing issues. The Prometheus exporter seems to be doing that.

Disabling the Prometheus exporter caused my stuck pods to terminate correctly.

I am going to try again with https://github.com/openshift/origin/pull/16096 and see if that config works for us.

How was node_exporter configured? Were you mounting /var/lib/docker?

@smarterclayton I don't think I was, I blew it out but I believe it was this config: https://coreos.com/assets/blog/promk8s/node-exporter.yaml

There is a known issue where volumes that mounted /var/lib/docker or other paths that can contain mount namespaces could cause the kernel to refuse to unmount / leak the specific namespaces. That's partially why I asked about statfs.

Does it include /var/lib/docker/foo?


I believe it would be any of the subdirectories of /var/lib/docker that have mount data (volumes, images, layers). If any of those are mounted, I think it's possible to leak in some cases.


OK found the actual config I was using: https://github.com/wkulhanek/openshift-prometheus/blob/master/node-exporter/node-exporter.yaml

It doesn't specifically mount /var/lib/docker but it does mount / as /rootfs.

A few other configs specifically add /var/lib/docker to collector.filesystem.ignored-mount-points which might be a good idea.
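As a sketch of that idea (the flag name is node_exporter's filesystem-collector exclusion flag; the exact regex value is an assumption for this setup, not taken from the thread), the exporter's container args in the DaemonSet could look like:

```yaml
# Hypothetical node-exporter container args excluding the docker and
# kubelet volume trees from the filesystem collector. With / mounted
# at /rootfs, the in-container paths get the /rootfs prefix.
args:
  - --collector.filesystem.ignored-mount-points
  - ^/(rootfs/)?(var/lib/docker|var/lib/origin/openshift.local.volumes)($|/)
```

Note that ignoring the mount points only stops the collector from calling statfs on them; it does not help if the pod spec still bind-mounts those trees into the exporter's mount namespace.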

oci-umount is the package we have that attempts to clean up the leaks.

After running the DaemonSet from https://github.com/openshift/origin/pull/16096 for 2 days I have not seen this issue again. I would assume that not mounting / solved it.

Since my issue is resolved I am fine closing this ticket.

Thanks, will continue to watch for it.
