Created a lot of pods (1000 across 2 nodes, approx 500 per node). Then deleted the namespace:
# oc delete ns clusterproject0
Deletion clearly started, since only 319 of the 1000 pods remain, but now it refuses to go any further. Environment state and logs are below.
I haven't seen this before -- last similar run was on 3.2.0.1, though I only went to 250 pods per node at that code level (things worked ok).
# openshift version
openshift v3.2.0.5
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5
# oc get no
NAME STATUS AGE
dell-r620-01.perf.lab.eng.rdu.redhat.com Ready,SchedulingDisabled 19d
dell-r730-01.perf.lab.eng.rdu.redhat.com Ready 19d
dell-r730-02.perf.lab.eng.rdu.redhat.com Ready 19d
# oc get ns
NAME STATUS AGE
clusterproject0 Terminating 1d
default Active 19d
management-infra Active 19d
openshift Active 19d
openshift-infra Active 19d
# oc delete ns clusterproject0
Error from server: namespaces "clusterproject0" cannot be updated: The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.
All of the stuck-in-terminating pods were scheduled and running on one of the nodes. The other node was able to successfully terminate all of the pods that were running there.
From the master:
Mar 21 12:18:36 dell-r620-01.perf.lab.eng.rdu.redhat.com atomic-openshift-master[6683]: E0321 12:18:36.451332 6683 namespace_controller.go:139] unexpected items still remain in namespace: clusterproject0 for gvr: { v1 pods}
Mar 21 12:18:37 dell-r620-01.perf.lab.eng.rdu.redhat.com atomic-openshift-master[6683]: W0321 12:18:37.223252 6683 reflector.go:289] /usr/lib/golang/src/runtime/asm_amd64.s:2232: watch of *api.ServiceAccount ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [24072046/23943693]) [24073045]
From the node that can't terminate its pods:
Mar 21 12:22:49 dell-r730-01.perf.lab.eng.rdu.redhat.com atomic-openshift-node[7438]: W0321 12:22:49.119784 7438 kubelet.go:1850] Unable to retrieve pull secret clusterproject0/default-dockercfg-woua1 for clusterproject0/hellopods505 due to secrets "default-dockercfg-woua1" not found. The image pull may not succeed.
@derekwaynecarr @liggitt
Attached is a stacktrace taken from the wedged node, as requested by @ncdc
# curl -o stack.txt http://localhost:6060/debug/pprof/goroutine?debug=2
The nodes reported healthy as well, so the node controller is not going to do anything...
what grace period is set for pods when deleting them as part of a namespace deletion?
@liggitt - none. we call delete collection, which in turn calls delete, which will pull the grace period from the strategy. https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/etcd/etcd.go#L398
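For readers not familiar with that code path, here is a rough sketch of its shape (illustrative types only, not the real apiserver registry code): delete-collection issues per-item deletes, and each delete falls back to the strategy's default grace period when the caller supplies none.

```go
package main

import "fmt"

// Illustrative only: a "strategy" that knows the default grace period for a resource.
type restStrategy interface {
	DefaultGracePeriodSeconds() int64
}

type podStrategy struct{}

// Pods default to a 30-second termination grace period.
func (podStrategy) DefaultGracePeriodSeconds() int64 { return 30 }

// deleteItem mimics a single graceful delete: if the caller passed no grace
// period, fall back to whatever the strategy says.
func deleteItem(name string, requested *int64, s restStrategy) {
	grace := s.DefaultGracePeriodSeconds()
	if requested != nil {
		grace = *requested
	}
	fmt.Printf("deleting %s with grace period %ds\n", name, grace)
}

// deleteCollection mimics namespace cleanup: no explicit grace period is
// passed down, so every pod gets the strategy default (30s), not 0.
func deleteCollection(items []string, s restStrategy) {
	for _, name := range items {
		deleteItem(name, nil, s)
	}
}

func main() {
	deleteCollection([]string{"hellopods505", "hellopods506"}, podStrategy{})
}
```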
so a stuck pod can block namespace deletion forever?
For more context, see the long discussion here: https://github.com/kubernetes/kubernetes/issues/20497
@jeremyeder did the pod use a pvc with pv?
@derekwaynecarr no, there were no PVs in this environment at all. The pods simply run hello-openshift.
I had to stop docker on the node with all the stuck pods, dmsetup remove all the thin volumes there, and reboot everything; then the namespace was able to clean itself up :/
@jeremyeder can you upload the logs for docker and the node somewhere (or maybe the entire journal from just before you deleted the project to a few minutes after)?
This is the journal from the master:
mar20.syslog.txt.zip
This is the journal from the node that got wedged:
mar20.syslog_node.txt.zip
I think the bug is that failure to fetch image pull secrets for a running pod will prevent our ability to sync that pod.
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1786
This will prevent this sync from ever happening:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/manager.go#L1809
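To make the hypothesized failure mode concrete, here is a sketch with simplified types (not the actual kubelet code): if the sync path bails out on a failed secret lookup before the teardown branch, a terminating pod would never be killed.

```go
package main

import (
	"errors"
	"fmt"
)

type pod struct {
	name              string
	deletionRequested bool
}

// syncPodBuggy models the suspected bug: a failed secret lookup aborts the
// sync before the teardown branch is reached, so a terminating pod is never
// killed and the namespace can never finish deleting.
func syncPodBuggy(p pod, secretsOK bool) error {
	if !secretsOK {
		return errors.New("unable to retrieve pull secret") // early exit; teardown below never runs
	}
	if p.deletionRequested {
		fmt.Println("killing pod", p.name)
		return nil
	}
	fmt.Println("ensuring pod is running:", p.name)
	return nil
}

// syncPodTolerant treats missing secrets as best effort and evaluates the
// teardown branch first, so deletion still proceeds.
func syncPodTolerant(p pod, secretsOK bool) error {
	if p.deletionRequested {
		fmt.Println("killing pod", p.name)
		return nil
	}
	if !secretsOK {
		fmt.Println("warning: pull secrets missing; image pull may fail for", p.name)
	}
	fmt.Println("ensuring pod is running:", p.name)
	return nil
}

func main() {
	p := pod{name: "hellopods505", deletionRequested: true}
	fmt.Println("buggy:", syncPodBuggy(p, false))       // returns an error; pod never torn down
	fmt.Println("tolerant:", syncPodTolerant(p, false)) // pod killed despite missing secrets
}
```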
I agree with @derekwaynecarr that this is the problem in this case.
@jeremyeder were the pods actually running prior to trying to delete the namespace on the problematic node? Or were they ImagePullBackOff?
@jeremyeder told me earlier that the pods were running, but I have to say that what I'm seeing in the node logs makes me think that either:
I think we need to try to reproduce this together and we need to poke around and look at both the state of the containers on the node and what docker containers are running.
If we can reproduce, the output of oc get pods and docker ps -a would be helpful
nope, getPullSecretsForPod never returns an error in the kubelet... it's best effort to load all available pull secrets. the log message for that is a red herring.
There were duplicate similar log messages that tripped me up. Thanks @liggitt
innocent! :)
@deads2k famine made me miss the obvious before lunch. you are spared!
@deads2k @liggitt @derekwaynecarr I feel like we should change the signature of getPullSecretsForPod if it is never supposed to return an error -- any objections to doing that as a follow-up?
> any objections to doing that as a follow-up?
Not really. I might have liked "return what we know, plus an error indicating why it may be incomplete", but clearly I didn't like it enough to actually do that.
@pmorie :+1:
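A minimal sketch of the "return what we know, plus an error indicating why it may be incomplete" shape discussed above (hypothetical names, not the actual kubelet signature):

```go
package main

import "fmt"

type secret struct{ name string }

// getPullSecretsForPod (hypothetical shape): best-effort collection of pull
// secrets. It always returns whatever it could load; a non-nil error only
// signals that the result may be incomplete.
func getPullSecretsForPod(names []string, store map[string]secret) ([]secret, error) {
	var found []secret
	var missing []string
	for _, n := range names {
		if s, ok := store[n]; ok {
			found = append(found, s)
		} else {
			missing = append(missing, n)
		}
	}
	if len(missing) > 0 {
		return found, fmt.Errorf("pull secrets may be incomplete, missing: %v", missing)
	}
	return found, nil
}

func main() {
	store := map[string]secret{"default-dockercfg-abc12": {name: "default-dockercfg-abc12"}}
	secrets, err := getPullSecretsForPod([]string{"default-dockercfg-abc12", "default-dockercfg-woua1"}, store)
	if err != nil {
		fmt.Println("warning:", err) // log, but still proceed with the partial result
	}
	fmt.Println("using", len(secrets), "pull secret(s)")
}
```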
Managed to wedge it again, at around the same trigger point: running around 750-800 pods on one node and then deleting the namespace they're in. Logs are available privately (too large to attach now that it's at loglevel=5).
@jeremyeder thanks. will ping in IRC, will be good to have the node log, and a YAML representation of a pod that is wedged.
Of the terminating 'pods', I manually chose one and killed its container and infra container. The pod was then removed from the API server, but the PLEG loop in the kubelet continues to try to start the pod over and over again... I suspect a bug in the main kubelet sync loop.
@derekwaynecarr do you know why the containers were still alive? Or was it the case that the Kubelet was deleting/recreating infinitely?
@ncdc - the kubelet never attempted to delete the container (bug 1). After I manually killed the containers, the pod was removed from the API server (good), but the kubelet continues to try to resurrect the pod (bug 2). I suspect some internal state is messed up; working my way backwards.
The pod worker continues to work on the pod that has been removed from the API server, and gets hung on errors trying to set up the network because the pod information cannot be found...
So the manage-pod loop is getting wedged when we are unable to start the network for a pod, and it keeps requeueing the same item... In general, I think we are reporting an error somewhere that should not be treated as an error, and the pod worker keeps requeueing the same underlying task.
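As a toy illustration of that requeue pattern (not the real pod worker): if a condition that should be terminal, such as "pod no longer exists", is reported as an error, the worker requeues the same item forever.

```go
package main

import (
	"errors"
	"fmt"
)

var errPodGone = errors.New("pod information cannot be found")

// syncOnce pretends to set up networking for a pod; it fails when the pod has
// already been removed from the API server.
func syncOnce(podExists bool) error {
	if !podExists {
		return errPodGone
	}
	return nil
}

// workerLoop requeues on any error. If "pod gone" is treated as an error
// instead of a terminal condition, the same work item spins forever
// (attempts are capped here only for the demo).
func workerLoop(podExists, dropWhenGone bool, maxAttempts int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := syncOnce(podExists)
		if err == nil {
			fmt.Println("synced on attempt", attempt)
			return
		}
		if dropWhenGone && errors.Is(err, errPodGone) {
			fmt.Println("pod removed; dropping work item on attempt", attempt)
			return
		}
		fmt.Println("requeueing after error:", err)
	}
	fmt.Println("still requeueing after", maxAttempts, "attempts (wedged)")
}

func main() {
	workerLoop(false, false, 3) // wedged behavior: endless requeue of the same task
	workerLoop(false, true, 3)  // desired behavior: drop the stale work item
}
```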
RemoveOrphanedStatuses does not appear to be removing orphaned pods based on the logs...
So I restarted atomic-openshift-node on the environment that showed this behavior, and the system then properly cleared all its state and things were good. We need to get more logging in the kubelet to better diagnose why the kubelet's desired state was stale.
To summarize, the root issue as I can gather it:
- the kubelet never sees that the pods in question have a deletion timestamp; as a result, the kubelet never kills those pods
- the reason the kubelet can miss the deletion timestamp is unclear at this time
- the underlying sync loop was doing the right thing when it sees no deletion timestamp on the pod
Should we move this conversation to the Kubernetes issue tracker?
We also hit a similar issue in our environment: when we delete the pods, the command reports them as deleted, but they stay in "Terminating". We created those pods via a job.
openshift version: v3.2.0.7
[root@dhcp-128-70 test]# oc delete pods --all
pod "pi-0djsq" deleted
pod "pi-12lw1" deleted
pod "pi-1hoy8" deleted
pod "pi-1jjf9" deleted
pod "pi-2srbj" deleted
pod "pi-41bs7" deleted
pod "pi-4iead" deleted
pod "pi-4np77" deleted
pod "pi-5rphs" deleted
pod "pi-5y3di" deleted
pod "pi-6cimb" deleted
pod "pi-8432a" deleted
pod "pi-d24nl" deleted
pod "pi-dh8zh" deleted
pod "pi-dkliq" deleted
pod "pi-expdd" deleted
pod "pi-hy2lh" deleted
pod "pi-hz940" deleted
pod "pi-l3hfc" deleted
pod "pi-lhoyw" deleted
pod "pi-w0s4i" deleted
pod "pi-x4x9h" deleted
pod "pi-xflxq" deleted
pod "pi-xiq81" deleted
[root@dhcp-128-70 test]# oc get po
NAME READY STATUS RESTARTS AGE
pi-0djsq 0/1 Terminating 0 32m
pi-12lw1 0/1 Terminating 0 32m
pi-1hoy8 0/1 Terminating 0 32m
pi-1jjf9 0/1 Terminating 0 32m
pi-2srbj 0/1 Terminating 0 32m
pi-41bs7 0/1 Terminating 0 32m
pi-4iead 0/1 Terminating 0 32m
pi-4np77 0/1 Terminating 0 32m
pi-5rphs 0/1 Terminating 0 32m
pi-5y3di 0/1 Terminating 0 32m
pi-6cimb 0/1 Terminating 0 32m
pi-8432a 0/1 Terminating 0 32m
pi-d24nl 0/1 Terminating 0 32m
pi-dh8zh 0/1 Terminating 0 32m
pi-dkliq 0/1 Terminating 0 32m
pi-expdd 0/1 Terminating 0 32m
pi-hy2lh 0/1 Terminating 0 32m
pi-hz940 0/1 Terminating 0 32m
pi-l3hfc 0/1 Terminating 0 32m
pi-lhoyw 0/1 Terminating 0 32m
pi-w0s4i 0/1 Terminating 0 32m
pi-x4x9h 0/1 Terminating 0 32m
pi-xflxq 0/1 Terminating 0 32m
pi-xiq81 0/1 Terminating 0 32m
@mdshuai when you encounter this, are all the pods on the same node?
@derekwaynecarr yes
I spent the majority of my time today trying to reproduce this scenario. If it happens again for folks, please reach out to me with information on how I can debug the system. Most notably, I want to try to schedule new pods to a node that exhibits this problem to see if they ever run, and I want to forcefully delete one of the pods from a wedged node to see if the container is killed on the host. I suspect that when this situation occurs the kubelet is not seeing any watch notifications delivered, but I need to reproduce it reliably to debug further.
I was finally able to reproduce this on a 3-node cluster after many hours of creating 500 pods, waiting for 200 to run, and then tearing down the project. For many hours it was fine. I did notice that the nodes were getting less and less successful at getting the full set of 200 pods into a running state the longer the test ran, but they were always able to properly tear down all the pods.
After much time, I was left in a state where finally 1 of the nodes failed to tear down 3 pods. The node continued to report a valid heartbeat back to the API server, but it would no longer launch new pods that were scheduled to it. I was able to ssh into the machine and do a little more sleuthing.
The kubelet actually did see the notification from the watch source about the pod:
I0331 20:15:30.501759 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0331 20:15:30.823634 3952 manager.go:1688] Need to restart pod infra container for "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)" because it is not found
I0331 20:16:59.570316 3952 kubelet.go:2420] SyncLoop (PLEG): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"f03e60df-f77c-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f"}
I0331 20:23:20.586986 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
Looking at the pod in question:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"e2e-tests-nodestress-8e5yg","name":"test","uid":"ededf40d-f77c-11e5-bed9-080027242396","apiVersion":"v1","resourceVersion":"18868"}}
  creationTimestamp: 2016-03-31T20:12:48Z
  deletionGracePeriodSeconds: 30
  deletionTimestamp: 2016-03-31T20:23:50Z
  generateName: test-
  labels:
    name: test
  name: test-034xr
  namespace: e2e-tests-nodestress-8e5yg
  resourceVersion: "19910"
  selfLink: /api/v1/namespaces/e2e-tests-nodestress-8e5yg/pods/test-034xr
  uid: f03e60df-f77c-11e5-bed9-080027242396
spec:
  containers:
  - image: openshift/hello-openshift
    imagePullPolicy: Always
    name: test
    resources: {}
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-4s8wq
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: kubernetes-node-1
  restartPolicy: Always
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: default-token-4s8wq
    secret:
      secretName: default-token-4s8wq
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2016-03-31T20:15:30Z
    message: 'containers with unready status: [test]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  containerStatuses:
  - image: openshift/hello-openshift
    imageID: ""
    lastState: {}
    name: test
    ready: false
    restartCount: 0
    state:
      waiting:
        message: 'Image: openshift/hello-openshift is ready, container is creating'
        reason: ContainerCreating
  hostIP: 10.245.1.3
  phase: Pending
  startTime: 2016-03-31T20:15:30Z
You can see it was stuck in the Waiting state with reason ContainerCreating.
Name: test-034xr
Namespace: e2e-tests-nodestress-8e5yg
Node: kubernetes-node-1/10.245.1.3
Start Time: Thu, 31 Mar 2016 16:15:30 -0400
Labels: name=test
Status: Terminating (expires Thu, 31 Mar 2016 16:23:50 -0400)
Termination Grace Period: 30s
IP:
Controllers: ReplicationController/test
Containers:
test:
Container ID:
Image: openshift/hello-openshift
Image ID:
Port:
QoS Tier:
cpu: BestEffort
memory: BestEffort
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment Variables: <none>
Conditions:
Type Status
Ready False
Volumes:
default-token-4s8wq:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-4s8wq
No events.
Looking at the docker logs:
time="2016-03-31T23:09:46.782184154Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-03-31T23:10:01.782790028Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
So my current theory is the following:
I have stashed the logs away to analyze more tomorrow.
To clarify, only the pause container is left running on the node.
$ docker ps | grep test-034xr
4749d3890056 gcr.io/google_containers/pause:2.0 "/pause" 2 hours ago Up 2 hours k8s_POD.6059dfa2_test-034xr_e2e-tests-nodestress-8e5yg_f03e60df-f77c-11e5-bed9-080027242396_7a0e0878
In addition, new pods that land on that kubelet will never launch when the kubelet + docker is in this state:
Here is a log snippet for a new pod I created on the hung kubelet:
cat /var/log/kubelet-debug.log | grep nginx-2040093540-1n7hd
I0331 22:32:02.660718 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)"
I0331 22:32:02.754589 3952 manager.go:1688] Need to restart pod infra container for "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)" because it is not found
I0331 22:32:04.486144 3952 kubelet.go:2420] SyncLoop (PLEG): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"63a958c6-f790-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"928a37eb37bf4e063b5eca17156a8172d5a109f586226e0ef99e5aeaa46fb033"}
The container is never launched on the node, though, even though the kubelet saw the ADD.
I then terminated the pod.
The kubelet log showed the update was delivered:
# journalctl --no-pager -o cat -u kubelet | grep nginx-2040093540-1n7hd
I0331 22:32:02.660718 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)"
I0331 22:32:02.754589 3952 manager.go:1688] Need to restart pod infra container for "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)" because it is not found
I0331 22:32:04.486144 3952 kubelet.go:2420] SyncLoop (PLEG): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"63a958c6-f790-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"928a37eb37bf4e063b5eca17156a8172d5a109f586226e0ef99e5aeaa46fb033"}
I0331 23:18:07.412580 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)"
But the pod is never terminated, and its pause container remains:
# docker ps | grep nginx-2040093540-1n7hd
928a37eb37bf gcr.io/google_containers/pause:2.0 "/pause" 50 minutes ago Up 50 minutes k8s_POD.6059dfa2_nginx-2040093540-1n7hd_default_63a958c6-f790-11e5-bed9-080027242396_ce888928
The above pod used an image that was not on the local machine, so I then chose an image that was definitely on the machine. Docker still fails to start it, with messages like the following in the logs.
[root@kubernetes-node-1 vagrant]# cat /var/log/docker-debug-2 | grep -C 3 hello-1348166448-uhvak
time="2016-03-31T23:23:56.380365056Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-03-31T23:23:56.578526656Z" level=error msg="Handler for GET /containers/8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f/json returned error: Unknown device 8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f"
time="2016-03-31T23:23:56.579037069Z" level=error msg="HTTP Error" err="Unknown device 8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f" statusCode=500
time="2016-03-31T23:23:56.970677495Z" level=info msg="{Action=start, ID=8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f, LoginUID=4294967295, PID=3952, Config={Hostname=hello-1348166448-uhvak, AttachStdin=false, AttachStdout=false, AttachStderr=false, Tty=false, OpenStdin=false, StdinOnce=false, Env=[KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_PORT=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_PORT=443 KUBERNETES_PORT_443_TCP_ADDR=10.247.0.1 KUBERNETES_SERVICE_HOST=10.247.0.1 KUBERNETES_SERVICE_PORT=443], Image=gcr.io/google_containers/pause:2.0, Entrypoint={parts:[/pause]}, NetworkDisabled=false, Labels=map[io.kubernetes.container.restartCount:0 io.kubernetes.container.terminationMessagePath: io.kubernetes.pod.name:hello-1348166448-uhvak io.kubernetes.pod.namespace:default io.kubernetes.pod.terminationGracePeriod:30 io.kubernetes.pod.uid:a39fc7e2-f797-11e5-bed9-080027242396 io.kubernetes.container.hash:6059dfa2 io.kubernetes.container.name:POD]}, HostConfig={MemorySwap=-1, CPUShares=2, OomKillDisable=false, Privileged=false, PublishAllPorts=false, DNS=[10.247.0.10], DNSSearch=[default.svc.cluster.local svc.cluster.local cluster.local redhat.com], NetworkMode=default, ReadonlyRootfs=false, LogConfig={Type:json-file Config:map[]}}}"
2016/03/31 23:23:57 http: multiple response.WriteHeader calls
time="2016-03-31T23:24:02.326389811Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-03-31T23:24:22.348414865Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
So basically, docker is saying it's healthy (it's not), and the kubelet is saying it's healthy (it is).
So yeah, it looks like this happens when docker does not come back with an image pull error or a failure-to-start error:
time="2016-04-01T01:00:50.411236569Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:00:50.577583649Z" level=error msg="Handler for GET /containers/052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f/json returned error: Unknown device 052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f"
time="2016-04-01T01:00:50.577666266Z" level=error msg="HTTP Error" err="Unknown device 052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f" statusCode=500
time="2016-04-01T01:00:50.924279733Z" level=info msg="{Action=start, ID=052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f, LoginUID=4294967295, PID=3952, Config={Hostname=testid-794189027-qd2kr, AttachStdin=false, AttachStdout=false, AttachStderr=false, Tty=false, OpenStdin=false, StdinOnce=false, Env=[KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_PORT=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_PORT=443 KUBERNETES_PORT_443_TCP_ADDR=10.247.0.1 KUBERNETES_SERVICE_HOST=10.247.0.1 KUBERNETES_SERVICE_PORT=443], Image=gcr.io/google_containers/pause:2.0, Entrypoint={parts:[/pause]}, NetworkDisabled=false, Labels=map[io.kubernetes.container.hash:6059dfa2 io.kubernetes.container.name:POD io.kubernetes.container.restartCount:0 io.kubernetes.container.terminationMessagePath: io.kubernetes.pod.name:testid-794189027-qd2kr io.kubernetes.pod.namespace:default io.kubernetes.pod.terminationGracePeriod:30 io.kubernetes.pod.uid:2d10889d-f7a5-11e5-bed9-080027242396]}, HostConfig={MemorySwap=-1, CPUShares=2, OomKillDisable=false, Privileged=false, PublishAllPorts=false, DNS=[10.247.0.10], DNSSearch=[default.svc.cluster.local svc.cluster.local cluster.local redhat.com], NetworkMode=default, ReadonlyRootfs=false, LogConfig={Type:json-file Config:map[]}}}"
2016/04/01 01:00:51 http: multiple response.WriteHeader calls
time="2016-04-01T01:00:56.059338295Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:01:16.079062058Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:01:31.080430011Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:01:51.103674671Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:02:06.106075458Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:02:26.127051624Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:02:41.127613194Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:03:01.153306352Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:03:16.153835194Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:03:36.172615877Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:03:51.173271101Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
If I run a pod using a specific image:imageid on a node that exhibits this behavior, the kubelet will launch the container correctly if the image was already pulled, and docker will launch it correctly. If I docker run an already-pulled image directly on the node while it is in this state, it will also run. I suspect there is an issue in the kubelet, somewhere in the image puller, when docker can no longer connect to the hub: it causes a deadlock on the pods in question, so they fail to start/terminate and never get their state updated to reflect an error in the image pull.
As expected, I now see this:
E0401 00:47:50.612585 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115c075d63dae", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"test\" with ErrImagePull: \"Error while pulling image: Get https://index.docker.io/v1/repositories/openshift/hello-openshift/images: dial tcp: lookup index.docker.io: no such host\"\n", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Warning"}': 'events "test-034xr.144115c075d63dae" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
But the pod status is not being updated by the kubelet to reflect that...
Funny enough, I am not sure what caused it to unwedge, but the pod that was hung for hours has now disappeared along with the namespace.
Basically, for the pod in question, after 4 hours it finally unwedged:
[root@kubernetes-node-1 vagrant]# journalctl -u kubelet -o cat --no-pager | grep "test-034xr"
I0331 20:15:30.501759 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0331 20:15:30.823634 3952 manager.go:1688] Need to restart pod infra container for "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)" because it is not found
I0331 20:16:59.570316 3952 kubelet.go:2420] SyncLoop (PLEG): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"f03e60df-f77c-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f"}
I0331 20:23:20.586986 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
E0401 00:47:15.578764 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115b84e3de940", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:"spec.containers{test}"}, Reason:"Pulling", Message:"pulling image \"openshift/hello-openshift\"", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068435, nsec:575597376, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068435, nsec:575597376, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Normal"}': 'events "test-034xr.144115b84e3de940" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
E0401 00:47:50.605020 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115c075d4b533", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:"spec.containers{test}"}, Reason:"Failed", Message:"Failed to pull image \"openshift/hello-openshift\": Error while pulling image: Get https://index.docker.io/v1/repositories/openshift/hello-openshift/images: dial tcp: lookup index.docker.io: no such host", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599529779, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599529779, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Warning"}': 'events "test-034xr.144115c075d4b533" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
E0401 00:47:50.612585 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115c075d63dae", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"test\" with ErrImagePull: \"Error while pulling image: Get https://index.docker.io/v1/repositories/openshift/hello-openshift/images: dial tcp: lookup index.docker.io: no such host\"\n", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Warning"}': 'events "test-034xr.144115c075d63dae" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
I0401 00:47:51.599122 3952 manager.go:1368] Killing container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" with 30 second grace period
I0401 00:47:51.621944 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0401 00:47:51.635796 3952 kubelet.go:2404] SyncLoop (REMOVE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0401 00:47:51.640217 3952 kubelet.go:2235] Killing unwanted pod "test-034xr"
I0401 00:47:51.642503 3952 manager.go:1368] Killing container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" with 0 second grace period
I0401 00:47:51.647741 3952 manager.go:1402] Container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" termination failed after 5.203603ms: API error (500): Cannot stop container 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f: active container for 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f does not exist
E0401 00:47:51.647903 3952 kubelet.go:2238] Failed killing the pod "test-034xr": failed to "KillContainer" for "POD" with KillContainerError: "API error (500): Cannot stop container 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f: active container for 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f does not exist\n\n"
I0401 00:47:51.827940 3952 manager.go:1400] Container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" exited after 228.768953ms
W0401 00:47:51.827996 3952 manager.go:1406] No ref for pod '"4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr"'
I0401 00:47:53.015221 3952 kubelet.go:2235] Killing unwanted pod "test-034xr"
Other pods on this node still remain wedged though.
When temporary isn't temporary, it appears bad things happen:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/image_puller.go#L112
So if a kubelet finds itself with a pod in this status:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L3382
This state is basically the kubelet saying "I don't know exactly what state I am in." I need to look at this more tomorrow to see if we can get a more targeted test scenario that reproduces the state that results in a stuck pod when docker is not able to speak to the hub.
I see no good reason for us to treat RegistryUnavailable differently than any other error since in some cases the registry will never become available. I have now decided this is the most likely source of the wedge, and we should report that as a normal image pull error so the container can properly terminate.
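Roughly the change being proposed, as a sketch with simplified types (not the actual image puller): stop special-casing RegistryUnavailable and surface it as an ordinary pull failure so the sync loop can record it and move on.

```go
package main

import (
	"errors"
	"fmt"
)

var errRegistryUnavailable = errors.New("registry is unavailable")

type pullResult struct {
	reason   string // what the pod status would report
	terminal bool   // whether the sync loop records a failure and moves on
}

// classifyBefore models the current behavior: RegistryUnavailable is treated
// as a transient condition, so the pod never surfaces a terminal pull failure
// and can stay wedged if the registry never comes back.
func classifyBefore(err error) pullResult {
	if errors.Is(err, errRegistryUnavailable) {
		return pullResult{reason: "RegistryUnavailable", terminal: false}
	}
	return pullResult{reason: "ErrImagePull", terminal: true}
}

// classifyAfter models the proposal: treat RegistryUnavailable like any other
// pull error so the failure is reported and the pod can terminate normally.
func classifyAfter(err error) pullResult {
	return pullResult{reason: "ErrImagePull", terminal: true}
}

func main() {
	err := fmt.Errorf("pulling openshift/hello-openshift: %w", errRegistryUnavailable)
	fmt.Printf("before: %+v\n", classifyBefore(err))
	fmt.Printf("after:  %+v\n", classifyAfter(err))
}
```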
Tomorrow, I will try to mock out an environment where we can simulate a specific image name whose pull always returns RegistryUnavailable, to see if that reproduces the scenario where the pod does not terminate as expected.
Great... so restarting docker unwedged the kubelet, and the pods terminated as expected. Restarting docker again, while it still had no network connectivity to the registry, did not result in pods getting wedged when I created a handful on the node and then deleted the namespace. Creating 100 pods on the node while docker had no connectivity to the registry did re-wedge it, though: waiting 10 minutes before deleting the namespace left 1 pod in a terminating state for what is now 18 minutes, but I was an idiot and forgot to bump up the log level. So it looks like this is an issue when a kubelet at density cannot contact the image registry, not just a kubelet running a few pods.
I can now reliably reproduce this error.
I modified the kubelet to return RegistryUnavailable when pulling openshift/hello-openshift, launched an rc with 3 replicas, waited a minute or two for the pods to appear hung in ContainerCreating, deleted the namespace, and voilà! The pods never get deleted.
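For reference, the reproduction hack is conceptually just an image puller stubbed to fail for one image; a minimal sketch with hypothetical interfaces (not the kubelet's) looks like this:

```go
package main

import (
	"errors"
	"fmt"
)

// imagePuller is a stand-in for whatever abstraction actually pulls images.
type imagePuller interface {
	Pull(image string) error
}

type alwaysSucceeds struct{}

func (alwaysSucceeds) Pull(image string) error { return nil }

// failingPuller wraps a puller and forces a "registry unavailable" error for
// one image, mimicking the kubelet modification described above.
type failingPuller struct {
	inner   imagePuller
	failFor string
}

func (f failingPuller) Pull(image string) error {
	if image == f.failFor {
		return errors.New("RegistryUnavailable: registry cannot be reached")
	}
	return f.inner.Pull(image)
}

func main() {
	p := failingPuller{inner: alwaysSucceeds{}, failFor: "openshift/hello-openshift"}
	for _, img := range []string{"openshift/hello-openshift", "gcr.io/google_containers/pause:2.0"} {
		fmt.Printf("pull %s: %v\n", img, p.Pull(img))
	}
}
```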
Upstream PR: https://github.com/kubernetes/kubernetes/pull/23746
and... is there any way to delete those pods? I've got so many pods in that state...
@metal3d oc delete pod/<name of pod> --grace-period=0 will force deletion.
A big thanks! @ncdc
I have 3 pods in Terminating status. Even the command from @ncdc hangs forever :(
Here is the detailed info:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x1952993]
goroutine 1 [running]:
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl.ReaperFor(0x0, 0x0, 0x3c2d039, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc421abe090)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/delete.go:82 +0x1373
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/util.(*ring1Factory).Reaper(0xc420c43590, 0xc4209caa10, 0x0, 0x0, 0xc421a9c001, 0x1)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/util/factory_object_mapping.go:302 +0x151
github.com/openshift/origin/pkg/oc/cli/util/clientcmd.(*ring1Factory).Reaper(0xc42039d420, 0xc4209caa10, 0x0, 0xc42164d608, 0x4f204a, 0xc420082000)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/oc/cli/util/clientcmd/factory_object_mapping.go:287 +0x93d
github.com/openshift/origin/pkg/oc/cli/util/clientcmd.(*Factory).Reaper(0xc420c435c0, 0xc4209caa10, 0x0, 0x0, 0x0, 0x0)
<autogenerated>:1 +0x47
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.ReapResult.func1(0xc4209cac40, 0x0, 0x0, 0x79c17e0, 0x4c47a6a)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:250 +0xe9
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.ContinueOnErrorVisitor.Visit.func1(0xc4209cac40, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:396 +0x164
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.DecoratedVisitor.Visit.func1(0xc4209cac40, 0x0, 0x0, 0x7, 0x1)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:372 +0xe7
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.FlattenListVisitor.Visit.func1(0xc4209cac40, 0x0, 0x0, 0xc421abba20, 0x7f10363e6228)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:433 +0x4fe
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.EagerVisitorList.Visit.func1(0xc4209cac40, 0x0, 0x0, 0x1, 0xc421abba20)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:255 +0x164
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*Info).Visit(0xc4209cac40, 0xc421abba20, 0x28, 0x464dba0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:105 +0x42
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.EagerVisitorList.Visit(0xc421a9c1a0, 0x1, 0x1, 0xc421abe2a0, 0x1, 0xc421abe2a0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:250 +0xea
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*EagerVisitorList).Visit(0xc421abb960, 0xc421abe2a0, 0x7f103654ad90, 0x0)
<autogenerated>:1 +0x58
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.FlattenListVisitor.Visit(0x79f65a0, 0xc421abb960, 0xc420cf0980, 0xc4213cc4c0, 0x60000000001, 0xc4213cc4c0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:428 +0x9e
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*FlattenListVisitor).Visit(0xc421abb980, 0xc4213cc4c0, 0x28, 0xc420034700)
<autogenerated>:1 +0x58
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.DecoratedVisitor.Visit(0x79f6620, 0xc421abb980, 0xc421a9c1b0, 0x2, 0x2, 0xc421abb9e0, 0xc421a9c101, 0xc421abb9e0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:363 +0x9b
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*DecoratedVisitor).Visit(0xc421abe270, 0xc421abb9e0, 0xc421a9c1c0, 0xc42164dbc8)
<autogenerated>:1 +0x62
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.ContinueOnErrorVisitor.Visit(0x79f6520, 0xc421abe270, 0xc4209cacb0, 0x0, 0x2)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:391 +0xe4
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*ContinueOnErrorVisitor).Visit(0xc421a9c1c0, 0xc4209cacb0, 0x415e18, 0x70)
<autogenerated>:1 +0x4f
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*Result).Visit(0xc420463e80, 0xc4209cacb0, 0xc420cf6050, 0x7a36460)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/result.go:98 +0x62
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.ReapResult(0xc420463e80, 0x7a4a1e0, 0xc420c435c0, 0x79efaa0, 0xc42000e018, 0xc420250001, 0x0, 0x1, 0xc421300001, 0x7a36460, ...)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:245 +0x16c
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.(*DeleteOptions).RunDelete(0xc42126ae70, 0xc420473200, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:235 +0xd2
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.NewCmdDelete.func1(0xc420473200, 0xc42025ee00, 0x1, 0x2)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:142 +0x178
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc420473200, 0xc42025ecc0, 0x2, 0x2, 0xc420473200, 0xc42025ecc0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:603 +0x234
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4203d4d80, 0x202c551, 0xc4203d4d80, 0xc420238270)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:689 +0x2fe
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4203d4d80, 0x2, 0xc4203d4d80)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:648 +0x2b
main.main()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/oc/oc.go:41 +0x293
@anandbaskaran if the command provided by @ncdc still hangs, you can try forcing it with --force. It gives:
oc delete pod/