Created a lot of pods (1000 across 2 nodes, approx 500 per node). Then deleted the namespace:
# oc delete ns clusterproject0
Deletion clearly started, since only 319 of the 1000 pods remain, but now it refuses to go any further. Environment state and logs are below.
I haven't seen this before -- last similar run was on 3.2.0.1, though I only went to 250 pods per node at that code level (things worked ok).
# openshift version
openshift v3.2.0.5
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5
# oc get no
NAME STATUS AGE
dell-r620-01.perf.lab.eng.rdu.redhat.com Ready,SchedulingDisabled 19d
dell-r730-01.perf.lab.eng.rdu.redhat.com Ready 19d
dell-r730-02.perf.lab.eng.rdu.redhat.com Ready 19d
# oc get ns
NAME STATUS AGE
clusterproject0 Terminating 1d
default Active 19d
management-infra Active 19d
openshift Active 19d
openshift-infra Active 19d
# oc delete ns clusterproject0
Error from server: namespaces "clusterproject0" cannot be updated: The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.
All of the stuck-in-terminating pods were scheduled and running on one of the nodes. The other node was able to successfully terminate all of the pods that were running there.
From the master:
Mar 21 12:18:36 dell-r620-01.perf.lab.eng.rdu.redhat.com atomic-openshift-master[6683]: E0321 12:18:36.451332 6683 namespace_controller.go:139] unexpected items still remain in namespace: clusterproject0 for gvr: { v1 pods}
Mar 21 12:18:37 dell-r620-01.perf.lab.eng.rdu.redhat.com atomic-openshift-master[6683]: W0321 12:18:37.223252 6683 reflector.go:289] /usr/lib/golang/src/runtime/asm_amd64.s:2232: watch of *api.ServiceAccount ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [24072046/23943693]) [24073045]
From the node that can't terminate its pods:
Mar 21 12:22:49 dell-r730-01.perf.lab.eng.rdu.redhat.com atomic-openshift-node[7438]: W0321 12:22:49.119784 7438 kubelet.go:1850] Unable to retrieve pull secret clusterproject0/default-dockercfg-woua1 for clusterproject0/hellopods505 due to secrets "default-dockercfg-woua1" not found. The image pull may not succeed.
@derekwaynecarr @liggitt
Attached is a stacktrace taken from the wedged node, as requested by @ncdc
# curl -o stack.txt http://localhost:6060/debug/pprof/goroutine?debug=2
The nodes reported healthy as well, so the node controller is not going to do anything...
what grace period is set for pods when deleting them as part of a namespace deletion?
@liggitt - none. we call delete collection, which in turn calls delete, which will pull the grace period from the strategy. https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/etcd/etcd.go#L398
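For readers not familiar with that code path, here is a rough sketch of its shape (illustrative types only, not the real apiserver registry code): delete-collection issues per-item deletes, and each delete falls back to the strategy's default grace period when the caller supplies none.

```go
package main

import "fmt"

// Illustrative only: a "strategy" that knows the default grace period for a resource.
type restStrategy interface {
	DefaultGracePeriodSeconds() int64
}

type podStrategy struct{}

// Pods default to a 30-second termination grace period.
func (podStrategy) DefaultGracePeriodSeconds() int64 { return 30 }

// deleteItem mimics a single graceful delete: if the caller passed no grace
// period, fall back to whatever the strategy says.
func deleteItem(name string, requested *int64, s restStrategy) {
	grace := s.DefaultGracePeriodSeconds()
	if requested != nil {
		grace = *requested
	}
	fmt.Printf("deleting %s with grace period %ds\n", name, grace)
}

// deleteCollection mimics namespace cleanup: no explicit grace period is
// passed down, so every pod gets the strategy default (30s), not 0.
func deleteCollection(items []string, s restStrategy) {
	for _, name := range items {
		deleteItem(name, nil, s)
	}
}

func main() {
	deleteCollection([]string{"hellopods505", "hellopods506"}, podStrategy{})
}
```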
so a stuck pod can block namespace deletion forever?
For more context, see the long discussion here: https://github.com/kubernetes/kubernetes/issues/20497
@jeremyeder did the pod use a pvc with pv?
@derekwaynecarr no, there were no PVs in this environment at all. The pods simply run hello-openshift.
I had to stop docker on the node with all the stuck pods, dmsetup remove all the thin volumes there, and reboot everything; then the namespace was able to clean itself up :/
@jeremyeder can you upload the logs for docker and the node somewhere (or maybe the entire journal from just before you deleted the project to a few minutes after)?
This is the journal from the master:
mar20.syslog.txt.zip
This is the journal from the node that got wedged:
mar20.syslog_node.txt.zip
I think the bug is that failure to fetch image pull secrets for a running pod will prevent our ability to sync that pod.
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1786
This will prevent this sync from ever happening:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/manager.go#L1809
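To make the hypothesized failure mode concrete, here is a sketch with simplified types (not the actual kubelet code): if the sync path bails out on a failed secret lookup before the teardown branch, a terminating pod would never be killed.

```go
package main

import (
	"errors"
	"fmt"
)

type pod struct {
	name              string
	deletionRequested bool
}

// syncPodBuggy models the suspected bug: a failed secret lookup aborts the
// sync before the teardown branch is reached, so a terminating pod is never
// killed and the namespace can never finish deleting.
func syncPodBuggy(p pod, secretsOK bool) error {
	if !secretsOK {
		return errors.New("unable to retrieve pull secret") // early exit; teardown below never runs
	}
	if p.deletionRequested {
		fmt.Println("killing pod", p.name)
		return nil
	}
	fmt.Println("ensuring pod is running:", p.name)
	return nil
}

// syncPodTolerant treats missing secrets as best effort and evaluates the
// teardown branch first, so deletion still proceeds.
func syncPodTolerant(p pod, secretsOK bool) error {
	if p.deletionRequested {
		fmt.Println("killing pod", p.name)
		return nil
	}
	if !secretsOK {
		fmt.Println("warning: pull secrets missing; image pull may fail for", p.name)
	}
	fmt.Println("ensuring pod is running:", p.name)
	return nil
}

func main() {
	p := pod{name: "hellopods505", deletionRequested: true}
	fmt.Println("buggy:", syncPodBuggy(p, false))       // returns an error; pod never torn down
	fmt.Println("tolerant:", syncPodTolerant(p, false)) // pod killed despite missing secrets
}
```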
I agree with @derekwaynecarr that this is the problem in this case.
@jeremyeder were the pods actually running prior to trying to delete the namespace on the problematic node? Or were they ImagePullBackOff?
@jeremyeder told me earlier that the pods were running, but I have to say that what I'm seeing in the node logs makes me think that either:
I think we need to try to reproduce this together and we need to poke around and look at both the state of the containers on the node and what docker containers are running.
If we can reproduce, the output of oc get pods and docker ps -a would be helpful
nope, getPullSecretsForPod never returns an error in the kubelet... it's best effort to load all available pull secrets. the log message for that is a red herring.
There were duplicate similar log messages that tripped me up. Thanks @liggitt
innocent! :)
@deads2k famine made me miss the obvious before lunch. you are spared!
@deads2k @liggitt @derekwaynecarr I feel like we should change the signature of getPullSecretsForPod if it is never supposed to return an error -- any objections to doing that as a follow-up?
> any objections to doing that as a follow-up?
Not really. I might have liked "return what we know, plus an error indicating why it may be incomplete", but clearly I didn't like it enough to actually do that.
@pmorie :+1:
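A minimal sketch of the "return what we know, plus an error indicating why it may be incomplete" shape discussed above (hypothetical names, not the actual kubelet signature):

```go
package main

import "fmt"

type secret struct{ name string }

// getPullSecretsForPod (hypothetical shape): best-effort collection of pull
// secrets. It always returns whatever it could load; a non-nil error only
// signals that the result may be incomplete.
func getPullSecretsForPod(names []string, store map[string]secret) ([]secret, error) {
	var found []secret
	var missing []string
	for _, n := range names {
		if s, ok := store[n]; ok {
			found = append(found, s)
		} else {
			missing = append(missing, n)
		}
	}
	if len(missing) > 0 {
		return found, fmt.Errorf("pull secrets may be incomplete, missing: %v", missing)
	}
	return found, nil
}

func main() {
	store := map[string]secret{"default-dockercfg-abc12": {name: "default-dockercfg-abc12"}}
	secrets, err := getPullSecretsForPod([]string{"default-dockercfg-abc12", "default-dockercfg-woua1"}, store)
	if err != nil {
		fmt.Println("warning:", err) // log, but still proceed with the partial result
	}
	fmt.Println("using", len(secrets), "pull secret(s)")
}
```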
Managed to wedge it again, at around the same trigger point: running around 750-800 pods on one node and then deleting the namespace they're in. Logs are available privately (too large to attach now that it's at loglevel=5).
@jeremyeder thanks. will ping in IRC, will be good to have the node log, and a YAML representation of a pod that is wedged.
Of the terminating 'pods', I manually chose one and killed its container and infra container. The pod was then removed from the API server, but the PLEG loop in the kubelet continues to try to start the pod over and over again... I suspect a bug in the main kubelet sync loop.
@derekwaynecarr do you know why the containers were still alive? Or was it the case that the Kubelet was deleting/recreating infinitely?
@ncdc - the kubelet never attempted to delete the container (bug 1). After I manually killed the containers, the pod was removed from the API server (good), but the kubelet continues to try to resurrect the pod (bug 2). I suspect some internal state is messed up; working my way backwards.
The pod worker continues to work on the pod that has been removed from the API server, and gets hung on errors trying to set up the network because the pod information cannot be found...
So the manage-pod loop is getting wedged when we are unable to start the network for a pod, and it keeps requeueing the same item... In general, I think we are reporting an error somewhere that should not be treated as an error, and the pod worker keeps requeueing the same underlying task.
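As a toy illustration of that requeue pattern (not the real pod worker): if a condition that should be terminal, such as "pod no longer exists", is reported as an error, the worker requeues the same item forever.

```go
package main

import (
	"errors"
	"fmt"
)

var errPodGone = errors.New("pod information cannot be found")

// syncOnce pretends to set up networking for a pod; it fails when the pod has
// already been removed from the API server.
func syncOnce(podExists bool) error {
	if !podExists {
		return errPodGone
	}
	return nil
}

// workerLoop requeues on any error. If "pod gone" is treated as an error
// instead of a terminal condition, the same work item spins forever
// (attempts are capped here only for the demo).
func workerLoop(podExists, dropWhenGone bool, maxAttempts int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := syncOnce(podExists)
		if err == nil {
			fmt.Println("synced on attempt", attempt)
			return
		}
		if dropWhenGone && errors.Is(err, errPodGone) {
			fmt.Println("pod removed; dropping work item on attempt", attempt)
			return
		}
		fmt.Println("requeueing after error:", err)
	}
	fmt.Println("still requeueing after", maxAttempts, "attempts (wedged)")
}

func main() {
	workerLoop(false, false, 3) // wedged behavior: endless requeue of the same task
	workerLoop(false, true, 3)  // desired behavior: drop the stale work item
}
```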
RemoveOrphanedStatuses does not appear to be removing orphaned pods based on the logs...
So I restarted atomic-openshift-node on the environment that showed this behavior, and the system then properly cleared all its state and things were good. We need to get more logging in the kubelet to better diagnose why the kubelet's desired state was stale.
To summarize, the root issue as I can gather it:
- the kubelet never sees that the pods in question have a deletion timestamp; as a result, the kubelet never kills those pods
- the reason the kubelet can miss the deletion timestamp is unclear at this time
- the underlying sync loop was doing the right thing when it sees no deletion timestamp on the pod
Should we move this conversation to the Kubernetes issue tracker?
We also hit a similar issue in our environment: when we delete the pods, the command reports them as deleted, but they stay in "Terminating". We created those pods via a job.
openshift version: v3.2.0.7
[root@dhcp-128-70 test]# oc delete pods --all
pod "pi-0djsq" deleted
pod "pi-12lw1" deleted
pod "pi-1hoy8" deleted
pod "pi-1jjf9" deleted
pod "pi-2srbj" deleted
pod "pi-41bs7" deleted
pod "pi-4iead" deleted
pod "pi-4np77" deleted
pod "pi-5rphs" deleted
pod "pi-5y3di" deleted
pod "pi-6cimb" deleted
pod "pi-8432a" deleted
pod "pi-d24nl" deleted
pod "pi-dh8zh" deleted
pod "pi-dkliq" deleted
pod "pi-expdd" deleted
pod "pi-hy2lh" deleted
pod "pi-hz940" deleted
pod "pi-l3hfc" deleted
pod "pi-lhoyw" deleted
pod "pi-w0s4i" deleted
pod "pi-x4x9h" deleted
pod "pi-xflxq" deleted
pod "pi-xiq81" deleted
[root@dhcp-128-70 test]# oc get po
NAME READY STATUS RESTARTS AGE
pi-0djsq 0/1 Terminating 0 32m
pi-12lw1 0/1 Terminating 0 32m
pi-1hoy8 0/1 Terminating 0 32m
pi-1jjf9 0/1 Terminating 0 32m
pi-2srbj 0/1 Terminating 0 32m
pi-41bs7 0/1 Terminating 0 32m
pi-4iead 0/1 Terminating 0 32m
pi-4np77 0/1 Terminating 0 32m
pi-5rphs 0/1 Terminating 0 32m
pi-5y3di 0/1 Terminating 0 32m
pi-6cimb 0/1 Terminating 0 32m
pi-8432a 0/1 Terminating 0 32m
pi-d24nl 0/1 Terminating 0 32m
pi-dh8zh 0/1 Terminating 0 32m
pi-dkliq 0/1 Terminating 0 32m
pi-expdd 0/1 Terminating 0 32m
pi-hy2lh 0/1 Terminating 0 32m
pi-hz940 0/1 Terminating 0 32m
pi-l3hfc 0/1 Terminating 0 32m
pi-lhoyw 0/1 Terminating 0 32m
pi-w0s4i 0/1 Terminating 0 32m
pi-x4x9h 0/1 Terminating 0 32m
pi-xflxq 0/1 Terminating 0 32m
pi-xiq81 0/1 Terminating 0 32m
@mdshuai when you encounter this, are all the pods on the same node?
@derekwaynecarr yes
I spent the majority of my time today trying to reproduce this scenario. If it happens again for folks, please reach out to me with information on how I can debug the system. Most notably, I want to try to schedule new pods to a node that exhibits this problem to see if they ever run, and I want to forcefully delete one of the pods from a wedged node to see if the container is killed on the host. I suspect that when this situation occurs the kubelet is not seeing any watch notifications delivered, but I need to reproduce it reliably to debug further.
I was finally able to reproduce this on a 3-node cluster after many hours of creating 500 pods, waiting for 200 to run, and then tearing down the project. For many hours it was fine. I did notice that the nodes were getting less and less successful at getting the full set of 200 pods into a running state the longer the test ran, but they were always able to properly tear down all the pods.
After much time, I was left in a state where finally 1 of the nodes failed to tear down 3 pods. The node continued to report a valid heartbeat back to the API server, but it would no longer launch new pods that were scheduled to it. I was able to ssh into the machine and do a little more sleuthing.
The kubelet actually did see the notification from the watch source about the pod:
I0331 20:15:30.501759 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0331 20:15:30.823634 3952 manager.go:1688] Need to restart pod infra container for "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)" because it is not found
I0331 20:16:59.570316 3952 kubelet.go:2420] SyncLoop (PLEG): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"f03e60df-f77c-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f"}
I0331 20:23:20.586986 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
Looking at the pod in question:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"e2e-tests-nodestress-8e5yg","name":"test","uid":"ededf40d-f77c-11e5-bed9-080027242396","apiVersion":"v1","resourceVersion":"18868"}}
  creationTimestamp: 2016-03-31T20:12:48Z
  deletionGracePeriodSeconds: 30
  deletionTimestamp: 2016-03-31T20:23:50Z
  generateName: test-
  labels:
    name: test
  name: test-034xr
  namespace: e2e-tests-nodestress-8e5yg
  resourceVersion: "19910"
  selfLink: /api/v1/namespaces/e2e-tests-nodestress-8e5yg/pods/test-034xr
  uid: f03e60df-f77c-11e5-bed9-080027242396
spec:
  containers:
  - image: openshift/hello-openshift
    imagePullPolicy: Always
    name: test
    resources: {}
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-4s8wq
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: kubernetes-node-1
  restartPolicy: Always
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: default-token-4s8wq
    secret:
      secretName: default-token-4s8wq
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2016-03-31T20:15:30Z
    message: 'containers with unready status: [test]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  containerStatuses:
  - image: openshift/hello-openshift
    imageID: ""
    lastState: {}
    name: test
    ready: false
    restartCount: 0
    state:
      waiting:
        message: 'Image: openshift/hello-openshift is ready, container is creating'
        reason: ContainerCreating
  hostIP: 10.245.1.3
  phase: Pending
  startTime: 2016-03-31T20:15:30Z
You can see it was stuck in the Waiting state with reason ContainerCreating.
Name: test-034xr
Namespace: e2e-tests-nodestress-8e5yg
Node: kubernetes-node-1/10.245.1.3
Start Time: Thu, 31 Mar 2016 16:15:30 -0400
Labels: name=test
Status: Terminating (expires Thu, 31 Mar 2016 16:23:50 -0400)
Termination Grace Period: 30s
IP:
Controllers: ReplicationController/test
Containers:
test:
Container ID:
Image: openshift/hello-openshift
Image ID:
Port:
QoS Tier:
cpu: BestEffort
memory: BestEffort
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment Variables: <none>
Conditions:
Type Status
Ready False
Volumes:
default-token-4s8wq:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-4s8wq
No events.
Looking at the docker logs:
time="2016-03-31T23:09:46.782184154Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-03-31T23:10:01.782790028Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
So my current theory is the following:
I have stashed the logs away to analyze more tomorrow.
To clarify, only the pause container is left running on the node.
$ docker ps | grep test-034xr
4749d3890056 gcr.io/google_containers/pause:2.0 "/pause" 2 hours ago Up 2 hours k8s_POD.6059dfa2_test-034xr_e2e-tests-nodestress-8e5yg_f03e60df-f77c-11e5-bed9-080027242396_7a0e0878
In addition, new pods that land on that kubelet will never launch when the kubelet + docker is in this state:
Here is a log snippet for a new pod I created on the hung kubelet:
cat /var/log/kubelet-debug.log | grep nginx-2040093540-1n7hd
I0331 22:32:02.660718 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)"
I0331 22:32:02.754589 3952 manager.go:1688] Need to restart pod infra container for "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)" because it is not found
I0331 22:32:04.486144 3952 kubelet.go:2420] SyncLoop (PLEG): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"63a958c6-f790-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"928a37eb37bf4e063b5eca17156a8172d5a109f586226e0ef99e5aeaa46fb033"}
The container is never launched on the node, though, even though the kubelet saw the ADD.
I then terminated the pod.
The kubelet log showed the update was delivered:
# journalctl --no-pager -o cat -u kubelet | grep nginx-2040093540-1n7hd
I0331 22:32:02.660718 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)"
I0331 22:32:02.754589 3952 manager.go:1688] Need to restart pod infra container for "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)" because it is not found
I0331 22:32:04.486144 3952 kubelet.go:2420] SyncLoop (PLEG): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"63a958c6-f790-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"928a37eb37bf4e063b5eca17156a8172d5a109f586226e0ef99e5aeaa46fb033"}
I0331 23:18:07.412580 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "nginx-2040093540-1n7hd_default(63a958c6-f790-11e5-bed9-080027242396)"
But the pod is never terminated, and its pause container remains:
# docker ps | grep nginx-2040093540-1n7hd
928a37eb37bf gcr.io/google_containers/pause:2.0 "/pause" 50 minutes ago Up 50 minutes k8s_POD.6059dfa2_nginx-2040093540-1n7hd_default_63a958c6-f790-11e5-bed9-080027242396_ce888928
The above pod used an image that was not on the local machine, so I then chose an image that was definitely on the machine. Docker still fails to start it, with messages like the following in the logs.
[root@kubernetes-node-1 vagrant]# cat /var/log/docker-debug-2 | grep -C 3 hello-1348166448-uhvak
time="2016-03-31T23:23:56.380365056Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-03-31T23:23:56.578526656Z" level=error msg="Handler for GET /containers/8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f/json returned error: Unknown device 8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f"
time="2016-03-31T23:23:56.579037069Z" level=error msg="HTTP Error" err="Unknown device 8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f" statusCode=500
time="2016-03-31T23:23:56.970677495Z" level=info msg="{Action=start, ID=8290a58a1ee335171ada06b932b6b46670808a8a096df72bd839966e8b968e6f, LoginUID=4294967295, PID=3952, Config={Hostname=hello-1348166448-uhvak, AttachStdin=false, AttachStdout=false, AttachStderr=false, Tty=false, OpenStdin=false, StdinOnce=false, Env=[KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_PORT=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_PORT=443 KUBERNETES_PORT_443_TCP_ADDR=10.247.0.1 KUBERNETES_SERVICE_HOST=10.247.0.1 KUBERNETES_SERVICE_PORT=443], Image=gcr.io/google_containers/pause:2.0, Entrypoint={parts:[/pause]}, NetworkDisabled=false, Labels=map[io.kubernetes.container.restartCount:0 io.kubernetes.container.terminationMessagePath: io.kubernetes.pod.name:hello-1348166448-uhvak io.kubernetes.pod.namespace:default io.kubernetes.pod.terminationGracePeriod:30 io.kubernetes.pod.uid:a39fc7e2-f797-11e5-bed9-080027242396 io.kubernetes.container.hash:6059dfa2 io.kubernetes.container.name:POD]}, HostConfig={MemorySwap=-1, CPUShares=2, OomKillDisable=false, Privileged=false, PublishAllPorts=false, DNS=[10.247.0.10], DNSSearch=[default.svc.cluster.local svc.cluster.local cluster.local redhat.com], NetworkMode=default, ReadonlyRootfs=false, LogConfig={Type:json-file Config:map[]}}}"
2016/03/31 23:23:57 http: multiple response.WriteHeader calls
time="2016-03-31T23:24:02.326389811Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-03-31T23:24:22.348414865Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
So basically, docker is saying it's healthy (it's not), and the kubelet is saying it's healthy (it is).
So yeah, it looks like this happens when docker does not come back with an image pull error or a failure-to-start error:
time="2016-04-01T01:00:50.411236569Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:00:50.577583649Z" level=error msg="Handler for GET /containers/052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f/json returned error: Unknown device 052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f"
time="2016-04-01T01:00:50.577666266Z" level=error msg="HTTP Error" err="Unknown device 052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f" statusCode=500
time="2016-04-01T01:00:50.924279733Z" level=info msg="{Action=start, ID=052afa147d67f028d4245666bb1e4b735aacad3f867ebc3c5d8564b5b902a96f, LoginUID=4294967295, PID=3952, Config={Hostname=testid-794189027-qd2kr, AttachStdin=false, AttachStdout=false, AttachStderr=false, Tty=false, OpenStdin=false, StdinOnce=false, Env=[KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_PORT=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP=tcp://10.247.0.1:443 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_PORT=443 KUBERNETES_PORT_443_TCP_ADDR=10.247.0.1 KUBERNETES_SERVICE_HOST=10.247.0.1 KUBERNETES_SERVICE_PORT=443], Image=gcr.io/google_containers/pause:2.0, Entrypoint={parts:[/pause]}, NetworkDisabled=false, Labels=map[io.kubernetes.container.hash:6059dfa2 io.kubernetes.container.name:POD io.kubernetes.container.restartCount:0 io.kubernetes.container.terminationMessagePath: io.kubernetes.pod.name:testid-794189027-qd2kr io.kubernetes.pod.namespace:default io.kubernetes.pod.terminationGracePeriod:30 io.kubernetes.pod.uid:2d10889d-f7a5-11e5-bed9-080027242396]}, HostConfig={MemorySwap=-1, CPUShares=2, OomKillDisable=false, Privileged=false, PublishAllPorts=false, DNS=[10.247.0.10], DNSSearch=[default.svc.cluster.local svc.cluster.local cluster.local redhat.com], NetworkMode=default, ReadonlyRootfs=false, LogConfig={Type:json-file Config:map[]}}}"
2016/04/01 01:00:51 http: multiple response.WriteHeader calls
time="2016-04-01T01:00:56.059338295Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:01:16.079062058Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:01:31.080430011Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:01:51.103674671Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:02:06.106075458Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:02:26.127051624Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:02:41.127613194Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:03:01.153306352Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:03:16.153835194Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
time="2016-04-01T01:03:36.172615877Z" level=info msg="{Action=create, LoginUID=4294967295, PID=3952}"
time="2016-04-01T01:03:51.173271101Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
If I run a pod using a specific image:imageid on a node that exhibits this behavior, the kubelet will launch the container correctly if the image was already pulled, and docker will launch it correctly. If I docker run an already-pulled image directly on the node while it is in this state, it will also run. I suspect there is an issue in the kubelet, somewhere in the image puller, when docker can no longer connect to the hub: it causes a deadlock on the pods in question, so they fail to start/terminate and never get their state updated to reflect an error in the image pull.
As expected, I now see this:
E0401 00:47:50.612585 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115c075d63dae", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"test\" with ErrImagePull: \"Error while pulling image: Get https://index.docker.io/v1/repositories/openshift/hello-openshift/images: dial tcp: lookup index.docker.io: no such host\"\n", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Warning"}': 'events "test-034xr.144115c075d63dae" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
But the pod status is not being updated by the kubelet to reflect that...
Funny enough, I am not sure what caused it to unwedge, but the pod that was hung for hours has now disappeared along with the namespace.
Basically, for the pod in question, after 4 hours it finally unwedged:
[root@kubernetes-node-1 vagrant]# journalctl -u kubelet -o cat --no-pager | grep "test-034xr"
I0331 20:15:30.501759 3952 kubelet.go:2394] SyncLoop (ADD, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0331 20:15:30.823634 3952 manager.go:1688] Need to restart pod infra container for "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)" because it is not found
I0331 20:16:59.570316 3952 kubelet.go:2420] SyncLoop (PLEG): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)", event: &pleg.PodLifecycleEvent{ID:"f03e60df-f77c-11e5-bed9-080027242396", Type:"ContainerStarted", Data:"4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f"}
I0331 20:23:20.586986 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
E0401 00:47:15.578764 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115b84e3de940", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:"spec.containers{test}"}, Reason:"Pulling", Message:"pulling image \"openshift/hello-openshift\"", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068435, nsec:575597376, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068435, nsec:575597376, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Normal"}': 'events "test-034xr.144115b84e3de940" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
E0401 00:47:50.605020 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115c075d4b533", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:"spec.containers{test}"}, Reason:"Failed", Message:"Failed to pull image \"openshift/hello-openshift\": Error while pulling image: Get https://index.docker.io/v1/repositories/openshift/hello-openshift/images: dial tcp: lookup index.docker.io: no such host", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599529779, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599529779, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Warning"}': 'events "test-034xr.144115c075d4b533" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
E0401 00:47:50.612585 3952 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"test-034xr.144115c075d63dae", GenerateName:"", Namespace:"e2e-tests-nodestress-8e5yg", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-nodestress-8e5yg", Name:"test-034xr", UID:"f03e60df-f77c-11e5-bed9-080027242396", APIVersion:"v1", ResourceVersion:"19138", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"test\" with ErrImagePull: \"Error while pulling image: Get https://index.docker.io/v1/repositories/openshift/hello-openshift/images: dial tcp: lookup index.docker.io: no such host\"\n", Source:api.EventSource{Component:"kubelet", Host:"kubernetes-node-1"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63595068470, nsec:599630254, loc:(*time.Location)(0x2e77fe0)}}, Count:1, Type:"Warning"}': 'events "test-034xr.144115c075d63dae" is forbidden: Unable to create new content in namespace e2e-tests-nodestress-8e5yg because it is being terminated.' (will not retry!)
I0401 00:47:51.599122 3952 manager.go:1368] Killing container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" with 30 second grace period
I0401 00:47:51.621944 3952 kubelet.go:2401] SyncLoop (UPDATE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0401 00:47:51.635796 3952 kubelet.go:2404] SyncLoop (REMOVE, "api"): "test-034xr_e2e-tests-nodestress-8e5yg(f03e60df-f77c-11e5-bed9-080027242396)"
I0401 00:47:51.640217 3952 kubelet.go:2235] Killing unwanted pod "test-034xr"
I0401 00:47:51.642503 3952 manager.go:1368] Killing container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" with 0 second grace period
I0401 00:47:51.647741 3952 manager.go:1402] Container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" termination failed after 5.203603ms: API error (500): Cannot stop container 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f: active container for 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f does not exist
E0401 00:47:51.647903 3952 kubelet.go:2238] Failed killing the pod "test-034xr": failed to "KillContainer" for "POD" with KillContainerError: "API error (500): Cannot stop container 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f: active container for 4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f does not exist\n\n"
I0401 00:47:51.827940 3952 manager.go:1400] Container "4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr" exited after 228.768953ms
W0401 00:47:51.827996 3952 manager.go:1406] No ref for pod '"4749d3890056803ec6d599cd82391d459d8d47246bead00764cb590f2ea2991f e2e-tests-nodestress-8e5yg/test-034xr"'
I0401 00:47:53.015221 3952 kubelet.go:2235] Killing unwanted pod "test-034xr"
Other pods on this node still remain wedged though.
When temporary isn't temporary, it appears bad things happen:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/image_puller.go#L112
So if a kubelet finds itself with a pod in this status:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L3382
This state is basically the kubelet saying "I don't know exactly what state I am in." I need to look at this more tomorrow to see if we can get a more targeted test scenario that reproduces the state that results in a stuck pod when docker is not able to speak to the hub.
I see no good reason for us to treat RegistryUnavailable differently than any other error since in some cases the registry will never become available. I have now decided this is the most likely source of the wedge, and we should report that as a normal image pull error so the container can properly terminate.
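Roughly the change being proposed, as a sketch with simplified types (not the actual image puller): stop special-casing RegistryUnavailable and surface it as an ordinary pull failure so the sync loop can record it and move on.

```go
package main

import (
	"errors"
	"fmt"
)

var errRegistryUnavailable = errors.New("registry is unavailable")

type pullResult struct {
	reason   string // what the pod status would report
	terminal bool   // whether the sync loop records a failure and moves on
}

// classifyBefore models the current behavior: RegistryUnavailable is treated
// as a transient condition, so the pod never surfaces a terminal pull failure
// and can stay wedged if the registry never comes back.
func classifyBefore(err error) pullResult {
	if errors.Is(err, errRegistryUnavailable) {
		return pullResult{reason: "RegistryUnavailable", terminal: false}
	}
	return pullResult{reason: "ErrImagePull", terminal: true}
}

// classifyAfter models the proposal: treat RegistryUnavailable like any other
// pull error so the failure is reported and the pod can terminate normally.
func classifyAfter(err error) pullResult {
	return pullResult{reason: "ErrImagePull", terminal: true}
}

func main() {
	err := fmt.Errorf("pulling openshift/hello-openshift: %w", errRegistryUnavailable)
	fmt.Printf("before: %+v\n", classifyBefore(err))
	fmt.Printf("after:  %+v\n", classifyAfter(err))
}
```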
Tomorrow, I will try to mock out an environment where we can simulate a specific image name whose pull always returns RegistryUnavailable, to see if that reproduces the scenario where the pod does not terminate as expected.
Great... so restarting docker unwedged the kubelet, and the pods terminated as expected. Restarting docker again, while it still had no network connectivity to the registry, did not result in pods getting wedged when I created a handful on the node and then deleted the namespace. Creating 100 pods on the node while docker had no connectivity to the registry did re-wedge it, though: waiting 10 minutes before deleting the namespace left 1 pod in a terminating state for what is now 18 minutes, but I was an idiot and forgot to bump up the log level. So it looks like this is an issue when a kubelet at density cannot contact the image registry, not just a kubelet running a few pods.
I can now reliably reproduce this error.
I modified the kubelet to return RegistryUnavailable when pulling openshift/hello-openshift, launched an rc with 3 replicas, waited a minute or two for the pods to appear hung in ContainerCreating, deleted the namespace, and voilà! The pods never get deleted.
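For reference, the reproduction hack is conceptually just an image puller stubbed to fail for one image; a minimal sketch with hypothetical interfaces (not the kubelet's) looks like this:

```go
package main

import (
	"errors"
	"fmt"
)

// imagePuller is a stand-in for whatever abstraction actually pulls images.
type imagePuller interface {
	Pull(image string) error
}

type alwaysSucceeds struct{}

func (alwaysSucceeds) Pull(image string) error { return nil }

// failingPuller wraps a puller and forces a "registry unavailable" error for
// one image, mimicking the kubelet modification described above.
type failingPuller struct {
	inner   imagePuller
	failFor string
}

func (f failingPuller) Pull(image string) error {
	if image == f.failFor {
		return errors.New("RegistryUnavailable: registry cannot be reached")
	}
	return f.inner.Pull(image)
}

func main() {
	p := failingPuller{inner: alwaysSucceeds{}, failFor: "openshift/hello-openshift"}
	for _, img := range []string{"openshift/hello-openshift", "gcr.io/google_containers/pause:2.0"} {
		fmt.Printf("pull %s: %v\n", img, p.Pull(img))
	}
}
```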
Upstream PR: https://github.com/kubernetes/kubernetes/pull/23746
and... is there any way to delete those pods? I've got so many pods in that state...
@metal3d oc delete pod/<name of pod> --grace-period=0 will force deletion.
A big thanks! @ncdc
I have 3 pods in Terminating status. Even the command from @ncdc hangs forever :(
Here is the detailed info:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x1952993]
goroutine 1 [running]:
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl.ReaperFor(0x0, 0x0, 0x3c2d039, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc421abe090)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/delete.go:82 +0x1373
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/util.(*ring1Factory).Reaper(0xc420c43590, 0xc4209caa10, 0x0, 0x0, 0xc421a9c001, 0x1)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/util/factory_object_mapping.go:302 +0x151
github.com/openshift/origin/pkg/oc/cli/util/clientcmd.(*ring1Factory).Reaper(0xc42039d420, 0xc4209caa10, 0x0, 0xc42164d608, 0x4f204a, 0xc420082000)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/oc/cli/util/clientcmd/factory_object_mapping.go:287 +0x93d
github.com/openshift/origin/pkg/oc/cli/util/clientcmd.(*Factory).Reaper(0xc420c435c0, 0xc4209caa10, 0x0, 0x0, 0x0, 0x0)
<autogenerated>:1 +0x47
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.ReapResult.func1(0xc4209cac40, 0x0, 0x0, 0x79c17e0, 0x4c47a6a)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:250 +0xe9
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.ContinueOnErrorVisitor.Visit.func1(0xc4209cac40, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:396 +0x164
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.DecoratedVisitor.Visit.func1(0xc4209cac40, 0x0, 0x0, 0x7, 0x1)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:372 +0xe7
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.FlattenListVisitor.Visit.func1(0xc4209cac40, 0x0, 0x0, 0xc421abba20, 0x7f10363e6228)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:433 +0x4fe
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.EagerVisitorList.Visit.func1(0xc4209cac40, 0x0, 0x0, 0x1, 0xc421abba20)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:255 +0x164
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*Info).Visit(0xc4209cac40, 0xc421abba20, 0x28, 0x464dba0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:105 +0x42
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.EagerVisitorList.Visit(0xc421a9c1a0, 0x1, 0x1, 0xc421abe2a0, 0x1, 0xc421abe2a0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:250 +0xea
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*EagerVisitorList).Visit(0xc421abb960, 0xc421abe2a0, 0x7f103654ad90, 0x0)
<autogenerated>:1 +0x58
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.FlattenListVisitor.Visit(0x79f65a0, 0xc421abb960, 0xc420cf0980, 0xc4213cc4c0, 0x60000000001, 0xc4213cc4c0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:428 +0x9e
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*FlattenListVisitor).Visit(0xc421abb980, 0xc4213cc4c0, 0x28, 0xc420034700)
<autogenerated>:1 +0x58
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.DecoratedVisitor.Visit(0x79f6620, 0xc421abb980, 0xc421a9c1b0, 0x2, 0x2, 0xc421abb9e0, 0xc421a9c101, 0xc421abb9e0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:363 +0x9b
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*DecoratedVisitor).Visit(0xc421abe270, 0xc421abb9e0, 0xc421a9c1c0, 0xc42164dbc8)
<autogenerated>:1 +0x62
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.ContinueOnErrorVisitor.Visit(0x79f6520, 0xc421abe270, 0xc4209cacb0, 0x0, 0x2)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/visitor.go:391 +0xe4
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*ContinueOnErrorVisitor).Visit(0xc421a9c1c0, 0xc4209cacb0, 0x415e18, 0x70)
<autogenerated>:1 +0x4f
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource.(*Result).Visit(0xc420463e80, 0xc4209cacb0, 0xc420cf6050, 0x7a36460)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/resource/result.go:98 +0x62
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.ReapResult(0xc420463e80, 0x7a4a1e0, 0xc420c435c0, 0x79efaa0, 0xc42000e018, 0xc420250001, 0x0, 0x1, 0xc421300001, 0x7a36460, ...)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:245 +0x16c
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.(*DeleteOptions).RunDelete(0xc42126ae70, 0xc420473200, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:235 +0xd2
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd.NewCmdDelete.func1(0xc420473200, 0xc42025ee00, 0x1, 0x2)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/delete.go:142 +0x178
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc420473200, 0xc42025ecc0, 0x2, 0x2, 0xc420473200, 0xc42025ecc0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:603 +0x234
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4203d4d80, 0x202c551, 0xc4203d4d80, 0xc420238270)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:689 +0x2fe
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4203d4d80, 0x2, 0xc4203d4d80)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:648 +0x2b
main.main()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/oc/oc.go:41 +0x293
@anandbaskaran if the command provided by @ncdc still hangs, you can try forcing it with --force. It gives:
oc delete pod/