Origin: Pods fail to start after server reboot

Created on 24 Apr 2018 · 8 comments · Source: openshift/origin

I am running origin 3.7, with both master and node on the same server.
After a server reboot, all pods fail to start with the following errors in the origin-node logs.
The only workaround I have found is to reset and recreate the docker storage. Is there a way to avoid or fix this bug?

Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.140764    2616 cni.go:304] Error deleting network when building cni runtime conf: could not retrieve port mappings: checkpoint is not found.
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141255    2616 remote_runtime.go:114] StopPodSandbox "d39009bb25733767895364d47c4c7b156df4c58b6e3f512a3894a08890987f11" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "hawkular-cassandra-1-p6lzt_openshift-infra" network: could not retrieve port mappings: checkpoint is not found.
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141340    2616 kuberuntime_manager.go:775] Failed to stop sandbox {"docker" "d39009bb25733767895364d47c4c7b156df4c58b6e3f512a3894a08890987f11"}
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141402    2616 remote_runtime.go:114] StopPodSandbox "75112e3ea11bdd0d2714fcd5afb1dd1f8ab69561ca2a980f4d6053fa073d27f7" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "controller-manager-fs9br_kube-service-catalog" network: could not retrieve port mappings: checkpoint is not found.
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141440    2616 kuberuntime_manager.go:775] Failed to stop sandbox {"docker" "75112e3ea11bdd0d2714fcd5afb1dd1f8ab69561ca2a980f4d6053fa073d27f7"}
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141496    2616 kuberuntime_manager.go:570] killPodWithSyncResult failed: failed to "KillPodSandbox" for "7ddf9c90-472a-11e8-a67e-005056827a01" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"controller-manager-fs9br_kube-service-catalog\" network: could not retrieve port mappings: checkpoint is not found."
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141539    2616 pod_workers.go:186] Error syncing pod 7ddf9c90-472a-11e8-a67e-005056827a01 ("controller-manager-fs9br_kube-service-catalog(7ddf9c90-472a-11e8-a67e-005056827a01)"), skipping: failed to "KillPodSandbox" for "7ddf9c90-472a-11e8-a67e-005056827a01" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"controller-manager-fs9br_kube-service-catalog\" network: could not retrieve port mappings: checkpoint is not found."
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141491    2616 kuberuntime_manager.go:570] killPodWithSyncResult failed: failed to "KillPodSandbox" for "db907be0-46f8-11e8-b5d9-005056827a01" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-p6lzt_openshift-infra\" network: could not retrieve port mappings: checkpoint is not found."
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141675    2616 pod_workers.go:186] Error syncing pod db907be0-46f8-11e8-b5d9-005056827a01 ("hawkular-cassandra-1-p6lzt_openshift-infra(db907be0-46f8-11e8-b5d9-005056827a01)"), skipping: failed to "KillPodSandbox" for "db907be0-46f8-11e8-b5d9-005056827a01" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-p6lzt_openshift-infra\" network: could not retrieve port mappings: checkpoint is not found."
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141749    2616 remote_runtime.go:114] StopPodSandbox "e8573132f565875633fc034ca16296030f989350290800271b210400f1b8212b" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "nodejs-mongodb-example-8-5n2jz_test1" network: could not retrieve port mappings: checkpoint is not found.
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141845    2616 kuberuntime_manager.go:775] Failed to stop sandbox {"docker" "e8573132f565875633fc034ca16296030f989350290800271b210400f1b8212b"}
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141896    2616 kuberuntime_manager.go:570] killPodWithSyncResult failed: failed to "KillPodSandbox" for "22fe7e4e-46fb-11e8-b5d9-005056827a01" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"nodejs-mongodb-example-8-5n2jz_test1\" network: could not retrieve port mappings: checkpoint is not found."
Apr 23 22:24:20 os-test3 origin-node[2542]: E0423 22:24:20.141941    2616 pod_workers.go:186] Error syncing pod 22fe7e4e-46fb-11e8-b5d9-005056827a01 ("nodejs-mongodb-example-8-5n2jz_test1(22fe7e4e-46fb-11e8-b5d9-005056827a01)"), skipping: failed to "KillPodSandbox" for "22fe7e4e-46fb-11e8-b5d9-005056827a01" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"nodejs-mongodb-example-8-5n2jz_test1\" network: could not retrieve port mappings: checkpoint is not found."
Apr 23 22:24:21 os-test3 origin-node[2542]: I0423 22:24:21.152335    2616 kuberuntime_manager.go:389] No ready sandbox for pod "hawkular-metrics-qlwst_openshift-infra(f1ab2b33-46f8-11e8-b5d9-005056827a01)" can be found. Need to start a new one
Version

openshift v3.7.1+282e43f-42
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Steps To Reproduce
  1. Reboot the server
Current Result
Expected Result
Additional Information
Labels: kind/question, lifecycle/rotten, sig/pod

Most helpful comment

@ustm @sjenning I think this is related to #19604.

Are you running the kubelet as a container? We discovered that in our setup the /var/lib/dockershim folder wasn't bind-mounted to the host for some reason, so on every reboot or atomic node restart the checkpoint files under /var/lib/dockershim were lost, because the folder is ephemeral inside the kubelet container.

All 8 comments

@openshift/sig-pod

The checkpoint the message refers to lives under /var/lib/dockershim/. You might try removing everything in that directory and running docker rm -f $(docker ps -a -q) to remove all existing docker containers and get back to a pristine state.
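The reset described above can be sketched as a small script. This is only a sketch of the suggested workaround, not an official recovery procedure: the CHECKPOINT_DIR variable and the two helper function names are my own, and both steps are destructive, so rehearse them against a scratch directory before touching a real node.

```shell
#!/bin/sh
# Sketch of the suggested reset. CHECKPOINT_DIR is parameterised (my own
# convention) so the wipe can be rehearsed on a scratch directory; the real
# path is /var/lib/dockershim. Destructive: intended for a node you are
# deliberately resetting.
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/var/lib/dockershim}"

wipe_checkpoints() {
    # Delete everything under the dockershim state directory, keeping the
    # directory itself so the kubelet can recreate checkpoints in it.
    if [ -d "$CHECKPOINT_DIR" ]; then
        find "$CHECKPOINT_DIR" -mindepth 1 -delete
    fi
}

remove_all_containers() {
    # Force-remove every docker container, running or stopped.
    ids=$(docker ps -a -q)
    [ -n "$ids" ] && docker rm -f $ids
}

# Usage on a live node (run as root, then restart origin-node):
#   wipe_checkpoints
#   remove_all_containers
```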

@ustm @sjenning I think this is related to #19604.

Are you running the kubelet as a container? We discovered that in our setup the /var/lib/dockershim folder wasn't bind-mounted to the host for some reason, so on every reboot or atomic node restart the checkpoint files under /var/lib/dockershim were lost, because the folder is ephemeral inside the kubelet container.
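A quick way to check for the situation described above is to look at the containerized kubelet's mounts and confirm /var/lib/dockershim is among them. A minimal sketch, assuming the kubelet container is named origin-node (adjust for your setup); the helper function is mine, not part of any tooling:

```shell
#!/bin/sh
# has_dockershim_mount MOUNTS: succeeds when /var/lib/dockershim appears in a
# space-separated list of container mount destinations.
has_dockershim_mount() {
    echo "$1" | tr ' ' '\n' | grep -qx /var/lib/dockershim
}

# On a live node (container name "origin-node" is an assumption):
#   mounts=$(docker inspect -f '{{range .Mounts}}{{.Destination}} {{end}}' origin-node)
#   if ! has_dockershim_mount "$mounts"; then
#       echo "dockershim dir is ephemeral: checkpoints will be lost on restart"
#   fi
```

If the directory is missing from the mount list, adding a host bind mount for /var/lib/dockershim to the kubelet container should let the checkpoints survive restarts.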

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
