Calico: Calico 3.6-3.8 Pending forever on single-node Kubeadm

Created on 3 Jul 2019 · 25 Comments · Source: projectcalico/calico

I am following the documentation on https://docs.projectcalico.org/v3.8/getting-started/kubernetes/ and pods are pending forever.

Expected Behavior

All pods reach Running in the output of watch kubectl get pods --all-namespaces.

Current Behavior

After watching kubectl get pods --all-namespaces for more than 15 minutes, several pods are still Pending:

NAMESPACE     NAME                                       READY   STATUS     RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-k8dhf   0/1     Pending    0          17m
kube-system   calico-node-r7v2n                          0/1     Init:0/3   0          17m
kube-system   coredns-5c98db65d4-5jrl2                   0/1     Pending    0          17m
kube-system   coredns-5c98db65d4-d2rc4                   0/1     Pending    0          17m
kube-system   etcd-cherokee                              1/1     Running    0          16m
kube-system   kube-apiserver-cherokee                    1/1     Running    0          16m
kube-system   kube-controller-manager-cherokee           1/1     Running    0          16m
kube-system   kube-proxy-hfp6c                           1/1     Running    0          17m
kube-system   kube-scheduler-cherokee                    1/1     Running    0          16m

Possible Solution

I don't know why, but version 3.5 just works:

curl -O \
  https://docs.projectcalico.org/v3.5/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
kubectl apply -f calico.yaml

I've tried 3.6, 3.7 and 3.8, with the same results.

Steps to Reproduce (for bugs)

Run:

sudo kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
watch kubectl get pods --all-namespaces

Wait 15 minutes

Context

I can't create a cluster with newer versions of Calico.

Your Environment

Calico version 3.6-3.8
Orchestrator version (e.g. kubernetes, mesos, rkt): kubeadm/kubernetes 1.15.0
Operating System and version: Ubuntu 18.04.2 LTS
Link to your project (optional): -

kind/support

staticdev

👍5

All 25 comments

I have the same problem.

withlin on 3 Jul 2019

Could it be a resource request? Maybe your node isn't big enough?

fasaxc on 3 Jul 2019

What does kubectl describe pod say for each of the non-running pods?

fasaxc on 3 Jul 2019

@fasaxc I don't think it is a resource problem: it is a dedicated machine with 4 cores and 16 GB of RAM, and nothing else is running on it.

staticdev on 3 Jul 2019

@fasaxc Here is the kubectl describe output for each non-running pod:

kubectl describe pod calico-kube-controllers-59f54d6bbc-gbj95 --namespace=kube-system
Name:                 calico-kube-controllers-59f54d6bbc-gbj95
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=calico-kube-controllers
                      pod-template-hash=59f54d6bbc
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Pending
IP:                   
Controlled By:        ReplicaSet/calico-kube-controllers-59f54d6bbc
Containers:
  calico-kube-controllers:
    Image:      calico/kube-controllers:v3.8.0
    Port:       <none>
    Host Port:  <none>
    Readiness:  exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ENABLED_CONTROLLERS:  node
      DATASTORE_TYPE:       kubernetes
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from calico-kube-controllers-token-778vt (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  calico-kube-controllers-token-778vt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-kube-controllers-token-778vt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  41s (x7 over 6m35s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

---

kubectl describe pod calico-node-flgbf --namespace=kube-system
Name:                 calico-node-flgbf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 cherokee/150.164.7.70
Start Time:           Wed, 03 Jul 2019 06:57:19 -0300
Labels:               controller-revision-hash=844ddd97c6
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Pending
IP:                   150.164.7.70
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://8b0acccf0d1f633b1af29d8cfe2f5b45a53b074e16da4d74b0eca79f4df2ecc6
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Running
      Started:      Wed, 03 Jul 2019 06:57:23 -0300
    Ready:          False
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
  install-cni:
    Container ID:  
    Image:         calico/cni:v3.8.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
  flexvol-driver:
    Container ID:   
    Image:          calico/pod2daemon-flexvol:v3.8.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
Containers:
  calico-node:
    Container ID:   
    Image:          calico/node:v3.8.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      250m
    Liveness:   http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:  
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-n5wk2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-n5wk2
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  7m11s  default-scheduler  Successfully assigned kube-system/calico-node-flgbf to cherokee
  Normal  Pulled     7m8s   kubelet, cherokee  Container image "calico/cni:v3.8.0" already present on machine
  Normal  Created    7m7s   kubelet, cherokee  Created container upgrade-ipam
  Normal  Started    7m7s   kubelet, cherokee  Started container upgrade-ipam

---

kubectl describe pod coredns-5c98db65d4-wpc7p --namespace=kube-system
Name:                 coredns-5c98db65d4-wpc7p
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=kube-dns
                      pod-template-hash=5c98db65d4
Annotations:          <none>
Status:               Pending
IP:                   
Controlled By:        ReplicaSet/coredns-5c98db65d4
Containers:
  coredns:
    Image:       k8s.gcr.io/coredns:1.3.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-nt88z (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-nt88z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-nt88z
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  43s (x7 over 8m18s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

---

kubectl describe pod coredns-5c98db65d4-z7sgv --namespace=kube-system
Name:                 coredns-5c98db65d4-z7sgv
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=kube-dns
                      pod-template-hash=5c98db65d4
Annotations:          <none>
Status:               Pending
IP:                   
Controlled By:        ReplicaSet/coredns-5c98db65d4
Containers:
  coredns:
    Image:       k8s.gcr.io/coredns:1.3.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-nt88z (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-nt88z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-nt88z
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  76s (x8 over 8m51s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

staticdev on 3 Jul 2019

OK, so most of the pods are failing to schedule because calico/node hasn't started yet, I think. What about logs for calico/node? That might tell us why the init container isn't finishing.
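The FailedScheduling events above do not say which taint is blocking the pods. A minimal sketch for listing the taints on the node, assuming the node name cherokee from the describe output; node_taints and node_not_ready_taint are hypothetical helper names, not kubectl features. On a node without a working CNI plugin the kubelet typically reports NotReady, which normally appears as a node.kubernetes.io/not-ready taint with effect NoSchedule, while the coredns and calico-kube-controllers tolerations listed above only cover the NoExecute variant.

```shell
#!/bin/sh
# Print "key:effect" for every taint on a node, one per line.
# Usage: node_taints <node-name>
node_taints() {
    kubectl get node "$1" \
        -o jsonpath='{range .spec.taints[*]}{.key}:{.effect}{"\n"}{end}'
}

# Succeed if the node carries any not-ready taint.
node_not_ready_taint() {
    node_taints "$1" | grep -q '^node\.kubernetes\.io/not-ready:'
}
```

Running node_taints cherokee before and after calico-node becomes Ready should show whether the not-ready taint is what holds the other pods back.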

fasaxc on 3 Jul 2019

@fasaxc I've posted the kubectl describe pod calico-node-flgbf --namespace=kube-system output for calico-node. How can I get logs from a pod that hasn't started?

staticdev on 3 Jul 2019

I think kubectl logs can show the init container's log.

fasaxc on 3 Jul 2019

@fasaxc The only output I get (the other pending pods have no output):

kubectl logs calico-node-f8hb5 --namespace=kube-system
Error from server (BadRequest): container "calico-node" in pod "calico-node-f8hb5" is waiting to start: PodInitializing

staticdev on 3 Jul 2019

This explains how to get the log: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/
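In short, kubectl logs needs the init container's name passed with -c. A minimal sketch that loops over every init container of a pod, assuming the pods live in kube-system as in this thread; init_logs is a hypothetical helper name.

```shell
#!/bin/sh
# Print the most recent log lines of every init container in a pod.
# Usage: init_logs <pod> [namespace]   (namespace defaults to kube-system)
init_logs() {
    pod=$1
    ns=${2:-kube-system}
    # Names of the init containers declared in the pod spec, space-separated.
    names=$(kubectl get pod "$pod" -n "$ns" \
        -o jsonpath='{.spec.initContainers[*].name}') || return 1
    for c in $names; do
        echo "=== init container: $c ==="
        # --tail keeps the output short when a container loops on one error.
        # Containers that have not started yet make kubectl fail; keep going.
        kubectl logs "$pod" -n "$ns" -c "$c" --tail=20 || true
    done
}
```

For the pod in this issue that would be init_logs calico-node-f8hb5; only upgrade-ipam is expected to print anything while the later init containers are still waiting.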

fasaxc on 3 Jul 2019

👍1

The upgrade-ipam init container repeats the same error every second:

2019-07-03 20:57:26.014 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:26.014 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:26.019 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:26.024 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:26.026 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"
2019-07-03 20:57:27.027 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:27.027 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:27.028 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:27.030 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:27.031 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"
2019-07-03 20:57:28.031 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:28.031 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:28.033 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:28.036 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:28.038 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"
2019-07-03 20:57:29.038 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:29.038 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:29.041 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:29.044 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:29.045 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"

install-cni and flexvol-driver are waiting to start.

staticdev on 3 Jul 2019

Same behavior on my newborn cluster. My cluster is initialized using kubeadm, and applying calico manifests from either 3.6 or 3.8 leads to these errors in the upgrade-ipam init container.

As I'm on a fresh installation that doesn't need this "upgrade IPAM" stage (AFAIK), I tried deleting this init container from the manifest (`kubectl edit daemonset -n kube-system calico-node`) and everything went fine, issue resolved.
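The same edit can be made non-interactively with a JSON patch. This is only a sketch of the workaround described above, not a recommended fix: it assumes the DaemonSet is named calico-node in kube-system as in the v3.8 manifest, and upgrade_ipam_patch is a hypothetical helper name. The helper looks up the container's position instead of hard-coding index 0.

```shell
#!/bin/sh
# Print a JSON patch that removes the upgrade-ipam init container from the
# calico-node DaemonSet. Prints nothing and returns 1 if it is not present.
upgrade_ipam_patch() {
    names=$(kubectl get daemonset calico-node -n kube-system \
        -o jsonpath='{.spec.template.spec.initContainers[*].name}') || return 1
    i=0
    for n in $names; do
        if [ "$n" = "upgrade-ipam" ]; then
            printf '[{"op":"remove","path":"/spec/template/spec/initContainers/%d"}]\n' "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

The printed patch can then be applied with kubectl patch daemonset calico-node -n kube-system --type=json -p "$(upgrade_ipam_patch)". Re-applying the upstream manifest puts the init container back.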

demikl on 5 Jul 2019

I've found why the upgrade-ipam init container loops endlessly through errors and so prevents the next init container, install-cni, from running: this folder exists on my server:

/var/lib/cni/networks/k8s-pod-network

I think it comes from a previous installation. Running kubeadm reset was not enough to remove it.

Source : https://github.com/projectcalico/cni-plugin/blob/v3.8.0/pkg/upgrade/migrate.go#L66
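A quick way to check a node for this leftover before installing Calico, as a minimal sketch: the default path is the one reported above, and stale_ipam_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a leftover host-local IPAM directory exists.
# Usage: stale_ipam_dir [dir]   (defaults to the path from this thread)
stale_ipam_dir() {
    dir=${1:-/var/lib/cni/networks/k8s-pod-network}
    if [ -d "$dir" ]; then
        echo "stale: $dir exists"
        return 0
    fi
    echo "clean: $dir not found"
    return 1
}
```

If the function reports stale on a node that is supposed to be a fresh install, the upgrade-ipam loop described in this thread is a likely outcome.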

demikl on 5 Jul 2019

👍2

@demikl can it be considered a Calico bug or kubeadm bug? I will try deleting it tomorrow. Thanks

staticdev on 6 Jul 2019

I have already fixed it.

kubeadm reset
ifconfig cni0 down
ip link delete cni0
ifconfig flannel.1 down
ip link delete flannel.1
rm -rf /var/lib/cni/

and then execute kubeadm init.....
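Not every node has cni0 or flannel.1, and ifconfig is missing on some distributions, so a more defensive variant may help. The sketch below only prints the commands that apply to the current host, using ip link set ... down in place of ifconfig. It is meant to be run after kubeadm reset; cni_cleanup_plan and the CNI_DIR and SYS_NET overrides are hypothetical names introduced here.

```shell
#!/bin/sh
# Print (do not run) the cleanup commands that apply to this host.
# CNI_DIR and SYS_NET can be overridden; the defaults match this thread.
cni_cleanup_plan() {
    cni_dir=${CNI_DIR:-/var/lib/cni}
    sys_net=${SYS_NET:-/sys/class/net}
    for ifc in cni0 flannel.1; do
        # Only list interfaces that actually exist on this host.
        if [ -e "$sys_net/$ifc" ]; then
            echo "ip link set $ifc down"
            echo "ip link delete $ifc"
        fi
    done
    # The leftover IPAM state lives under this directory.
    if [ -d "$cni_dir" ]; then
        echo "rm -rf $cni_dir"
    fi
}
```

Review the printed lines, run the ones you agree with as root, and then run kubeadm init again.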

withlin on 6 Jul 2019

👍5 ❤1 🎉1

Deleting /var/lib/cni/ solves the problem. Are you doing a patch for migrate.go @withlin?

staticdev on 6 Jul 2019

No. Just reinstall and it works.

withlin on 6 Jul 2019

Maybe it's worth at least adding that to calico getting started documentation as a note.

staticdev on 6 Jul 2019

@staticdev Have you solved the problem?

withlin on 7 Jul 2019

@withlin as I said yesterday: "Deleting /var/lib/cni/ solves the problem". =)

staticdev on 7 Jul 2019

👍5

OK. I think you can close the issue. Thanks.

withlin on 7 Jul 2019

@withlin Shouldn't this information be added in the documentation to prevent future issues like this one?

staticdev on 7 Jul 2019

@staticdev yes, that'd make a nice PR.

fasaxc on 8 Jul 2019

2019-07-03 20:57:26.026 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool

A docs change is probably good enough, though I feel we should be able to remove the need for one with some code adjustments. Ideally a kubeadm reset would be enough, but it seems to leave behind some cruft on the node that tricks Calico into thinking it is doing an upgrade rather than a fresh installation.

I think the following sounds like a reasonable solution:

Update kubeadm reset documentation to include that this directory should be removed.
Update the init script to check more explicitly if it is doing an upgrade. Straw man: we can check the ClusterInformation CRD that Calico writes to see if this is a new cluster or not. If it is a new cluster, we can skip the upgrade altogether.
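The straw man in the second item could look roughly like the sketch below. It is an illustration of the decision only, not the actual init container code: it assumes the Kubernetes datastore, where Calico stores a ClusterInformation object named default under the clusterinformations.crd.projectcalico.org resource, and calico_install_mode is a hypothetical helper name.

```shell
#!/bin/sh
# Print "upgrade" if Calico has already written its ClusterInformation
# object, otherwise "fresh".
calico_install_mode() {
    # --ignore-not-found makes a missing object print nothing instead of
    # failing. Any kubectl error (for example the CRD not being installed)
    # is treated as "no object found" in this sketch.
    out=$(kubectl get clusterinformations.crd.projectcalico.org default \
        --ignore-not-found -o name 2>/dev/null) || out=""
    if [ -n "$out" ]; then
        echo upgrade
    else
        echo fresh
    fi
}
```

One limitation of this sketch is that an unreachable API server also yields fresh, so a real implementation would need to tell "not found" apart from other errors before skipping the migration.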