Calico: Calico 3.6-3.8 Pending forever on single-node Kubeadm

Created on 3 Jul 2019 · 25 Comments · Source: projectcalico/calico

I am following the documentation on https://docs.projectcalico.org/v3.8/getting-started/kubernetes/ and pods are pending forever.

Expected Behavior

All pods running in watch kubectl get pods --all-namespaces.

Current Behavior

After watching kubectl get pods --all-namespaces for over 15 minutes, the pods are still pending:

NAMESPACE     NAME                                       READY   STATUS     RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-k8dhf   0/1     Pending    0          17m
kube-system   calico-node-r7v2n                          0/1     Init:0/3   0          17m
kube-system   coredns-5c98db65d4-5jrl2                   0/1     Pending    0          17m
kube-system   coredns-5c98db65d4-d2rc4                   0/1     Pending    0          17m
kube-system   etcd-cherokee                              1/1     Running    0          16m
kube-system   kube-apiserver-cherokee                    1/1     Running    0          16m
kube-system   kube-controller-manager-cherokee           1/1     Running    0          16m
kube-system   kube-proxy-hfp6c                           1/1     Running    0          17m
kube-system   kube-scheduler-cherokee                    1/1     Running    0          16m

Possible Solution

I don't know why, but version 3.5 just works:

 curl \
   https://docs.projectcalico.org/v3.5/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml \
   -O
 kubectl apply -f calico.yaml

I've tried 3.6, 3.7 and 3.8, with the same results.

Steps to Reproduce (for bugs)

  1. Exec:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
watch kubectl get pods --all-namespaces
  2. Wait 15 minutes

Context

I can't create a cluster with newer versions of Calico.

Your Environment

  • Calico version 3.6-3.8
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubeadm/kubernetes 1.15.0
  • Operating System and version: Ubuntu 18.04.2 LTS
  • Link to your project (optional): -

All 25 comments

I have the same problem.

Could it be a resource request? Maybe your node isn't big enough?

What does kubectl describe pod say for each of the non-running pods?

@fasaxc I don't think it is a resource problem: it is a dedicated machine with 4 cores and 16 GB of RAM, and nothing else is running on it.

@fasaxc Here is the kubectl describe output for each of the non-running pods:

kubectl describe pod calico-kube-controllers-59f54d6bbc-gbj95 --namespace=kube-system
Name:                 calico-kube-controllers-59f54d6bbc-gbj95
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=calico-kube-controllers
                      pod-template-hash=59f54d6bbc
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Pending
IP:                   
Controlled By:        ReplicaSet/calico-kube-controllers-59f54d6bbc
Containers:
  calico-kube-controllers:
    Image:      calico/kube-controllers:v3.8.0
    Port:       <none>
    Host Port:  <none>
    Readiness:  exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ENABLED_CONTROLLERS:  node
      DATASTORE_TYPE:       kubernetes
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from calico-kube-controllers-token-778vt (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  calico-kube-controllers-token-778vt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-kube-controllers-token-778vt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  41s (x7 over 6m35s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

---

kubectl describe pod calico-node-flgbf --namespace=kube-system
Name:                 calico-node-flgbf
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 cherokee/150.164.7.70
Start Time:           Wed, 03 Jul 2019 06:57:19 -0300
Labels:               controller-revision-hash=844ddd97c6
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Pending
IP:                   150.164.7.70
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://8b0acccf0d1f633b1af29d8cfe2f5b45a53b074e16da4d74b0eca79f4df2ecc6
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Running
      Started:      Wed, 03 Jul 2019 06:57:23 -0300
    Ready:          False
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
  install-cni:
    Container ID:  
    Image:         calico/cni:v3.8.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
  flexvol-driver:
    Container ID:   
    Image:          calico/pod2daemon-flexvol:v3.8.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
Containers:
  calico-node:
    Container ID:   
    Image:          calico/node:v3.8.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      250m
    Liveness:   http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-n5wk2 (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:  
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-n5wk2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-n5wk2
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  7m11s  default-scheduler  Successfully assigned kube-system/calico-node-flgbf to cherokee
  Normal  Pulled     7m8s   kubelet, cherokee  Container image "calico/cni:v3.8.0" already present on machine
  Normal  Created    7m7s   kubelet, cherokee  Created container upgrade-ipam
  Normal  Started    7m7s   kubelet, cherokee  Started container upgrade-ipam

---

kubectl describe pod coredns-5c98db65d4-wpc7p --namespace=kube-system
Name:                 coredns-5c98db65d4-wpc7p
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=kube-dns
                      pod-template-hash=5c98db65d4
Annotations:          <none>
Status:               Pending
IP:                   
Controlled By:        ReplicaSet/coredns-5c98db65d4
Containers:
  coredns:
    Image:       k8s.gcr.io/coredns:1.3.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-nt88z (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-nt88z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-nt88z
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  43s (x7 over 8m18s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

---

kubectl describe pod coredns-5c98db65d4-z7sgv --namespace=kube-system
Name:                 coredns-5c98db65d4-z7sgv
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=kube-dns
                      pod-template-hash=5c98db65d4
Annotations:          <none>
Status:               Pending
IP:                   
Controlled By:        ReplicaSet/coredns-5c98db65d4
Containers:
  coredns:
    Image:       k8s.gcr.io/coredns:1.3.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-nt88z (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-nt88z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-nt88z
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  76s (x8 over 8m51s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

OK, so most of the pods are failing to schedule because calico/node hasn't started yet, I think. What about logs for calico/node? That might tell us why the init container isn't finishing.
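
One way to check that guess is to list the taints on each node. This is a minimal sketch, not something posted in the thread: the node_taints helper name is made up for illustration, and it assumes kubectl is already configured for the cluster. A node whose network plugin has not come up typically carries a node.kubernetes.io/not-ready taint, which would be consistent with the FailedScheduling events above.

```shell
# List every node followed by the keys of its taints (hypothetical helper).
node_taints() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
}

# Usage:
#   node_taints
```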

@fasaxc I've posted the kubectl describe pod calico-node-flgbf --namespace=kube-system output for calico-node. How can I get logs from a pod that hasn't started?

I think kubectl logs can show the init container's log.
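
For reference, kubectl logs accepts a -c flag that selects a specific container, and that includes init containers. A small sketch, assuming the init container names from the describe output above (upgrade-ipam, install-cni, flexvol-driver); the init_container_logs wrapper is a hypothetical name used only for illustration.

```shell
# Print the logs of one init container of a pod in kube-system.
# $1: pod name, $2: init container name (defaults to upgrade-ipam)
init_container_logs() {
  kubectl logs "$1" --namespace=kube-system -c "${2:-upgrade-ipam}"
}

# Usage:
#   init_container_logs calico-node-f8hb5
#   init_container_logs calico-node-f8hb5 install-cni
```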

@fasaxc The only output I get (the other pending pods have no output):

kubectl logs calico-node-f8hb5 --namespace=kube-system
Error from server (BadRequest): container "calico-node" in pod "calico-node-f8hb5" is waiting to start: PodInitializing

The upgrade-ipam init container logs the same error over and over:

2019-07-03 20:57:26.014 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:26.014 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:26.019 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:26.024 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:26.026 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"
2019-07-03 20:57:27.027 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:27.027 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:27.028 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:27.030 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:27.031 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"
2019-07-03 20:57:28.031 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:28.031 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:28.033 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:28.036 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:28.038 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"
2019-07-03 20:57:29.038 [INFO][1] migrate.go 65: checking host-local IPAM data dir dir existence...
2019-07-03 20:57:29.038 [INFO][1] migrate.go 72: retrieving node for IPIP tunnel address
2019-07-03 20:57:29.041 [INFO][1] migrate.go 80: IPIP tunnel address not found, assigning...
2019-07-03 20:57:29.044 [INFO][1] ipam.go 583: Assigning IP 192.168.0.1 to host: kadet
2019-07-03 20:57:29.045 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool
 node="kadet"

install-cni and flexvol-driver are waiting to start.

Same behavior on my newborn cluster. My cluster is initialized using kubeadm, and applying calico manifests from either 3.6 or 3.8 leads to these errors in the upgrade-ipam init container.

As I'm on a fresh installation that doesn't need this "upgrade IPAM" stage (AFAIK), I've tried to delete this init container from the manifest (`kubectl edit daemonset -n kube-system calico-node`) and everything went fine, issue resolved.

I've found why the upgrade-ipam init container loops endlessly through errors, which prevents the next init container, install-cni, from running. This directory exists on my server:

/var/lib/cni/networks/k8s-pod-network

I think it comes from a previous installation. Running kubeadm reset was not enough to remove it.

Source: https://github.com/projectcalico/cni-plugin/blob/v3.8.0/pkg/upgrade/migrate.go#L66
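
To see whether a node is in the same state before applying the manifest, the directory check can be sketched as below. This is only an illustration: has_leftover_ipam_dir is a hypothetical helper, and the default path is simply the one reported above, so other setups may use a different location.

```shell
# Report whether a host-local IPAM data directory is left over on this node.
# $1: directory to check (defaults to the path reported in this thread)
has_leftover_ipam_dir() {
  dir="${1:-/var/lib/cni/networks/k8s-pod-network}"
  if [ -d "$dir" ]; then
    echo "found: $dir"
    return 0
  fi
  echo "not found: $dir"
  return 1
}

# Usage:
#   has_leftover_ipam_dir
```

If the directory is found on what should be a fresh node, it is probably left over from an earlier CNI installation, as described above.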

@demikl Should this be considered a Calico bug or a kubeadm bug? I will try deleting it tomorrow. Thanks

I have already fixed it.

kubeadm reset
ifconfig cni0 down
ip link delete cni0
ifconfig flannel.1 down
ip link delete flannel.1
rm -rf /var/lib/cni/

and then execute kubeadm init ...
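
A slightly more defensive variant of the same cleanup, as a sketch rather than a tested recipe: it only touches interfaces that exist and uses ip link set ... down in place of ifconfig ... down. The cleanup_cni_leftovers name is hypothetical; the interface names and directory are the ones from the commands above. It has to run as root, after kubeadm reset, and it deletes the whole directory it is given.

```shell
# Remove leftover CNI interfaces and state after `kubeadm reset` (run as root).
# $1: CNI state directory to delete (defaults to /var/lib/cni)
cleanup_cni_leftovers() {
  cni_dir="${1:-/var/lib/cni}"
  for iface in cni0 flannel.1; do
    # Skip interfaces that are not present on this host.
    if ip link show "$iface" >/dev/null 2>&1; then
      ip link set "$iface" down
      ip link delete "$iface"
    fi
  done
  rm -rf "$cni_dir"
}

# Usage:
#   kubeadm reset
#   cleanup_cni_leftovers
```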

Deleting /var/lib/cni/ solves the problem. Are you doing a patch for migrate.go @withlin?

No. Just re-install and it is OK.

Maybe it's worth at least adding that to calico getting started documentation as a note.

@staticdev Have you solved the problem?

@withlin as I said yesterday: "Deleting /var/lib/cni/ solves the problem". =)

OK. I think you can close the issue. Thanks.

@withlin Shouldn't this information be added in the documentation to prevent future issues like this one?

@staticdev yes, that'd make a nice PR.

2019-07-03 20:57:26.026 [ERROR][1] ipam_plugin.go 95: failed to migrate ipam, retrying... error=failed to get add IPIP tunnel addr 192.168.0.1: The provided IP address is not in a configured pool

A docs change is probably good enough, though I feel we should be able to remove the need for one with some code adjustments. Ideally a kubeadm reset would be enough, but it seems to leave behind some cruft on the node that tricks Calico into thinking it is doing an upgrade rather than a fresh installation.

I think the following sounds like a reasonable solution:

  • Update kubeadm reset documentation to include that this directory should be removed.
  • Update the init script to check more explicitly if it is doing an upgrade. Straw man: we can check the ClusterInformation CRD that Calico writes to see if this is a new cluster or not. If it is a new cluster, we can skip the upgrade altogether.
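
The second point can be approximated from the command line. This is a rough sketch of the idea only, not the change proposed for the init script itself: it assumes the Kubernetes datastore (as in the describe output earlier in the thread) and that Calico stores its ClusterInformation under the name default. The calico_already_installed helper is a hypothetical name.

```shell
# Return 0 if a ClusterInformation resource named "default" exists,
# which would suggest Calico has run on this cluster before.
calico_already_installed() {
  kubectl get clusterinformations.crd.projectcalico.org default >/dev/null 2>&1
}

# Usage:
#   if calico_already_installed; then
#     echo "existing cluster: an IPAM upgrade may apply"
#   else
#     echo "new cluster: the upgrade step could be skipped"
#   fi
```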

The reason for this happening is that the pod has no tolerations for running on master nodes.
