Hello,
I'm experiencing some DNS issues with calico on kops 1.6 and Kubernetes 1.6.2 (AWS cloud).
Steps to reproduce:
1- Create a cluster from scratch. I've used this command:
kops create cluster --zones eu-central-1b --node-size t2.medium --master-size t2.medium --vpc vpc-38094c51 --networking calico $NAME
2- After the cluster is created, kube-dns fails because the configure-calico job fails as well:
kube-system kube-dns-1321724180-gdfjs 0/3 ContainerCreating 0 16m
kube-system kube-dns-autoscaler-265231812-cc19z 0/1 ContainerCreating 0 16m
kubectl get jobs --all-namespaces
NAMESPACE NAME DESIRED SUCCESSFUL AGE
kube-system configure-calico 1 0 17m
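For anyone reproducing this, the following should surface why the job and the pending kube-dns pods are stuck (names taken from the output above; adjust to your cluster):
kubectl -n kube-system describe job configure-calico          # events for the job's failed pods
kubectl -n kube-system logs job/configure-calico              # job pod logs, if the pod got far enough to start
kubectl -n kube-system describe pod kube-dns-1321724180-gdfjs # usually shows the CNI/sandbox error keeping it in ContainerCreating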
3- Executing kops rolling-update cluster --force --yes eventually solves this problem:
kube-system kube-dns-1321724180-nnjwf 3/3 Running 0 4m
kube-system kube-dns-1321724180-s2p7j 3/3 Running 0 4m
kubectl get jobs --all-namespaces
NAMESPACE NAME DESIRED SUCCESSFUL AGE
kube-system configure-calico 1 1 42m
4- At this point, creating a new pod causes temporary DNS resolution failures that resolve themselves some time after the pod is created (in fact the kube-dns pods are running fine, as are the calico ones):
root@test-803530966-gvhc6:/# apt update
Err:1 http://security.ubuntu.com/ubuntu xenial-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:2 http://archive.ubuntu.com/ubuntu xenial InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
root@test-803530966-gvhc6:/# apt update
Err:1 http://archive.ubuntu.com/ubuntu xenial InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu xenial-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Get:4 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [102 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [4927 B]
Get:6 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [6022 B]
Fetched 113 kB in 40s (2814 B/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease Temporary failure resolving 'security.ubuntu.com'
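To separate "cluster DNS is broken" from "outbound traffic is broken", a quick check from inside an affected pod (assuming nslookup is available in the image) would be something like:
cat /etc/resolv.conf                            # nameserver should be the kube-dns ClusterIP (100.64.0.10 on a default kops service CIDR)
nslookup kubernetes.default.svc.cluster.local   # internal name, answered by kube-dns
nslookup archive.ubuntu.com                     # external name, forwarded upstream by kube-dns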
While this happens, there's nothing in the kube-dns logs indicating that something went wrong:
kubectl logs kube-dns-1321724180-nnjwf --namespace=kube-system kubedns
I0531 08:17:31.000131 1 dns.go:49] version: v1.5.2-beta.0+$Format:%h$
I0531 08:17:31.000712 1 server.go:70] Using configuration read from directory: /kube-dns-config%!(EXTRA time.Duration=10s)
I0531 08:17:31.000764 1 server.go:112] FLAG: --alsologtostderr="false"
I0531 08:17:31.000777 1 server.go:112] FLAG: --config-dir="/kube-dns-config"
I0531 08:17:31.000783 1 server.go:112] FLAG: --config-map=""
I0531 08:17:31.000786 1 server.go:112] FLAG: --config-map-namespace="kube-system"
I0531 08:17:31.000790 1 server.go:112] FLAG: --config-period="10s"
I0531 08:17:31.000794 1 server.go:112] FLAG: --dns-bind-address="0.0.0.0"
I0531 08:17:31.000796 1 server.go:112] FLAG: --dns-port="10053"
I0531 08:17:31.000802 1 server.go:112] FLAG: --domain="cluster.local."
I0531 08:17:31.000815 1 server.go:112] FLAG: --federations=""
I0531 08:17:31.000819 1 server.go:112] FLAG: --healthz-port="8081"
I0531 08:17:31.000846 1 server.go:112] FLAG: --initial-sync-timeout="1m0s"
I0531 08:17:31.000850 1 server.go:112] FLAG: --kube-master-url=""
I0531 08:17:31.000856 1 server.go:112] FLAG: --kubecfg-file=""
I0531 08:17:31.000858 1 server.go:112] FLAG: --log-backtrace-at=":0"
I0531 08:17:31.002575 1 server.go:112] FLAG: --log-dir=""
I0531 08:17:31.002595 1 server.go:112] FLAG: --log-flush-frequency="5s"
I0531 08:17:31.002610 1 server.go:112] FLAG: --logtostderr="true"
I0531 08:17:31.002625 1 server.go:112] FLAG: --nameservers=""
I0531 08:17:31.002638 1 server.go:112] FLAG: --stderrthreshold="2"
I0531 08:17:31.002650 1 server.go:112] FLAG: --v="2"
I0531 08:17:31.002683 1 server.go:112] FLAG: --version="false"
I0531 08:17:31.002702 1 server.go:112] FLAG: --vmodule=""
I0531 08:17:31.002766 1 server.go:175] Starting SkyDNS server (0.0.0.0:10053)
I0531 08:17:31.003038 1 server.go:197] Skydns metrics enabled (/metrics:10055)
I0531 08:17:31.003059 1 dns.go:147] Starting endpointsController
I0531 08:17:31.003071 1 dns.go:150] Starting serviceController
I0531 08:17:31.003593 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0531 08:17:31.003611 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0531 08:17:31.030806 1 dns.go:264] New service: kubernetes
I0531 08:17:31.030909 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.030948 1 dns.go:264] New service: kube-dns
I0531 08:17:31.030974 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.031105 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.503315 1 dns.go:171] Initialized services and endpoints from apiserver
I0531 08:17:31.503337 1 server.go:128] Setting up Healthz Handler (/readiness)
I0531 08:17:31.503343 1 server.go:133] Setting up cache handler (/cache)
I0531 08:17:31.503348 1 server.go:119] Status HTTP port 8081
I0531 08:22:31.004225 1 dns.go:264] New service: kubernetes
I0531 08:22:31.004457 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:22:31.004583 1 dns.go:264] New service: kube-dns
I0531 08:22:31.004634 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:22:31.004673 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004378 1 dns.go:264] New service: kubernetes
I0531 08:27:31.004467 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004502 1 dns.go:264] New service: kube-dns
I0531 08:27:31.004528 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004549 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
kubectl logs kube-dns-1321724180-nnjwf --namespace=kube-system dnsmasq
I0531 08:17:36.426853 1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0531 08:17:36.426935 1 nanny.go:86] Starting dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053]
I0531 08:17:36.512105 1 nanny.go:111]
W0531 08:17:36.512226 1 nanny.go:112] Got EOF from stdout
I0531 08:17:36.512197 1 nanny.go:108] dnsmasq[11]: started, version 2.76 cachesize 1000
I0531 08:17:36.512253 1 nanny.go:108] dnsmasq[11]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0531 08:17:36.512264 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0531 08:17:36.512268 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0531 08:17:36.512271 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0531 08:17:36.512277 1 nanny.go:108] dnsmasq[11]: reading /etc/resolv.conf
I0531 08:17:36.512281 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0531 08:17:36.512284 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0531 08:17:36.512289 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0531 08:17:36.512292 1 nanny.go:108] dnsmasq[11]: using nameserver 10.0.0.2#53
I0531 08:17:36.512296 1 nanny.go:108] dnsmasq[11]: read /etc/hosts - 7 addresses
kubectl logs kube-dns-1321724180-nnjwf --namespace=kube-system sidecar
ERROR: logging before flag.Parse: I0531 08:17:39.156690 1 main.go:49] Version v1.14.0-2-g38b2ca3
ERROR: logging before flag.Parse: I0531 08:17:39.156729 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0531 08:17:39.156811 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0531 08:17:39.156881 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
@ottoyiu any ideas?
@lc4ntoni-centralway This seems like a service-account race condition that was fixed by #2590 and #2641, which are not part of a kops release yet. Just to confirm, are you using the tagged release of kops 1.6?
@ottoyiu thanks for replying. I'm using the kops version shipped in brew. sha1:
5a66319cf9f943015f10ae24764e899cbe5d5445 /usr/local/Cellar/kops/1.6.0/bin/kops
Not ideal, but try building kops from master and see if that solves it; I'm pretty sure it's no longer a problem there.
@justinsb is working on a 1.6.1 release which will include the fixes.
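Once 1.6.1 is out, the upgrade path should be the usual one (sketch, adjust $NAME to your cluster):
brew upgrade kops
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes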
Alright, I understand the fix needs to be included in the next release. I'll close this, then... thanks a lot for your support.
Hi guys! I'm running kops 1.6.2 (git-469d82d0b) and created a Kubernetes 1.6.6 cluster with calico and I'm facing the same issues. Has this been fixed yet?
Just to add more context: I'm running CentOS 7. I also noticed that the issue seems to occur on recently started nodes (e.g. after the cluster-autoscaler scales up).
@lc4ntoni-centralway was your problem solved?
We're seeing this too but haven't been able to pin it down. We saw it most recently on a cluster created with kops v1.7.0 and k8s v1.6.2.
Was it DNS that wasn't working, or were Service IPs also affected? The reason I ask is that DNS is resolved through kube-dns's service IP.
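A quick way to tell the two apart (the pod IP is a placeholder, and 100.64.0.10 assumes the default kops service CIDR) is to query a kube-dns pod IP directly and then the service IP; if the pod answers but the service IP doesn't, it's the service path rather than kube-dns itself:
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide         # grab a kube-dns pod IP
dig +short kubernetes.default.svc.cluster.local @<kube-dns-pod-ip>  # direct to the pod
dig +short kubernetes.default.svc.cluster.local @100.64.0.10        # via the service IP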
@ottoyiu OK, so I tested for that. Apparently Service IPs are not working. I wrote a simple container that tries to wget my-nginx-service-ip over and over until it responds. It failed for a few requests (timeouts), and about 2-3 minutes after starting it succeeded.
This pod was run on a recently started node. Do these diagnostics give you any ideas?
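Roughly, the loop inside the container looks like this (the service IP is a placeholder):
SVC_IP=<my-nginx-service-ip>
while true; do
  date
  wget -q -T 2 -O /dev/null "http://${SVC_IP}/" && echo ok || echo timeout
  sleep 5
done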
@chrislovecnm can we reopen this since a number of us are still seeing this? I'm seeing the same behavior as @Cisneiros.
+1 for this on CentOS 7 with k8s 1.6.7. Services are continually re-registered every 5 minutes like @lc4ntoni-centralway reports. Other containers in the kube-dns pod report no errors.
I0815 13:40:58.212189 1 dns.go:264] New service: service-grafana
I0815 13:40:58.212305 1 dns.go:264] New service: kube-dns
I0815 13:40:58.212361 1 dns.go:264] New service: service-alertmanager
I0815 13:40:58.212401 1 dns.go:264] New service: kubernetes-dashboard
I0815 13:40:58.212421 1 dns.go:264] New service: kubernetes
I0815 13:40:58.212471 1 dns.go:264] New service: service-http-backend
I0815 13:40:58.212491 1 dns.go:264] New service: nginx-service
I0815 13:40:58.212511 1 dns.go:264] New service: portal-service
I0815 13:40:58.212532 1 dns.go:264] New service: service-prometheus
I0815 13:45:58.212304 1 dns.go:264] New service: nginx-service
I0815 13:45:58.212396 1 dns.go:264] New service: portal-service
I0815 13:45:58.212421 1 dns.go:264] New service: service-prometheus
I0815 13:45:58.212478 1 dns.go:264] New service: kubernetes
I0815 13:45:58.212537 1 dns.go:264] New service: service-http-backend
I0815 13:45:58.212560 1 dns.go:264] New service: service-alertmanager
I0815 13:45:58.212603 1 dns.go:264] New service: kubernetes-dashboard
I0815 13:45:58.212623 1 dns.go:264] New service: service-grafana
I0815 13:45:58.212662 1 dns.go:264] New service: kube-dns
I0815 13:50:58.212437 1 dns.go:264] New service: kubernetes-dashboard
I0815 13:50:58.212553 1 dns.go:264] New service: service-grafana
I0815 13:50:58.212602 1 dns.go:264] New service: kube-dns
I0815 13:50:58.212657 1 dns.go:264] New service: service-alertmanager
I0815 13:50:58.212697 1 dns.go:264] New service: service-prometheus
I0815 13:50:58.212740 1 dns.go:264] New service: kubernetes
I0815 13:50:58.212778 1 dns.go:264] New service: service-http-backend
I0815 13:50:58.212799 1 dns.go:264] New service: nginx-service
I0815 13:50:58.212818 1 dns.go:264] New service: portal-service
During the periods when these new services are being added, DNS fails in the cluster and services cannot communicate. The networking CNI is Calico.
Scrap that, looks like intended behaviour: https://github.com/kubernetes/dns/blob/master/pkg/dns/dns.go#L52
@Cisneiros @mikesplain I'm also seeing this behaviour on new nodes. Are you running calico using cross-subnet mode or just straight-up ipip?
@ottoyiu We're running cross subnet.
@mikesplain and do you observe it only on new nodes that come up? That's what I'm seeing right now.
@ottoyiu Yes, exactly. I haven't tried rebooting a node, but it's usually about 3-5 minutes before things are okay. I set up a DaemonSet to test (below). The apt-get update almost always has issues for ~5 minutes before finally working.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: dns-test
spec:
  template:
    metadata:
      labels:
        app: dns-test   # label added here; the DaemonSet selector defaults from the template labels and must be non-empty
    spec:
      containers:
      - name: dns-test
        image: ubuntu:16.04
        command:
        - sh
        - -c
        - apt-get update && apt-get install -y curl && sleep 3600
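Deploying and checking it is the usual routine (the file name is a placeholder; app=dns-test matches the template labels above):
kubectl apply -f dns-test.yaml
kubectl get pods -l app=dns-test -o wide        # one pod per node; find the one on the new node
kubectl logs -f <dns-test-pod-on-the-new-node>  # watch for the apt resolution failures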
If it's only happening in cross-subnet mode, then it could be AWS taking its time after the API call to actually disable src/dst checks on the node... or it could be the k8s-ec2-srcdst controller that I wrote taking time to respond to nodes coming up and calling the AWS API. :-1: In both cases, we'd need to mark the node as NotReady or disable scheduling until it is actually ready.
If it's happening in ipip mode as well, then well... something funky is happening with calico.
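If the cross-subnet theory holds, one way to check the timing would be to poll the src/dst check attribute right after a node comes up (instance ID is a placeholder; just a sketch):
aws ec2 describe-instance-attribute --instance-id <i-xxxxxxxxxxxx> --attribute sourceDestCheck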
@ottoyiu Ahh yeah good idea, any ideas on how to debug this further? Let me know if there's anything I can do to help. All our clusters are cross-subnet but I can try to test it in ipip.
@mikesplain If you can test spinning up a cluster using ipip, that'll be great! I'll also do the same to try and isolate the issue.
@ottoyiu Ahh, so I just set up an ipip cluster (kops v1.7.0 & k8s v1.6.2). I wasn't able to replicate the issue.
Then I wanted to do one more test: upgrade the cluster and see what happens. After updating to k8s 1.6.8, I see the issue on the ipip cluster as well.
@mikesplain I think we might be seeing this while routes are still being advertised through iBGP using BIRD. I'm going to check the log timestamps for when the node is marked READY and when other nodes see the routes for it.
As I suspected, the node is marked 'READY' while its BIRD BGP daemon inside calico/node has not yet learnt the routes from its neighbours. It is not until 4-5 minutes in that it establishes connections with its peers and learns their routes.
The route table when the node is marked 'READY':
admin@ip-10-67-79-106:~$ ip route
default via 10.67.64.1 dev eth0
10.67.64.0/19 dev eth0 proto kernel scope link src 10.67.79.106
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
Only 4-5 minutes in does it learn the routes:
admin@ip-10-67-79-106:~$ ip route
default via 10.67.64.1 dev eth0
10.67.64.0/19 dev eth0 proto kernel scope link src 10.67.79.106
100.100.189.0/26 via 10.67.78.252 dev eth0 proto bird
100.117.226.128/26 via 10.67.68.58 dev eth0 proto bird
100.119.202.128/26 via 10.67.82.42 dev eth0 proto bird
blackhole 100.123.242.64/26 proto bird
100.125.28.128/26 via 10.67.84.204 dev eth0 proto bird
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
This timing also lines up quite well with the reports of not being able to resolve hostnames using DNS or connect to anything using a service IP.
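For anyone who wants to reproduce the correlation, a rough sketch (not the exact commands I ran, but equivalent):
# on the freshly booted node: count BIRD-programmed routes as they appear
watch -n 5 'ip route | grep -c "proto bird"'
# from a workstation: when did the node flip to Ready?
kubectl get node <new-node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'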
@caseydavenport have you seen this before, and if so, do you know of a workaround?
@ottoyiu route convergence via BGP should be done in seconds - even at thousand-node scale - not minutes, so that seems unlikely.
If there's a timing delay in programming those routes, it's more likely because either:
The only other thing I can think of is that if you've told Calico to peer with a Node that doesn't exist, it will take 90s before the routes will be programmed due to a graceful-restart timer.
I'd check for configured nodes using calicoctl get nodes and see if there are any that shouldn't be there.
You can also run calicoctl node status to get the BGP state for a given node to see if it has converged.
@caseydavenport thank you very much for your guidance :) much appreciated!
It does seem like it's hitting the 90s graceful-restart timer before the routes are programmed!
/ # bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
bird: device1: Initializing
bird: direct1: Initializing
bird: Mesh_10_67_64_205: Initializing
bird: Mesh_10_67_68_58: Initializing
bird: Mesh_10_67_76_48: Initializing
bird: Mesh_10_67_78_229: Initializing
bird: Mesh_10_67_78_252: Initializing
bird: Mesh_10_67_79_106: Initializing
bird: Mesh_10_67_82_133: Initializing
bird: Mesh_10_67_82_42: Initializing
bird: Mesh_10_67_84_204: Initializing
bird: Mesh_10_67_88_27: Initializing
bird: Mesh_10_67_91_228: Initializing
bird: device1: Starting
bird: device1: Connected to table master
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: Mesh_10_67_64_205: Starting
bird: Mesh_10_67_64_205: State changed to start
bird: Mesh_10_67_68_58: Starting
bird: Mesh_10_67_68_58: State changed to start
bird: Mesh_10_67_76_48: Starting
bird: Mesh_10_67_76_48: State changed to start
bird: Mesh_10_67_78_229: Starting
bird: Mesh_10_67_78_229: State changed to start
bird: Mesh_10_67_78_252: Starting
bird: Mesh_10_67_78_252: State changed to start
bird: Mesh_10_67_79_106: Starting
bird: Mesh_10_67_79_106: State changed to start
bird: Mesh_10_67_82_133: Starting
bird: Mesh_10_67_82_133: State changed to start
bird: Mesh_10_67_82_42: Starting
bird: Mesh_10_67_82_42: State changed to start
bird: Mesh_10_67_84_204: Starting
bird: Mesh_10_67_84_204: State changed to start
bird: Mesh_10_67_88_27: Starting
bird: Mesh_10_67_88_27: State changed to start
bird: Mesh_10_67_91_228: Starting
bird: Mesh_10_67_91_228: State changed to start
bird: Graceful restart started
bird: Started
bird: device1: State changed to up
bird: direct1: State changed to up
bird: Mesh_10_67_82_42: Connected to table master
bird: Mesh_10_67_82_42: State changed to wait
bird: Mesh_10_67_79_106: Connected to table master
bird: Mesh_10_67_79_106: State changed to wait
bird: Mesh_10_67_68_58: Connected to table master
bird: Mesh_10_67_68_58: State changed to wait
bird: Mesh_10_67_78_252: Connected to table master
bird: Mesh_10_67_78_252: State changed to wait
bird: Graceful restart done
bird: Mesh_10_67_68_58: State changed to feed
bird: Mesh_10_67_78_252: State changed to feed
bird: Mesh_10_67_79_106: State changed to feed
bird: Mesh_10_67_82_42: State changed to feed
bird: Mesh_10_67_68_58: State changed to up
bird: Mesh_10_67_78_252: State changed to up
bird: Mesh_10_67_79_106: State changed to up
bird: Mesh_10_67_82_42: State changed to up
Because of the ephemeral nature of AWS instances, new hosts with new hostnames come up while the old hosts still stay around in calico. This makes BIRD unhappy as it tries to peer with a node that no longer exists. :(
@mikesplain This is why you were able to replicate this behaviour after you did a rolling-update.
oyiu@cayvroyiuworkstn [~][(unknown)*] kubectl run -it --image=calico/ctl:v1.2.1 calicoctl --env CALICO_ETCD_ENDPOINTS=http://etcd-1.internal.i0:4001 get node
If you don't see a command prompt, try pressing enter.
Error attaching, falling back to logs: unable to upgrade connection: container calicoctl3 not found in pod calicoctl3-326955638-bt2cr_default
NAME
ip-10-67-64-205
ip-10-67-68-58
ip-10-67-70-86
ip-10-67-76-48
ip-10-67-78-229
ip-10-67-78-252
ip-10-67-79-106
ip-10-67-82-133
ip-10-67-82-42
ip-10-67-84-204
ip-10-67-88-27
ip-10-67-91-228
oyiu@cayvroyiuworkstn [~][(unknown)*] kubectl get nodes
NAME STATUS AGE VERSION
ip-10-67-68-58.us-west-2.compute.internal Ready 5d v1.7.3
ip-10-67-70-86.us-west-2.compute.internal Ready 46m v1.7.3
ip-10-67-78-252.us-west-2.compute.internal Ready 5d v1.7.3
ip-10-67-79-106.us-west-2.compute.internal Ready 1h v1.7.3
ip-10-67-82-42.us-west-2.compute.internal Ready 5d v1.7.3
I'm trying to figure out the best way to tackle this problem. Static BGP route reflectors could help in this case. Alternatively, a controller that watches the Kubernetes API for nodes could delete them from calico as they go offline.
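In the meantime, stale entries can be cleaned up by hand with the same calicoctl approach as above (etcd endpoint and node name are placeholders; sketch only):
kubectl run -it --image=calico/ctl:v1.2.1 calicoctl-cleanup --env CALICO_ETCD_ENDPOINTS=http://<etcd-endpoint>:4001 -- delete node <stale-node-name>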
Oh hey! I found this gem from @caseydavenport :dancing_men: :dancing_women:
https://github.com/caseydavenport/calico-node-controller
I'll see if I have some time to test it out and production-ize it.
Relevant Issue:
https://github.com/projectcalico/canal/issues/11
@ottoyiu ahh yeah great point! Great job digging into this!
I also tested this against https://github.com/kubernetes/kops/pull/3162 to make sure that recent change plus those in master wouldn't make a difference. As expected, the issue persisted. I'll take a look at the node-controller in a bit.