Hello,
I'm experiencing some DNS issues with calico on kops 1.6 and Kubernetes 1.6.2 (AWS cloud).
Steps to reproduce:
1- Create a cluster from scratch. I've used this command:
kops create cluster --zones eu-central-1b --node-size t2.medium --master-size t2.medium --vpc vpc-38094c51 --networking calico $NAME
2- After the cluster is created, kube-dns fails because the configure-calico job fails as well:
kube-system kube-dns-1321724180-gdfjs 0/3 ContainerCreating 0 16m
kube-system kube-dns-autoscaler-265231812-cc19z 0/1 ContainerCreating 0 16m
kubectl get jobs --all-namespaces
NAMESPACE NAME DESIRED SUCCESSFUL AGE
kube-system configure-calico 1 0 17m
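For anyone reproducing this, the following should surface why the job and the pending kube-dns pods are stuck (names taken from the output above; adjust to your cluster):
kubectl -n kube-system describe job configure-calico          # events for the job's failed pods
kubectl -n kube-system logs job/configure-calico              # job pod logs, if the pod got far enough to start
kubectl -n kube-system describe pod kube-dns-1321724180-gdfjs # usually shows the CNI/sandbox error keeping it in ContainerCreating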
3- Executing kops rolling-update cluster --force --yes eventually solves this problem:
kube-system kube-dns-1321724180-nnjwf 3/3 Running 0 4m
kube-system kube-dns-1321724180-s2p7j 3/3 Running 0 4m
kubectl get jobs --all-namespaces
NAMESPACE NAME DESIRED SUCCESSFUL AGE
kube-system configure-calico 1 1 42m
4- At this point, creating a new pod causes temporary DNS resolution failures that resolve themselves some time after the pod is created (in fact the kube-dns pods are running fine, as are the calico ones):
root@test-803530966-gvhc6:/# apt update
Err:1 http://security.ubuntu.com/ubuntu xenial-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:2 http://archive.ubuntu.com/ubuntu xenial InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
root@test-803530966-gvhc6:/# apt update
Err:1 http://archive.ubuntu.com/ubuntu xenial InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu xenial-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Get:4 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [102 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [4927 B]
Get:6 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [6022 B]
Fetched 113 kB in 40s (2814 B/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease Temporary failure resolving 'security.ubuntu.com'
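To separate "cluster DNS is broken" from "outbound traffic is broken", a quick check from inside an affected pod (assuming nslookup is available in the image) would be something like:
cat /etc/resolv.conf                            # nameserver should be the kube-dns ClusterIP (100.64.0.10 on a default kops service CIDR)
nslookup kubernetes.default.svc.cluster.local   # internal name, answered by kube-dns
nslookup archive.ubuntu.com                     # external name, forwarded upstream by kube-dns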
While this happens, there's nothing in the kube-dns logs indicating that something went wrong:
kubectl logs kube-dns-1321724180-nnjwf --namespace=kube-system kubedns
I0531 08:17:31.000131 1 dns.go:49] version: v1.5.2-beta.0+$Format:%h$
I0531 08:17:31.000712 1 server.go:70] Using configuration read from directory: /kube-dns-config%!(EXTRA time.Duration=10s)
I0531 08:17:31.000764 1 server.go:112] FLAG: --alsologtostderr="false"
I0531 08:17:31.000777 1 server.go:112] FLAG: --config-dir="/kube-dns-config"
I0531 08:17:31.000783 1 server.go:112] FLAG: --config-map=""
I0531 08:17:31.000786 1 server.go:112] FLAG: --config-map-namespace="kube-system"
I0531 08:17:31.000790 1 server.go:112] FLAG: --config-period="10s"
I0531 08:17:31.000794 1 server.go:112] FLAG: --dns-bind-address="0.0.0.0"
I0531 08:17:31.000796 1 server.go:112] FLAG: --dns-port="10053"
I0531 08:17:31.000802 1 server.go:112] FLAG: --domain="cluster.local."
I0531 08:17:31.000815 1 server.go:112] FLAG: --federations=""
I0531 08:17:31.000819 1 server.go:112] FLAG: --healthz-port="8081"
I0531 08:17:31.000846 1 server.go:112] FLAG: --initial-sync-timeout="1m0s"
I0531 08:17:31.000850 1 server.go:112] FLAG: --kube-master-url=""
I0531 08:17:31.000856 1 server.go:112] FLAG: --kubecfg-file=""
I0531 08:17:31.000858 1 server.go:112] FLAG: --log-backtrace-at=":0"
I0531 08:17:31.002575 1 server.go:112] FLAG: --log-dir=""
I0531 08:17:31.002595 1 server.go:112] FLAG: --log-flush-frequency="5s"
I0531 08:17:31.002610 1 server.go:112] FLAG: --logtostderr="true"
I0531 08:17:31.002625 1 server.go:112] FLAG: --nameservers=""
I0531 08:17:31.002638 1 server.go:112] FLAG: --stderrthreshold="2"
I0531 08:17:31.002650 1 server.go:112] FLAG: --v="2"
I0531 08:17:31.002683 1 server.go:112] FLAG: --version="false"
I0531 08:17:31.002702 1 server.go:112] FLAG: --vmodule=""
I0531 08:17:31.002766 1 server.go:175] Starting SkyDNS server (0.0.0.0:10053)
I0531 08:17:31.003038 1 server.go:197] Skydns metrics enabled (/metrics:10055)
I0531 08:17:31.003059 1 dns.go:147] Starting endpointsController
I0531 08:17:31.003071 1 dns.go:150] Starting serviceController
I0531 08:17:31.003593 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0531 08:17:31.003611 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0531 08:17:31.030806 1 dns.go:264] New service: kubernetes
I0531 08:17:31.030909 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.030948 1 dns.go:264] New service: kube-dns
I0531 08:17:31.030974 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.031105 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.503315 1 dns.go:171] Initialized services and endpoints from apiserver
I0531 08:17:31.503337 1 server.go:128] Setting up Healthz Handler (/readiness)
I0531 08:17:31.503343 1 server.go:133] Setting up cache handler (/cache)
I0531 08:17:31.503348 1 server.go:119] Status HTTP port 8081
I0531 08:22:31.004225 1 dns.go:264] New service: kubernetes
I0531 08:22:31.004457 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:22:31.004583 1 dns.go:264] New service: kube-dns
I0531 08:22:31.004634 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:22:31.004673 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004378 1 dns.go:264] New service: kubernetes
I0531 08:27:31.004467 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004502 1 dns.go:264] New service: kube-dns
I0531 08:27:31.004528 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004549 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
kubectl logs kube-dns-1321724180-nnjwf --namespace=kube-system dnsmasq
I0531 08:17:36.426853 1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0531 08:17:36.426935 1 nanny.go:86] Starting dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053]
I0531 08:17:36.512105 1 nanny.go:111]
W0531 08:17:36.512226 1 nanny.go:112] Got EOF from stdout
I0531 08:17:36.512197 1 nanny.go:108] dnsmasq[11]: started, version 2.76 cachesize 1000
I0531 08:17:36.512253 1 nanny.go:108] dnsmasq[11]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0531 08:17:36.512264 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0531 08:17:36.512268 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0531 08:17:36.512271 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0531 08:17:36.512277 1 nanny.go:108] dnsmasq[11]: reading /etc/resolv.conf
I0531 08:17:36.512281 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0531 08:17:36.512284 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0531 08:17:36.512289 1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0531 08:17:36.512292 1 nanny.go:108] dnsmasq[11]: using nameserver 10.0.0.2#53
I0531 08:17:36.512296 1 nanny.go:108] dnsmasq[11]: read /etc/hosts - 7 addresses
kubectl logs kube-dns-1321724180-nnjwf --namespace=kube-system sidecar
ERROR: logging before flag.Parse: I0531 08:17:39.156690 1 main.go:49] Version v1.14.0-2-g38b2ca3
ERROR: logging before flag.Parse: I0531 08:17:39.156729 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0531 08:17:39.156811 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0531 08:17:39.156881 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
@ottoyiu any ideas?
@lc4ntoni-centralway This seems like a service-account race condition that was fixed by #2590 and #2641, which are not part of a kops release yet. Just to confirm, are you using the tagged release of kops 1.6?
@ottoyiu thanks for replying. I'm using the kops version shipped in brew. sha1:
5a66319cf9f943015f10ae24764e899cbe5d5445 /usr/local/Cellar/kops/1.6.0/bin/kops
Not ideal, but try building kops from master and see if that solves it; I'm pretty sure it's no longer a problem there.
@justinsb is working on a 1.6.1 release which will include the fixes.
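Once 1.6.1 is out, the upgrade path should be the usual one (sketch, adjust $NAME to your cluster):
brew upgrade kops
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes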
Alright, I understand the fix needs to be included in the next release. I'll close this, then... thanks a lot for your support.
Hi guys! I'm running kops 1.6.2 (git-469d82d0b) and created a Kubernetes 1.6.6 cluster with calico and I'm facing the same issues. Has this been fixed yet?
Just to add more context: I'm running CentOS 7. I also noticed that the issue seems to occur on recently started nodes (e.g. after the cluster-autoscaler scales up).
@lc4ntoni-centralway was your problem solved?
We're seeing this too but haven't been able to pin it down. We saw it most recently on a cluster created with kops v1.7.0 and k8s v1.6.2.
Was it DNS that wasn't working, or were Service IPs also affected? The reason I ask is that DNS is resolved through kube-dns's service IP.
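A quick way to tell the two apart (the pod IP is a placeholder, and 100.64.0.10 assumes the default kops service CIDR) is to query a kube-dns pod IP directly and then the service IP; if the pod answers but the service IP doesn't, it's the service path rather than kube-dns itself:
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide         # grab a kube-dns pod IP
dig +short kubernetes.default.svc.cluster.local @<kube-dns-pod-ip>  # direct to the pod
dig +short kubernetes.default.svc.cluster.local @100.64.0.10        # via the service IP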
@ottoyiu OK, so I tested for that. Apparently Service IPs are not working. I wrote a simple container that tries to wget my-nginx-service-ip over and over until it responds. It failed for a few requests (timeouts), and about 2-3 minutes after starting it succeeded.
This pod was run on a recently started node. Do these diagnostics give you any ideas?
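Roughly, the loop inside the container looks like this (the service IP is a placeholder):
SVC_IP=<my-nginx-service-ip>
while true; do
  date
  wget -q -T 2 -O /dev/null "http://${SVC_IP}/" && echo ok || echo timeout
  sleep 5
done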
@chrislovecnm can we reopen this since a number of us are still seeing this? I'm seeing the same behavior as @Cisneiros.
+1 for this on CentOS 7 with k8s 1.6.7. Services are continually re-registered every 5 minutes like @lc4ntoni-centralway reports. Other containers in the kube-dns pod report no errors.
I0815 13:40:58.212189 1 dns.go:264] New service: service-grafana
I0815 13:40:58.212305 1 dns.go:264] New service: kube-dns
I0815 13:40:58.212361 1 dns.go:264] New service: service-alertmanager
I0815 13:40:58.212401 1 dns.go:264] New service: kubernetes-dashboard
I0815 13:40:58.212421 1 dns.go:264] New service: kubernetes
I0815 13:40:58.212471 1 dns.go:264] New service: service-http-backend
I0815 13:40:58.212491 1 dns.go:264] New service: nginx-service
I0815 13:40:58.212511 1 dns.go:264] New service: portal-service
I0815 13:40:58.212532 1 dns.go:264] New service: service-prometheus
I0815 13:45:58.212304 1 dns.go:264] New service: nginx-service
I0815 13:45:58.212396 1 dns.go:264] New service: portal-service
I0815 13:45:58.212421 1 dns.go:264] New service: service-prometheus
I0815 13:45:58.212478 1 dns.go:264] New service: kubernetes
I0815 13:45:58.212537 1 dns.go:264] New service: service-http-backend
I0815 13:45:58.212560 1 dns.go:264] New service: service-alertmanager
I0815 13:45:58.212603 1 dns.go:264] New service: kubernetes-dashboard
I0815 13:45:58.212623 1 dns.go:264] New service: service-grafana
I0815 13:45:58.212662 1 dns.go:264] New service: kube-dns
I0815 13:50:58.212437 1 dns.go:264] New service: kubernetes-dashboard
I0815 13:50:58.212553 1 dns.go:264] New service: service-grafana
I0815 13:50:58.212602 1 dns.go:264] New service: kube-dns
I0815 13:50:58.212657 1 dns.go:264] New service: service-alertmanager
I0815 13:50:58.212697 1 dns.go:264] New service: service-prometheus
I0815 13:50:58.212740 1 dns.go:264] New service: kubernetes
I0815 13:50:58.212778 1 dns.go:264] New service: service-http-backend
I0815 13:50:58.212799 1 dns.go:264] New service: nginx-service
I0815 13:50:58.212818 1 dns.go:264] New service: portal-service
During the periods when these new services are being added, DNS fails in the cluster and services cannot communicate. The networking CNI is Calico.
Scrap that, looks like intended behaviour: https://github.com/kubernetes/dns/blob/master/pkg/dns/dns.go#L52
@Cisneiros @mikesplain I'm also seeing this behaviour on new nodes. Are you running calico using cross-subnet mode or just straight-up ipip?
@ottoyiu We're running cross subnet.
@mikesplain and do you observe it only on new nodes that come up? That's what I'm seeing right now.
@ottoyiu Yes, exactly. I haven't tried rebooting a node, but it's usually about 3-5 minutes before things are okay. I set up a DaemonSet to test (below). The apt-get update almost always has issues for ~5 minutes before finally working.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: dns-test
spec:
  template:
    metadata:
      labels:
        app: dns-test   # label added here; the DaemonSet selector defaults from the template labels and must be non-empty
    spec:
      containers:
      - name: dns-test
        image: ubuntu:16.04
        command:
        - sh
        - -c
        - apt-get update && apt-get install -y curl && sleep 3600
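Deploying and checking it is the usual routine (the file name is a placeholder; app=dns-test matches the template labels above):
kubectl apply -f dns-test.yaml
kubectl get pods -l app=dns-test -o wide        # one pod per node; find the one on the new node
kubectl logs -f <dns-test-pod-on-the-new-node>  # watch for the apt resolution failures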
If it's only happening in cross-subnet mode, then it could be AWS taking its time after the API call to actually disable src/dst checks on the node... or it could be the k8s-ec2-srcdst controller that I wrote taking time to respond to nodes coming up and calling the AWS API. :-1: In both cases, we'd need to mark the node as NotReady or disable scheduling until it is actually ready.
If it's happening in ipip mode as well, then well... something funky is happening with calico.
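If the cross-subnet theory holds, one way to check the timing would be to poll the src/dst check attribute right after a node comes up (instance ID is a placeholder; just a sketch):
aws ec2 describe-instance-attribute --instance-id <i-xxxxxxxxxxxx> --attribute sourceDestCheck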
@ottoyiu Ahh yeah good idea, any ideas on how to debug this further? Let me know if there's anything I can do to help. All our clusters are cross-subnet but I can try to test it in ipip.
@mikesplain If you can test spinning up a cluster using ipip, that'll be great! I'll also do the same to try and isolate the issue.
@ottoyiu Ahh, so I just set up an ipip cluster (kops v1.7.0 & k8s v1.6.2). I wasn't able to replicate the issue.
Then I wanted to do one more test: upgrade the cluster and see what happens. After updating to k8s 1.6.8, I see the issue on the ipip cluster as well.
@mikesplain I think we might be seeing this while routes are still being advertised through iBGP using BIRD. I'm going to check the log timestamps for when the node is marked READY and when other nodes see the routes for it.
As I suspected, the node is marked 'READY' while its BIRD BGP daemon inside calico/node has not yet learnt the routes from its neighbours. It is not until 4-5 minutes in that it establishes connections with its peers and learns their routes.
The route table when the node is marked 'READY':
admin@ip-10-67-79-106:~$ ip route
default via 10.67.64.1 dev eth0
10.67.64.0/19 dev eth0 proto kernel scope link src 10.67.79.106
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
Only 4-5 minutes in does it learn the routes:
admin@ip-10-67-79-106:~$ ip route
default via 10.67.64.1 dev eth0
10.67.64.0/19 dev eth0 proto kernel scope link src 10.67.79.106
100.100.189.0/26 via 10.67.78.252 dev eth0 proto bird
100.117.226.128/26 via 10.67.68.58 dev eth0 proto bird
100.119.202.128/26 via 10.67.82.42 dev eth0 proto bird
blackhole 100.123.242.64/26 proto bird
100.125.28.128/26 via 10.67.84.204 dev eth0 proto bird
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
This timing also lines up quite well with the reports of not being able to resolve hostnames using DNS or connect to anything using a service IP.
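For anyone who wants to reproduce the correlation, a rough sketch (not the exact commands I ran, but equivalent):
# on the freshly booted node: count BIRD-programmed routes as they appear
watch -n 5 'ip route | grep -c "proto bird"'
# from a workstation: when did the node flip to Ready?
kubectl get node <new-node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'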
@caseydavenport have you seen this before, and if so, do you know of a workaround?
@ottoyiu route convergence via BGP should be done in seconds - even at thousand-node scale - not minutes, so that seems unlikely.
If there's a timing delay in programming those routes, it's more likely because either:
The only other thing I can think of is that if you've told Calico to peer with a Node that doesn't exist, it will take 90s before the routes will be programmed due to a graceful-restart timer.
I'd check for configured nodes using calicoctl get nodes and see if there are any that shouldn't be there.
You can also run calicoctl node status to get the BGP state for a given node to see if it has converged.
@caseydavenport thank you very much for your guidance :) much appreciated!
It does seem like it's hitting the 90s graceful-restart timer before the routes are programmed!
/ # bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
bird: device1: Initializing
bird: direct1: Initializing
bird: Mesh_10_67_64_205: Initializing
bird: Mesh_10_67_68_58: Initializing
bird: Mesh_10_67_76_48: Initializing
bird: Mesh_10_67_78_229: Initializing
bird: Mesh_10_67_78_252: Initializing
bird: Mesh_10_67_79_106: Initializing
bird: Mesh_10_67_82_133: Initializing
bird: Mesh_10_67_82_42: Initializing
bird: Mesh_10_67_84_204: Initializing
bird: Mesh_10_67_88_27: Initializing
bird: Mesh_10_67_91_228: Initializing
bird: device1: Starting
bird: device1: Connected to table master
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: Mesh_10_67_64_205: Starting
bird: Mesh_10_67_64_205: State changed to start
bird: Mesh_10_67_68_58: Starting
bird: Mesh_10_67_68_58: State changed to start
bird: Mesh_10_67_76_48: Starting
bird: Mesh_10_67_76_48: State changed to start
bird: Mesh_10_67_78_229: Starting
bird: Mesh_10_67_78_229: State changed to start
bird: Mesh_10_67_78_252: Starting
bird: Mesh_10_67_78_252: State changed to start
bird: Mesh_10_67_79_106: Starting
bird: Mesh_10_67_79_106: State changed to start
bird: Mesh_10_67_82_133: Starting
bird: Mesh_10_67_82_133: State changed to start
bird: Mesh_10_67_82_42: Starting
bird: Mesh_10_67_82_42: State changed to start
bird: Mesh_10_67_84_204: Starting
bird: Mesh_10_67_84_204: State changed to start
bird: Mesh_10_67_88_27: Starting
bird: Mesh_10_67_88_27: State changed to start
bird: Mesh_10_67_91_228: Starting
bird: Mesh_10_67_91_228: State changed to start
bird: Graceful restart started
bird: Started
bird: device1: State changed to up
bird: direct1: State changed to up
bird: Mesh_10_67_82_42: Connected to table master
bird: Mesh_10_67_82_42: State changed to wait
bird: Mesh_10_67_79_106: Connected to table master
bird: Mesh_10_67_79_106: State changed to wait
bird: Mesh_10_67_68_58: Connected to table master
bird: Mesh_10_67_68_58: State changed to wait
bird: Mesh_10_67_78_252: Connected to table master
bird: Mesh_10_67_78_252: State changed to wait
bird: Graceful restart done
bird: Mesh_10_67_68_58: State changed to feed
bird: Mesh_10_67_78_252: State changed to feed
bird: Mesh_10_67_79_106: State changed to feed
bird: Mesh_10_67_82_42: State changed to feed
bird: Mesh_10_67_68_58: State changed to up
bird: Mesh_10_67_78_252: State changed to up
bird: Mesh_10_67_79_106: State changed to up
bird: Mesh_10_67_82_42: State changed to up
Because of the ephemeral nature of AWS instances, new hosts with new hostnames come up while the old hosts still stay around in calico. This makes BIRD unhappy as it tries to peer with a node that no longer exists. :(
@mikesplain This is why you were able to replicate this behaviour after you did a rolling-update.
oyiu@cayvroyiuworkstn [~][(unknown)*] kubectl run -it --image=calico/ctl:v1.2.1 calicoctl --env CALICO_ETCD_ENDPOINTS=http://etcd-1.internal.i0:4001 get node
If you don't see a command prompt, try pressing enter.
Error attaching, falling back to logs: unable to upgrade connection: container calicoctl3 not found in pod calicoctl3-326955638-bt2cr_default
NAME
ip-10-67-64-205
ip-10-67-68-58
ip-10-67-70-86
ip-10-67-76-48
ip-10-67-78-229
ip-10-67-78-252
ip-10-67-79-106
ip-10-67-82-133
ip-10-67-82-42
ip-10-67-84-204
ip-10-67-88-27
ip-10-67-91-228
oyiu@cayvroyiuworkstn [~][(unknown)*] kubectl get nodes
NAME STATUS AGE VERSION
ip-10-67-68-58.us-west-2.compute.internal Ready 5d v1.7.3
ip-10-67-70-86.us-west-2.compute.internal Ready 46m v1.7.3
ip-10-67-78-252.us-west-2.compute.internal Ready 5d v1.7.3
ip-10-67-79-106.us-west-2.compute.internal Ready 1h v1.7.3
ip-10-67-82-42.us-west-2.compute.internal Ready 5d v1.7.3
I'm trying to figure out the best way to tackle this problem. Static BGP route reflectors could help in this case. Alternatively, a controller that watches the Kubernetes API for nodes could delete them from calico as they go offline.
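In the meantime, stale entries can be cleaned up by hand with the same calicoctl approach as above (etcd endpoint and node name are placeholders; sketch only):
kubectl run -it --image=calico/ctl:v1.2.1 calicoctl-cleanup --env CALICO_ETCD_ENDPOINTS=http://<etcd-endpoint>:4001 -- delete node <stale-node-name>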
Oh hey! I found this gem from @caseydavenport :dancing_men: :dancing_women:
https://github.com/caseydavenport/calico-node-controller
I'll see if I have some time to test it out and production-ize it.
Relevant Issue:
https://github.com/projectcalico/canal/issues/11
@ottoyiu ahh yeah great point! Great job digging into this!
I also tested this against https://github.com/kubernetes/kops/pull/3162 to make sure that recent change plus those in master wouldn't make a difference. As expected, the issue persisted. I'll take a look at the node-controller in a bit.