/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-K8s/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:312
DNS entry is not ready after timeout
Expected
<*errors.errorString | 0xc420049cb0>: {
s: "Timeout reached: DNS \"redis-master.default.svc.cluster.local\" is not ready after timeout",
}
to be nil
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-K8s/src/github.com/cilium/cilium/test/k8sT/Policies.go:723
This is happening for a variety of builds in master as well:
Marking as high priority due to the high rate of failure in the CI.
Also encountered on k8s 1.7: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/2634/
Looking at this a little bit. It looks like kubedns just didn't have the service in it. This is really odd and doesn't make any sense :/ We recently added some extra digs on fail and they show that the connection to the DNS server is running:
time="2018-04-24T15:07:41Z" level=debug msg="running command: dig +short redis-master.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
cmd: "dig +short redis-master.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'" exitCode: 1
time="2018-04-24T15:07:41Z" level=debug msg="running command: dig +tcp redis-master.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp redis-master.default.svc.cluster.local @10.96.0.10" exitCode: 0
; <<>> DiG 9.10.3-P4-Ubuntu <<>> +tcp redis-master.default.svc.cluster.local @10.96.0.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5477
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;redis-master.default.svc.cluster.local. IN A
;; AUTHORITY SECTION:
cluster.local. 60 IN SOA ns.dns.cluster.local. hostmaster.cluster.local. 1524582000 28800 7200 604800 60
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Apr 24 15:07:41 UTC 2018
;; MSG SIZE rcvd: 149
both ciliums have the correct service IP and backend IP in the bfp lb map (10.97.187.12 -> 10.10.0.86) so somewhere a service object was created and we saw it.
kubedns show some errors but the latter one is during tear-down because of the failure. The former is odd but it's for kafka not redis :/
I0424 14:45:42.338095 1 dns.go:576] Could not find endpoints for service "kafka-service" in namespace "default". DNS records will be created once endpoints show up.
E0424 15:08:01.323029 1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.10.0.66:49742->10.96.0.1:443: read: connection timed out
Note: I noticed that our cilium bpf lb list always has a 0.0.0.0 (0) IP listed. What is it?
@raybejjani the "0.0.0.0 (0)" entry in cilium bpf lb list is normal, it is used in the BPF to hold the number of other service backends that there are (I investigated briefly in https://github.com/cilium/cilium/issues/3905).
Questions about cause / things we need:
1) I saw the failures, and no policies and no log that say that. (It should have message)https://stackoverflow.com/questions/48264217/kube-dns-cannot-find-api-server
3) I review few cases, and no policies was applied to Kube-dns (Added PR for more debug)
4) It shows that connects and Kube-dns reply nothing.
2) PR is coming.
Any update on this issue? It's been hit again in the CI: https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/920/testReport/k8s-1/7/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/
Failure output:
/home/jenkins/workspace/cilium-ginkgo_cilium_master-QVKDPVJ2A6E272IRUX7X6DCB5FBLCQDKRTOMHPZS7ZECX3RM5C3A/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:373
DNS entry is not ready after timeout
Expected
<*errors.errorString | 0xc420a8c0d0>: {
s: "Timeout reached: DNS \"redis-master.default.svc.cluster.local\" is not ready after timeout",
}
to be nil
/home/jenkins/workspace/cilium-ginkgo_cilium_master-QVKDPVJ2A6E272IRUX7X6DCB5FBLCQDKRTOMHPZS7ZECX3RM5C3A/src/github.com/cilium/cilium/vendor/github.com/onsi/gomega/internal/assertion/assertion.go:35
Stdout:
STEP: Apply policy to web
STEP: Apply policy to Redis
===================== TEST FAILED =====================
cmd: kubectl get pods -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default guestbook-5ckt5 1/1 Running 0 5m 10.15.55.223 k8s2
default redis-master-xqmwg 1/1 Running 0 5m 10.10.137.195 k8s1
default redis-slave-bm06s 1/1 Running 0 5m 10.10.120.29 k8s1
kube-system cilium-5jm29 1/1 Running 0 14m 192.168.36.11 k8s1
kube-system cilium-8qm0c 1/1 Running 0 14m 192.168.36.12 k8s2
kube-system etcd-k8s1 1/1 Running 0 20m 192.168.36.11 k8s1
kube-system kube-apiserver-k8s1 1/1 Running 0 21m 192.168.36.11 k8s1
kube-system kube-controller-manager-k8s1 1/1 Running 0 20m 192.168.36.11 k8s1
kube-system kube-dns-1992960195-zfx3q 3/3 Running 0 21m 10.10.197.21 k8s1
kube-system kube-proxy-bf0v2 1/1 Running 0 21m 192.168.36.11 k8s1
kube-system kube-proxy-p2191 1/1 Running 0 14m 192.168.36.12 k8s2
kube-system kube-scheduler-k8s1 1/1 Running 0 21m 192.168.36.11 k8s1
cmd: kubectl exec -n kube-system cilium-5jm29 -- cilium service list
ID Frontend Backend
1 10.96.0.1:443 1 => 192.168.36.11:6443
2 10.96.0.10:53 1 => 10.10.197.21:53
6 10.110.168.131:6379 1 => 10.10.137.195:6379
7 10.99.70.197:6379 1 => 10.10.120.29:6379
cmd: kubectl exec -n kube-system cilium-5jm29 -- cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
9149 Enabled Disabled 53730 k8s:io.kubernetes.pod.namespace=default f00d::a0a:0:0:23bd 10.10.137.195 ready
k8s:k8s-app.guestbook=redis
k8s:role=master
k8s:zgroup=guestbook
43334 Enabled Disabled 47349 k8s:io.kubernetes.pod.namespace=default f00d::a0a:0:0:a946 10.10.120.29 ready
k8s:k8s-app.guestbook=redis
k8s:role=slave
k8s:zgroup=guestbook
55159 Disabled Disabled 52387 k8s:io.kubernetes.pod.namespace=kube-system f00d::a0a:0:0:d777 10.10.197.21 ready
k8s:k8s-app=kube-dns
61892 Disabled Disabled 4 reserved:health f00d::a0a:0:0:f1c4 10.10.131.135 ready
cmd: kubectl exec -n kube-system cilium-8qm0c -- cilium service list
ID Frontend Backend
1 10.96.0.1:443 1 => 192.168.36.11:6443
2 10.96.0.10:53 1 => 10.10.197.21:53
6 10.110.168.131:6379 1 => 10.10.137.195:6379
7 10.99.70.197:6379 1 => 10.10.120.29:6379
cmd: kubectl exec -n kube-system cilium-8qm0c -- cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
55704 Enabled Disabled 35471 k8s:io.kubernetes.pod.namespace=default f00d::a0f:0:0:d998 10.15.55.223 ready
k8s:k8s-app.guestbook=web
k8s:zgroup=guestbook
55943 Disabled Disabled 4 reserved:health f00d::a0f:0:0:da87 10.15.30.136 ready
===================== EXITING REPORT GENERATION =====================
Logs:
9acb0442_K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example.zip
None, I'm afraid. I was hoping it was a CPU load issue, and would go away once we stopped running monitor. Alas, I am wrong.
Worse, the kube-dns "verbose" output seems no more verbose than before...
Ironically, we thought misconfigured podCIDR might have caused this. clearly not! (the PR fixed the podCIDR)
tl;dr kubedns sees a 5 minute no-data period that only ends after resyncPeriod ends (it's hardocded to 5 minutes). I'll make a PR to paper over this by upping our timeout but it would be better to fix the real issue :/
@aanm found another failure but this time it had DNS logs! The problem is related to kubedns timing-out on events from kube-apiserver. We still don't know why it happened.
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/3781/testReport/junit/k8s-1/10/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/
18d2fe90_K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example.zip
It happened on K8s 1.10 so it isn't restricted to 1.7.
The kubedns pod is on k8s1 with cilium-qk6bq:
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
19190 Disabled Disabled 13221 k8s:io.kubernetes.pod.namespace=kube-system f00d::a0a:0:0:4af6 10.10.0.165 ready
k8s:k8s-app=kube-dns
Looking at the logs (mostly kubedns and cilium-qk6bq):
2018-06-04T11:49:10ZSTEP: Apply policy to web
time="2018-06-04T11:49:31Z" level=info msg="Performing apply action on resource '/home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/guestbook-policy-web.yaml'" testName=K8sPolicyTest
time="2018-06-04T11:49:31Z" level=debug msg="running command: kubectl exec -n kube-system cilium-qk6bq -- cilium policy get -o json"
STEP: Apply policy to Redis
time="2018-06-04T11:49:40Z" level=info msg="Performing apply action on resource '/home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/guestbook-policy-redis.json'" testName=K8sPolicyTest
time="2018-06-04T11:49:40Z" level=debug msg="running command: kubectl exec -n kube-system cilium-qk6bq -- cilium policy get -o json"
2018-06-04T11:49:41.167430422Z level=debug msg="Endpoint policy update skipped because no changes were needed" code=OK containerID=5ea1eaa11f endpointID=19190 endpointState=ready ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=46 type=100
2018-06-04T11:53:36.07765431Z level=debug msg="BPF header file hashed (was: \"17fc49a46d4c1ce84cbb323336feef6c\")" bpfHeaderfileHash=17fc49a46d4c1ce84cbb323336feef6c containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077723112Z level=debug msg="Waiting for proxy updates to complete..." containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077767388Z level=debug msg="Wait time for proxy updates: 59.356µs" containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077812572Z level=debug msg="BPF header file unchanged, skipping BPF compilation and installation" bpfHeaderfileHash=17fc49a46d4c1ce84cbb323336feef6c containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077854667Z level=debug msg="Waiting for proxy updates to complete..." containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077897554Z level=debug msg="Wait time for proxy updates: 54.908µs" containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077961236Z level=info msg="Regeneration of BPF program has completed" buildDuration=17.821486ms containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.078055779Z level=info msg="Endpoint policy recalculated" containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.078106029Z level=debug msg="Completed endpoint regeneration with no pending regeneration requests" code=OK containerID=5ea1eaa11f endpointID=19190 endpointState=ready ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49 type=0
2018-06-04T11:53:36.078166286Z level=debug msg="Successfully regenerated endpoint program due to endpoint policy updated & changes were needed" code=OK containerID=5ea1eaa11f endpointID=19190 endpointState=ready ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49 type=200
2018-06-04T11:53:36.080540049Z level=debug msg="Shutting down controller for stopForUpdate; skipping stop function" name=sync-policymap-19190 subsys=controller uuid=e9b3f7d6-67ed-11e8-a0eb-080027aa233b
2018-06-04T11:54:48ZE0604 11:55:04.992153 1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.10.0.165:47362->10.96.0.1:443: read: connection timed out
E0604 11:55:04.992530 1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.10.0.165:47362->10.96.0.1:443: read: connection timed out
I0604 11:55:05.025262 1 dns.go:304] New service: redis-master
I0604 11:55:05.025266 1 dns.go:305] Service details: &Service{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:redis-master,GenerateName:,Namespace:default,SelfLink:/api/v1/namespaces/default/services/redis-master,UID:4b9bc3d6-67ed-11e8-9d23-080027aa233b,ResourceVersion:7500,Generation:0,CreationTimestamp:2018-06-04 11:49:10 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{k8s-app.guestbook: redis,role: master,zgroup: guestbook,},Annotations:map[string]string{kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"k8s-app.guestbook":"redis","role":"master","zgroup":"guestbook"},"name":"redis-master","namespace":"default"},"spec":{"ports":[{"port":6379,"targetPort":"redis-server"}],"selector":{"k8s-app.guestbook":"redis","role":"master"}}}
,},OwnerReferences:[],Finalizers:[],ClusterName:,},Spec:ServiceSpec{Ports:[{ TCP 6379 {1 0 redis-server} 0}],Selector:map[string]string{k8s-app.guestbook: redis,role: master,},ClusterIP:10.104.17.194,Type:ClusterIP,ExternalIPs:[],DeprecatedPublicIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[],},},}
I0604 11:55:05.025299 1 util.go:68] Constructed new DNS record: &{10.104.17.194 0 10 10 false 30 0 }, hash:aa58de7b
I0604 11:55:05.025310 1 util.go:68] Constructed new DNS record: &{redis-master.default.svc.cluster.local. 0 10 10 false 30 0 }, hash:606d0bd5
:`( I guess I was wrong then. Failure 2717 waited 8 minutes for DNS to update... so #4370 has only slowed down the failing.
Sigh... this is so bizarre. What if we try kicking / restarting kube-dns up to 3 times until it gets into a good state? We can see how long it usually takes for DNS to resolve when kube-dns is in a good state, and then only wait a bit longer than that so we don't wait for say 15 minutes before failing tests.
More issues on this one:
https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/981/testReport/junit/k8s-1/7/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/
https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2730/testReport/junit/k8s-1/7/K8sValidatedUpdates_Updating_Cilium_stable_to_master/
https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2726/testReport/junit/k8s-1/7/K8sValidatedUpdates_Updating_Cilium_stable_to_master/
I gave up trying to figure this out directly. I'm trying to fixup the monitor cpu usage so we can turn it back on in tests (that was why we turned it off, right?) Perhaps we'll see something there that makes sense :/
We could kick kubedns repeatedly. We could do it before each test even... @ianvernon Do you know why the dns manifest has cpu/memory request limits? They are quite small (there are no absolute limits so I think this just slows down how quickly it can burst, but I'm not sure).
I've just checked on the update one:
And looks like that after the update cilium service didn't is correct at all.
time="2018-06-07T06:43:00Z" level=debug msg="running command: dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'"
cmd: "dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'" exitCode: 1
time="2018-06-07T06:43:15Z" level=debug msg="running command: dig +tcp app1-service.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp app1-service.default.svc.cluster.local @10.96.0.10" exitCode: 9
;; Connection to 10.96.0.10#53(10.96.0.10) for app1-service.default.svc.cluster.local failed: connection refused.
time="2018-06-07T06:43:16Z" level=debug msg="running command: dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'"
cmd: "dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'" exitCode: 1
time="2018-06-07T06:43:31Z" level=debug msg="running command: dig +tcp app1-service.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp app1-service.default.svc.cluster.local @10.96.0.10" exitCode: 9
;; Connection to 10.96.0.10#53(10.96.0.10) for app1-service.default.svc.cluster.local failed: connection refused.
Around the DNS issue, the build has the problem with kube-dns:
f0c5341e_K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example.zip
Hmm in #4405 I saw a failure that's wierd.
f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master (1).zip
dig returns 9, which is no reply from server. Oddly there are 15s between the dig commands. In our code they happen back to back. This is really fishy. Edit: nevermind, this was just 3 DNS attempts with 5s timeouts
time="2018-06-07T12:11:39Z" level=debug msg="running command: dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'"
cmd: "dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'" exitCode: 1 stdout:
time="2018-06-07T12:11:54Z" level=debug msg="running command: dig +tcp app1-service.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp app1-service.default.svc.cluster.local @10.96.0.10" exitCode: 9 stdout:
but kubedns saw that specific request, actually! (probably)
I0607 12:11:39.095446 1 dns.go:637] Query for "app1-service.default.svc.cluster.local.", exact: false
I0607 12:11:39.095473 1 dns.go:866] Not a federation query: len(["app1-service" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:11:39.095486 1 dns.go:757] Found 1 records for [local cluster svc default app1-service] in the cache
I0607 12:11:39.095490 1 dns.go:764] getRecordsForPath retval=[{Host:10.108.31.44 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/app1-service/3561393439653439}], path=[local cluster svc default app1-service]
I0607 12:11:39.095500 1 dns.go:666] Records for app1-service.default.svc.cluster.local.: [{10.108.31.44 0 10 10 false 30 0 /skydns/local/cluster/svc/default/app1-service/3561393439653439}]
and had the DNS data already
I0607 12:11:23.712068 1 dns.go:304] New service: app1-service
Edit:
In fact, it only logs UDP DNS queries. The TCP lookups aren't showing up but the next UDP lookup from above does:
I0607 12:12:24.075910 1 dns.go:637] Query for "kubernetes.default.svc.cluster.local.", exact: false
I0607 12:12:24.075934 1 dns.go:866] Not a federation query: len(["kubernetes" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:12:24.075952 1 dns.go:757] Found 1 records for [local cluster svc default kubernetes] in the cache
I0607 12:12:24.075960 1 dns.go:764] getRecordsForPath retval=[{Host:10.96.0.1 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/kubernetes/6266366134663036}], path=[local cluster svc default kubernetes]
I0607 12:12:24.075974 1 dns.go:666] Records for kubernetes.default.svc.cluster.local.: [{10.96.0.1 0 10 10 false 30 0 /skydns/local/cluster/svc/default/kubernetes/6266366134663036}]
I0607 12:12:24.117509 1 dns.go:637] Query for "app1-service.default.svc.cluster.local.", exact: false
I0607 12:12:24.117532 1 dns.go:866] Not a federation query: len(["app1-service" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:12:24.117545 1 dns.go:757] Found 1 records for [local cluster svc default app1-service] in the cache
I0607 12:12:24.117550 1 dns.go:764] getRecordsForPath retval=[{Host:10.108.31.44 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/app1-service/3561393439653439}], path=[local cluster svc default app1-service]
I0607 12:12:24.117560 1 dns.go:666] Records for app1-service.default.svc.cluster.local.: [{10.108.31.44 0 10 10 false 30 0 /skydns/local/cluster/svc/default/app1-service/3561393439653439}]
@eloycoto we have CPU/memory metrics for the jenkins servers, right? Could they be overloaded? it's really odd that the logs just freeze :/ Did the 5GB RAM per VM change apply to the CI ones too? and could we be swapping on the host?
Just a note about this last failure:
There is a cilium agent that it is in a weird state:
Cilium rkcdv status:
KVStore: Ok etcd: 1/1 connected: http://k8s1:9732 - 3.1.0 (Leader)
ContainerRuntime: Ok
Kubernetes: Ok OK
Kubernetes APIs: ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint", "core/v1::Node"]
Cilium: Ok OK
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPv4 address pool: 6/255 allocated
IPv6 address pool: 6/4294967295 allocated
Controller Status: 30/30 healthy
Name Last success Last error Count Message
sync-to-k8s-ciliumendpoint (64189) 9s ago never 0 no error
sync-identity-to-k8s-pod (38061) 53s ago never 0 no error
sync-IPv4-identity-mapping (13949) 3m55s ago never 0 no error
sync-identity-to-k8s-pod (29381) 51s ago never 0 no error
sync-to-k8s-ciliumendpoint (29898) 5s ago never 0 no error
sync-IPv6-identity-mapping (29898) 4m6s ago never 0 no error
resolve-identity-29381 3m55s ago never 0 no error
sync-cnp-policy-status (v2 default/rule1) 8m41s ago never 0 no error
sync-IPv6-identity-mapping (13949) 3m55s ago never 0 no error
sync-IPv4-identity-mapping (29381) 3m54s ago never 0 no error
sync-IPv4-identity-mapping (64189) 3m59s ago never 0 no error
sync-IPv6-identity-mapping (38061) 3m56s ago never 0 no error
kvstore-etcd-session-renew never never 0 no error
resolve-identity-29898 4m6s ago never 0 no error
sync-identity-to-k8s-pod (64189) 58s ago never 0 no error
sync-IPv6-identity-mapping (64189) 3m59s ago never 0 no error
kvstore-lease-keepalive 4m10s ago never 0 no error
sync-IPv4-identity-mapping (29898) 4m6s ago never 0 no error
resolve-identity-38061 3m56s ago never 0 no error
sync-to-k8s-ciliumendpoint (29381) 8s ago never 0 no error
sync-to-k8s-ciliumendpoint (13949) 7s ago never 0 no error
sync-to-k8s-ciliumendpoint (38061) 8s ago never 0 no error
sync-IPv4-identity-mapping (38061) 3m56s ago never 0 no error
sync-to-k8s-ciliumendpoint-gc (k8s1) 4s ago never 0 no error
resolve-identity-64189 3m59s ago never 0 no error
resolve-identity-13949 3m55s ago never 0 no error
sync-identity-to-k8s-pod (13949) 49s ago 8m53s ago 0 no error
sync-IPv6-identity-mapping (29381) 3m54s ago never 0 no error
ipcache-bpf-garbage-collection 4m8s ago never 0 no error
sync-identity-to-k8s-pod (29898) 6s ago never 0 no error
Proxy Status: OK, ip 10.10.0.44, port-range 10000-20000
Cluster health: 0/2 reachable (2018-06-07T12:14:03Z)
Name IP Reachable Endpoints reachable
k8s1 (localhost) 192.168.36.11 true false
k8s2 192.168.36.12 true false
Can't connect to the local endpoints? And if you see the DNS is on node k8s2, so
looks like that the service is not working accross the tunnel at all.
The dig is from k8s1 to the service(that it's localted in k8s2.)
If I see the trace, on restore the endpoint I saw this message, why?
2018-06-07T12:05:07.531354464Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=8f18b7c888 endpointID=64189 ingress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [k8s:zgroup=testapp k8s:io.kubernetes.pod.namespace=default k8s:id=app1] []}" policyID=8333 policyRevision=0
2018-06-07T12:05:07.531396295Z level=debug msg="Evaluating egress context for source PolicyID" containerID=8f18b7c888 egress_context="{0 0 <nil> [k8s:zgroup=testapp k8s:io.kubernetes.pod.namespace=default k8s:id=app1] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" endpointID=64189 policyID=8333 policyRevision=0
2018-06-07T12:05:07.531428783Z level=debug msg="egress verdict: denied" endpointID=64189 labels="[k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system]" policyID=8333
Plus, this endpoint has a lot of logs:
[
{
"id": 16935,
"spec": {
"label-configuration": {
"user": []
},
"options": {
"Conntrack": "Enabled",
"ConntrackAccounting": "Enabled",
"ConntrackLocal": "Disabled",
"Debug": "Enabled",
"DebugLB": "Disabled",
"DropNotification": "Enabled",
"EgressPolicy": "Disabled",
"IngressPolicy": "Disabled",
"NAT46": "Disabled",
"TraceNotification": "Enabled"
}
},
"status": {
"controllers": [
{
"configuration": {
"error-retry": true,
"interval": "5m0s"
},
"name": "sync-IPv4-identity-mapping (16935)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T12:09:51.761Z",
"success-count": 2
},
"uuid": "fbad37e4-6a4a-11e8-853b-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "5m0s"
},
"name": "sync-IPv6-identity-mapping (16935)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T12:09:51.996Z",
"success-count": 2
},
"uuid": "fbad3884-6a4a-11e8-853b-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "1m0s"
},
"name": "sync-identity-to-k8s-pod (16935)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T12:13:52.093Z",
"success-count": 10
},
"uuid": "fbad3761-6a4a-11e8-853b-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "10s"
},
"name": "sync-to-k8s-ciliumendpoint (16935)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T12:13:52.644Z",
"success-count": 55
},
"uuid": "fbac7a27-6a4a-11e8-853b-080027aa233b"
}
],
"external-identifiers": {
"container-id": "e3179b830e0db851551f778b7b8973be1210a58e4ff6cc8ee0d6a9610691d298",
"container-name": "k8s_POD_kube-dns-66cc9bd784-wrl8s_kube-system_b0a24982-6a46-11e8-8cfe-080027aa233b_0",
"pod-name": "kube-system:kube-dns-66cc9bd784-wrl8s"
},
"health": {
"bpf": "OK",
"connected": true,
"overallHealth": "OK",
"policy": "OK"
},
"identity": {
"id": 8333,
"labels": [
"k8s:io.kubernetes.pod.namespace=kube-system",
"k8s:k8s-app=kube-dns"
],
"labelsSHA256": "1b2bafe403e0d433f8307836b07ade2cf802511501f880917e128b7f663ac313"
},
"labels": {
"derived": [
"container:annotation.kubernetes.io/config.seen=2018-06-07T11:41:20.811938219Z",
"container:annotation.kubernetes.io/config.source=api",
"container:annotation.scheduler.alpha.kubernetes.io/critical-pod",
"container:io.kubernetes.container.name=POD",
"container:io.kubernetes.docker.type=podsandbox",
"container:io.kubernetes.pod.name=kube-dns-66cc9bd784-wrl8s",
"container:io.kubernetes.pod.uid=b0a24982-6a46-11e8-8cfe-080027aa233b",
"k8s:pod-template-hash=2277568340"
],
"disabled": [],
"realized": {
"user": []
},
"security-relevant": [
"k8s:io.kubernetes.pod.namespace=kube-system",
"k8s:k8s-app=kube-dns"
]
},
"log": [
{
"code": "OK",
"message": "Successfully regenerated endpoint program due to endpoint policy updated & changes were needed",
"state": "ready",
"timestamp": "2018-06-07T12:09:58Z"
}
],
"networking": {
"addressing": [
{
"ipv4": "10.10.1.131",
"ipv6": "f00d::a0a:100:0:4227"
}
],
"host-mac": "4a:9e:c7:f9:d2:ab",
"interface-index": 14,
"interface-name": "lxce3179",
"mac": "ae:54:f4:d0:31:43"
},
"policy": {
"proxy-statistics": [],
"realized": {
"allowed-egress-identities": [],
"allowed-ingress-identities": [
1
],
"build": 30,
"cidr-policy": {
"egress": [],
"ingress": []
},
"id": 8333,
"l4": {
"egress": [],
"ingress": []
},
"policy-enabled": "none",
"policy-revision": 30
},
"spec": {
"allowed-egress-identities": [],
"allowed-ingress-identities": [
1
],
"build": 30,
"cidr-policy": {
"egress": [],
"ingress": []
},
"id": 8333,
"l4": {
"egress": [],
"ingress": []
},
"policy-enabled": "none",
"policy-revision": 30
}
},
"realized": {
"label-configuration": {
"user": []
},
"options": {
"Conntrack": "Enabled",
"ConntrackAccounting": "Enabled",
"ConntrackLocal": "Disabled",
"Debug": "Enabled",
"DebugLB": "Disabled",
"DropNotification": "Enabled",
"EgressPolicy": "Disabled",
"IngressPolicy": "Disabled",
"NAT46": "Disabled",
"TraceNotification": "Enabled"
}
},
"state": "ready"
}
}
]
Timestamp Status State Message
2018-06-07T12:09:58Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:09:58Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:09:56Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:09:56Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T12:05:22Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:05:22Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:05:21Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:05:21Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T12:04:58Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:04:58Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:04:56Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:04:56Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T12:04:56Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:04:56Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:04:54Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:04:54Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T12:04:53Z OK ready Successfully regenerated endpoint program due to syncing state to host
2018-06-07T12:04:53Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:04:51Z OK regenerating Regenerating Endpoint BPF: syncing state to host
2018-06-07T12:04:51Z OK waiting-to-regenerate Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T12:04:51Z OK restoring Synchronizing endpoint labels with KVStore
2018-06-07T12:04:51Z OK restoring Restoring endpoint from previous cilium instance
2018-06-07T12:01:24Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:01:24Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T12:01:17Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T12:01:08Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:01:08Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:01:08Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:01:05Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T12:01:04Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:59:51Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:33Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:26Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:26Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:08Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:00Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:58:51Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:51Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:51Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:51Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:50Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:50Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:50Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:50Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:49Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:49Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:49Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:49Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:48Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:48Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:48Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:48Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:47Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:47Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:47Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:47Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:44Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:44Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:44Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:44Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:58Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:58Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:45Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:45Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:45Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:45Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:45Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:43Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:42Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:42Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:42Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:42Z OK waiting-to-regenerate Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:42Z Warning waiting-to-regenerate Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:42Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:42Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:42Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:42Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:42Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:42Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:42Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:35Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:20Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:12Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:56:53Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:56:53Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:56:53Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:56:53Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:56:52Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:56:52Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:56:52Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:56:52Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:56:52Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:56:52Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:56:52Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:56:52Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:56:00Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:55:42Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:55:25Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:55:17Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:54:59Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:54:01Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:53:43Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:53:25Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:53:06Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:52:45Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:52:45Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:52:45Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:52:45Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:52:45Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:52:03Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:51:45Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:50:53Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:50:35Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:50:12Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:46Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:28Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:20Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:06Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-07T11:48:02Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:48:02Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:48:02Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:48:02Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:48:01Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:48:01Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:48:01Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:48:01Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:48:01Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:48:01Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:48:01Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:48:01Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:45Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:47:45Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:47:45Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:47:45Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:44Z OK ready Successfully regenerated endpoint program due to syncing state to host
2018-06-07T11:47:44Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:47:43Z OK regenerating Regenerating Endpoint BPF: syncing state to host
2018-06-07T11:47:43Z OK waiting-to-regenerate Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T11:47:43Z OK ready Synchronizing endpoint labels with KVStore
2018-06-07T11:47:43Z Failure ready Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:47:43Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:47:43Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:47:42Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:42Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:42Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:42Z OK restoring Restoring endpoint from previous cilium instance
2018-06-07T11:45:48Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:48Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:46Z OK ready Successfully regenerated endpoint program due to syncing state to host
2018-06-07T11:45:46Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:44Z OK regenerating Regenerating Endpoint BPF: syncing state to host
2018-06-07T11:45:44Z OK waiting-to-regenerate Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T11:45:44Z OK ready Synchronizing endpoint labels with KVStore
2018-06-07T11:45:43Z Failure ready Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:45:43Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:43Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:43Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:43Z Failure ready Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:45:43Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:43Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:43Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:43Z Failure ready Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:45:43Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:43Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:43Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:43Z OK restoring Restoring endpoint from previous cilium instance
2018-06-07T11:43:54Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:43:53Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:53Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:43:53Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:43:53Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:43:53Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:50Z OK ready Successfully regenerated endpoint program due to syncing state to host
2018-06-07T11:43:50Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:43:49Z OK regenerating Regenerating Endpoint BPF: syncing state to host
2018-06-07T11:43:49Z OK waiting-to-regenerate Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T11:43:49Z OK ready Synchronizing endpoint labels with KVStore
2018-06-07T11:43:49Z Failure ready Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:43:49Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:43:49Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z OK restoring Restoring endpoint from previous cilium instance
2018-06-07T11:41:42Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:41:41Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:25Z OK ready Successfully regenerated endpoint program due to updated security labels
2018-06-07T11:41:25Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:41:24Z OK regenerating Regenerating Endpoint BPF: updated security labels
2018-06-07T11:41:24Z OK waiting-to-regenerate Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:41:24Z Warning waiting-to-regenerate Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:41:23Z OK waiting-to-regenerate Triggering regeneration due to new identity
2018-06-07T11:41:23Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:41:23Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:22Z OK ready Successfully regenerated endpoint program due to updated security labels
2018-06-07T11:41:22Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:41:21Z OK regenerating Regenerating Endpoint BPF: updated security labels
2018-06-07T11:41:21Z OK waiting-to-regenerate Triggering regeneration due to new identity
2018-06-07T11:41:21Z OK ready Set identity for this endpoint
ID LABELS
8333 [k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns]
Is the init policy again in place? Maybe is that?
The kube-dns has the request in there:
app1-service
I0607 12:05:14.942276 1 dns.go:305] Service details: &Service{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app1-service,GenerateName:,Namespace:default,SelfLink:/api/v1/namespaces/default/services/app1-service,UID:0982d6b8-6a4b-11e8-8cfe-080027aa233b,ResourceVersion:5548,Generation:0,CreationTimestamp:2018-06-07 12:05:14 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"app1-service","namespace":"default"},"spec":{"ports":[{"port":80}],"selector":{"id":"app1"}}}
,},OwnerReferences:[],Finalizers:[],ClusterName:,},Spec:ServiceSpec{Ports:[{ TCP 80 {0 80 } 0}],Selector:map[string]string{id: app1,},ClusterIP:10.108.31.44,Type:ClusterIP,ExternalIPs:[],DeprecatedPublicIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[],},},}
I0607 12:05:14.942339 1 util.go:68] Constructed new DNS record: &{10.108.31.44 0 10 10 false 30 0 }, hash:5a949e49
I0607 12:05:14.942353 1 util.go:68] Constructed new DNS record: &{app1-service.default.svc.cluster.local. 0 10 10 false 30 0 }, hash:e4edfa2d
I0607 12:05:16.047579 1 dns.go:482] No service for
I0607 12:05:38.604554 1 dns.go:637] Query for "app1-service.default.svc.cluster.local.", exact: false
I0607 12:05:38.604575 1 dns.go:866] Not a federation query: len(["app1-service" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:05:38.604588 1 dns.go:757] Found 1 records for [local cluster svc default app1-service] in the cache
I0607 12:05:38.604594 1 dns.go:764] getRecordsForPath retval=[{Host:10.108.31.44 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/app1-service/3561393439653439}], path=[local cluster svc default app1-service]
I0607 12:05:38.604617 1 dns.go:666] Records for app1-service.default.svc.cluster.local.: [{10.108.31.44 0 10 10 false 30 0 /skydns/local/cluster/svc/default/app1-service/3561393439653439}]
I0607 12:05:38.829950 1 dns.go:637] Query for "kubernetes.default.svc.cluster.local.", exact: false
I've just review&compare this one:
https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2730/
f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master.1-1.zip
And:
DIRECTION LABELS (source:key[=value]) PORT/PROTO BYTES PACKETS
Ingress reserved:host ANY 0 0
[
{
"id": 57415,
"spec": {
"label-configuration": {
"user": []
},
"options": {
"Conntrack": "Enabled",
"ConntrackAccounting": "Enabled",
"ConntrackLocal": "Disabled",
"Debug": "Enabled",
"DebugLB": "Disabled",
"DropNotification": "Enabled",
"EgressPolicy": "Disabled",
"IngressPolicy": "Disabled",
"NAT46": "Disabled",
"TraceNotification": "Enabled"
}
},
"status": {
"controllers": [
{
"configuration": {
"error-retry": true,
"interval": "5m0s"
},
"name": "sync-IPv4-identity-mapping (57415)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T02:33:21.130Z",
"success-count": 2
},
"uuid": "72110962-69fa-11e8-a7f4-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "5m0s"
},
"name": "sync-IPv6-identity-mapping (57415)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T02:33:21.130Z",
"success-count": 2
},
"uuid": "72110cc4-69fa-11e8-a7f4-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "1m0s"
},
"name": "sync-identity-to-k8s-pod (57415)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T02:36:21.297Z",
"success-count": 9
},
"uuid": "7211037f-69fa-11e8-a7f4-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "10s"
},
"name": "sync-to-k8s-ciliumendpoint (57415)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-06-07T02:36:43.965Z",
"success-count": 51
},
"uuid": "72109305-69fa-11e8-a7f4-080027aa233b"
}
],
"external-identifiers": {
"container-id": "b774fb41d748cae3d1f454765aa16af647efe3cb66b3576b49b039dc16e24f82",
"container-name": "k8s_POD_kube-dns-1992960195-39zrf_kube-system_ffe1c5e3-69f8-11e8-bc1d-080027aa233b_0",
"pod-name": "kube-system:kube-dns-1992960195-39zrf"
},
"health": {
"bpf": "OK",
"connected": true,
"overallHealth": "OK",
"policy": "OK"
},
"identity": {
"id": 57084,
"labels": [
"k8s:io.kubernetes.pod.namespace=kube-system",
"k8s:k8s-app=kube-dns"
],
"labelsSHA256": "1b2bafe403e0d433f8307836b07ade2cf802511501f880917e128b7f663ac313"
},
"labels": {
"derived": [
"container:annotation.kubernetes.io/config.seen=2018-06-07T02:26:12.326781187Z",
"container:annotation.kubernetes.io/config.source=api",
container:annotation.kubernetes.io/created-by={
"apiVersion": "v1",
"kind": "SerializedReference",
"reference": {
"apiVersion": "extensions",
"kind": "ReplicaSet",
"name": "kube-dns-1992960195",
"namespace": "kube-system",
"resourceVersion": "370",
"uid": "ffe02ef6-69f8-11e8-bc1d-080027aa233b"
}
}
,
"container:annotation.scheduler.alpha.kubernetes.io/critical-pod",
"container:io.kubernetes.container.name=POD",
"container:io.kubernetes.docker.type=podsandbox",
"container:io.kubernetes.pod.name=kube-dns-1992960195-39zrf",
"container:io.kubernetes.pod.uid=ffe1c5e3-69f8-11e8-bc1d-080027aa233b",
"k8s:pod-template-hash=1992960195"
],
"disabled": [],
"realized": {
"user": []
},
"security-relevant": [
"k8s:io.kubernetes.pod.namespace=kube-system",
"k8s:k8s-app=kube-dns"
]
},
"log": [
{
"code": "OK",
"message": "Successfully regenerated endpoint program due to endpoint policy updated & changes were needed",
"state": "ready",
"timestamp": "2018-06-07T02:33:27Z"
}
],
"networking": {
"addressing": [
{
"ipv4": "10.10.1.179",
"ipv6": "f00d::a0a:100:0:e047"
}
],
"host-mac": "66:ec:48:9e:67:8d",
"interface-index": 14,
"interface-name": "lxcb774f",
"mac": "ca:f3:53:c2:88:2b"
},
"policy": {
"proxy-statistics": [],
"realized": {
"allowed-egress-identities": [],
"allowed-ingress-identities": [
1
],
"build": 7,
"cidr-policy": {
"egress": [],
"ingress": []
},
"id": 57084,
"l4": {
"egress": [],
"ingress": []
},
"policy-enabled": "none",
"policy-revision": 7
},
"spec": {
"allowed-egress-identities": [],
"allowed-ingress-identities": [
1
],
"build": 7,
"cidr-policy": {
"egress": [],
"ingress": []
},
"id": 57084,
"l4": {
"egress": [],
"ingress": []
},
"policy-enabled": "none",
"policy-revision": 7
}
},
"realized": {
"label-configuration": {
"user": []
},
"options": {
"Conntrack": "Enabled",
"ConntrackAccounting": "Enabled",
"ConntrackLocal": "Disabled",
"Debug": "Enabled",
"DebugLB": "Disabled",
"DropNotification": "Enabled",
"EgressPolicy": "Disabled",
"IngressPolicy": "Disabled",
"NAT46": "Disabled",
"TraceNotification": "Enabled"
}
},
"state": "ready"
}
}
]
BPF Policy enabled: Reserved:host
Policy get: none.
Logs:
2018-06-07T02:33:26.019881734Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[k8s:id=app3 k8s:io.kubernetes.pod.namespace=default k8s:zgroup=testapp]" policyID=35564
2018-06-07T02:33:26.019883807Z level=debug msg="Evaluating context for destination PolicyID" containerID=cilium-loc ctx="{0 0 <nil> [reserved:health] [reserved:host] []}" endpointID=29898 policyID=host policyRevision=6
2018-06-07T02:33:26.01988918Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=b774fb41d7 endpointID=57415 ingress_context="{0 0 <nil> [k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" policyID=57084 policyRevision=6
2018-06-07T02:33:26.019891684Z level=debug msg="Evaluating egress context for source PolicyID" containerID=b774fb41d7 egress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns] []}" endpointID=57415 policyID=57084 policyRevision=6
2018-06-07T02:33:26.019893985Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns]" policyID=57084
2018-06-07T02:33:26.019896457Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=b774fb41d7 endpointID=57415 ingress_context="{0 0 <nil> [reserved:host] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" policyID=host policyRevision=6
2018-06-07T02:33:26.019912356Z level=debug msg="Evaluating egress context for source PolicyID" containerID=b774fb41d7 egress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [reserved:host] []}" endpointID=57415 policyID=host policyRevision=6
2018-06-07T02:33:26.019914571Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[reserved:host]" policyID=host
2018-06-07T02:33:26.019916556Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=b774fb41d7 endpointID=57415 ingress_context="{0 0 <nil> [reserved:world] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" policyID=world policyRevision=6
2018-06-07T02:33:26.019930435Z level=debug msg="Evaluating egress context for source PolicyID" containerID=b774fb41d7 egress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [reserved:world] []}" endpointID=57415 policyID=world policyRevision=6
2018-06-07T02:33:26.019932762Z level=debug msg="Evaluating context for destination PolicyID" containerID=cilium-loc ctx="{0 0 <nil> [reserved:health] [reserved:world] []}" endpointID=29898 policyID=world policyRevision=6
2018-06-07T02:33:26.019934872Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[reserved:world]" policyID=world
And this can explain why not always fail, because if the kube-dns is on k8s2, the k8s1 dig cannot get access to that.
So, I finally saw a test fail on https://github.com/cilium/cilium/pull/4406 (this tested undoing the MTU changes). The errors are a little different, and the branch doesn't include adde5de7053849bf0475c9047f40b6d20a04316c but that isn't terribly relevant.
tl;dr we seem to be missing routes to clusterIPs, so kubedns can't connect to the kubeapiserver. Cilium can, however, so I'm not sure I interpreted this correctly.
f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master.zip
The most interesting error is from kubedns.log
E0612 19:23:34.370631 143 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:192: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.96.0.1:443: getsockopt: no route to host
Note that kubedns is running on k8s2 but the apiserver is on k8s1. This is what @eloycoto pointed out above. I had hoped it was because of the MTU stuff but it isn't. It's also worth noting that kubedns is the only pod with restartCount >0. kubedns: 5, sidecar:4 and dnsmasq: 4.
The cilium pod on k8s2 is cilium-490kg.
Neither cilium has the kubedns in the lb map.
The cilium-health report is also bad, like what @eloycoto saw:
Probe time: 2018-06-12T19:23:25Z
Nodes:
k8s2 (localhost):
Host connectivity to 192.168.36.12:
ICMP: OK, RTT=409.49µs
HTTP via L3: OK, RTT=329.567µs
Endpoint connectivity to 10.10.1.172:
ICMP: Connection timed out
HTTP via L3: Connection timed out
k8s1:
Host connectivity to 192.168.36.11:
ICMP: OK, RTT=662.562µs
HTTP via L3: OK, RTT=1.031337ms
Endpoint connectivity to 10.10.0.172:
ICMP: Connection timed out
HTTP via L3: Connection timed out
Oddly, cilium doesn't know about kubedns
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
29898 Disabled Disabled 25244 reserved:health f00d::a0a:100:0:74ca 10.10.1.172 ready
We seem to be missing the kube-apiserver cluster IP route (clusterIP: 10.96.0.1, backend: 192.168.36.11). I'm not sure why. The podCIDRs for the nodes are 10.10.0.0/24 and 10.10.1.0/24.
default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100
10.10.0.0/24 via 10.10.1.44 dev cilium_host src 10.10.1.44
10.10.1.0/24 via 10.10.1.44 dev cilium_host src 10.10.1.44
10.10.1.44 dev cilium_host scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.28.128.0/24 dev enp0s9 proto kernel scope link src 172.28.128.5
192.168.36.0/24 dev enp0s8 proto kernel scope link src 192.168.36.12
default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100
10.10.0.0/24 via 10.10.0.44 dev cilium_host src 10.10.0.44
10.10.0.44 dev cilium_host scope link
10.10.1.0/24 via 10.10.0.44 dev cilium_host src 10.10.0.44
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.28.128.0/24 dev enp0s9 proto kernel scope link src 172.28.128.4
192.168.36.0/24 dev enp0s8 proto kernel scope link src 192.168.36.11
Looking through the logs for 172.17 we see
$ grep -E '172.17' cilium-490kg-logs.log
2018-06-12T19:15:20.565200637Z level=debug msg="Considering route" route="{Ifindex: 5 Dst: 172.17.0.0/16 Src: 172.17.0.1 Gw: <nil> Flags: [] Table: 254}"
2018-06-12T19:15:20.568913369Z level=debug msg="Considering removing iptables rule" obj="-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE"
2018-06-12T19:15:20.568923164Z level=debug msg="Considering removing iptables rule" obj="-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 5000 -j MASQUERADE"
2018-06-12T19:15:20.576692598Z level=debug msg="Considering removing iptables rule" obj="-A DOCKER ! -i docker0 -p tcp -m tcp --dport 5000 -j DNAT --to-destination 172.17.0.2:5000"
2018-06-12T19:15:20.594023875Z level=debug msg="Considering removing iptables rule" obj="-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5000 -j ACCEPT"
2018-06-12T19:15:21.990182549Z level=debug msg="Skipping network because of gateway mismatch" endpointID=e78e9f05f07d266375ad9fc29fb94197bc492bed7dd83cda5c1a8c2362d4414e obj="&{<nil> [] [] ae6644e1a2b3b1c607dc92bd29bdc96e5196b32414d1784d383063693cc4548b e78e9f05f07d266375ad9fc29fb94197bc492bed7dd83cda5c1a8c2362d4414e 172.17.0.1 172.17.0.2 16 0 02:42:ac:11:00:02}" subsys=containerd-watcher
and looking for that message
pkg/workloads/docker.go:380: scopedLog.WithField(logfields.Object, contNetwork).Debug("Skipping network because of gateway mismatch")
Oddly, that happens in getCiliumIPv6 and that is used when deciding to ignore workloads in pkg/workloads/docker.go.
10.96 is mentioned in a similar context:
2018-06-12T19:15:20.577310354Z level=debug msg="Considering removing iptables rule" obj="-A KUBE-SERVICES ! -s 10.10.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment \"default/kubernetes:https cluster IP\" -m tcp --dport 443 -j KUBE-MARK-MASQ"
2018-06-12T19:15:20.577394862Z level=debug msg="Considering removing iptables rule" obj="-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment \"default/kubernetes:https cluster IP\" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y"
but has no skip/remove log
@eloycoto could this be a conflict between a hardcoded podCIDR CLI option to cilium and a derived value from k8s?
Today I got this one in my laptop:
Traces:
K8s1:k8s1-monitor.log
k8s2:k8s2-monitor.log
First Sysdymp:
cilium-sysdump-20180615-131816.zip
Second Sysdump:
cilium-sysdump-20180615-134008.zip
If you see the list of pods, the 10.10.1.66 IP is duplicated:
Every 2.0s: kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default testclient-gddnl 1/1 Running 0 20m 10.10.0.32 k8s1
default testclient-zz5p6 1/1 Running 0 20m 10.10.1.32 k8s2
default testds-2mvvn 1/1 Running 0 20m 10.10.1.66 k8s2
default testds-4fsvg 1/1 Running 0 20m 10.10.0.66 k8s1
kube-system cilium-5zhqx 1/1 Running 0 24m 192.168.36.11 k8s1
kube-system cilium-db2mx 1/1 Running 0 24m 192.168.36.12 k8s2
kube-system etcd-k8s1 1/1 Running 1 1h 192.168.36.11 k8s1
kube-system kube-apiserver-k8s1 1/1 Running 1 1h 192.168.36.11 k8s1
kube-system kube-controller-manager-k8s1 1/1 Running 1 1h 192.168.36.11 k8s1
kube-system kube-dns-59d8c5f9b5-7m225 2/3 CrashLoopBackOff 39 1h 10.10.1.66 k8s2
kube-system kube-proxy-5tfmw 1/1 Running 1 1h 192.168.36.12 k8s2
kube-system kube-proxy-zclzl 1/1 Running 1 1h 192.168.36.11 k8s1
kube-system kube-scheduler-k8s1 1/1 Running 1 1h 192.168.36.11 k8s1
1) Clean the CT and it didn't work.
2) 2/3 kube-dns are created, but is not in the endpoint list.
3) Testing connectivity from k8s1 to k8s2 and it works (It's not a vbox problem)
4) Test connectivity from k8s2 to k8s1 api and it's works:
vagrant@k8s1:~$ kubectl exec -ti testclient-zz5p6 -- curl -v https://10.96.0.1:443 --insecure
* Rebuilt URL to: https://10.96.0.1:443/
* Trying 10.96.0.1...
* TCP_NODELAY set
* Connected to 10.96.0.1 (10.96.0.1) port 443 (#0)
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: CN=kube-apiserver
* start date: Jun 15 11:54:08 2018 GMT
* expire date: Jun 15 11:54:08 2019 GMT
* issuer: CN=kubernetes
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
> GET / HTTP/1.1
> Host: 10.96.0.1
> User-Agent: curl/7.59.0
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Fri, 15 Jun 2018 13:38:58 GMT
< Content-Length: 233
<
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
"reason": "Forbidden",
"details": {
},
"code": 403
* Connection #0 to host 10.96.0.1 left intact
}vagrant@k8s1:~$
The message from kube-dns is still that can't reach 10.96.0.1:443
After seen this one in my laptop, I think that it's not a datapath issue at all. The endpoint trying to regenerate and always fails, the lookup to endpoint manager is always nil, and ep=nil:
More logs:
vagrant@k8s2:~$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bcd18fdf5852 8a7739f672b4 "/sidecar --v=2 --lo…" About a minute ago Exited (2) 25 seconds ago k8s_sidecar_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_22
cf262dd02909 55ffe31ac578 "/kube-dns --domain=…" 4 minutes ago Exited (255) 3 minutes ago k8s_kubedns_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_17
7dde69c74df2 6816817d9dce "/dnsmasq-nanny -v=5…" 4 minutes ago Exited (137) 2 minutes ago k8s_dnsmasq_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_20
77433b91049c cilium/demo-client "sleep 1000h" About an hour ago Up About an hour k8s_web_testclient-zz5p6_default_47b0a741-709e-11e8-b8cf-0800274fedc6_0
d6207c6a6cb1 cilium/demo-httpd "httpd-foreground" About an hour ago Up About an hour k8s_web_testds-2mvvn_default_47a7ac1a-709e-11e8-b8cf-0800274fedc6_0
ba7749c5d5c2 k8s.gcr.io/pause-amd64:3.1 "/pause" About an hour ago Up About an hour k8s_POD_testclient-zz5p6_default_47b0a741-709e-11e8-b8cf-0800274fedc6_0
ede995f28e11 k8s.gcr.io/pause-amd64:3.1 "/pause" About an hour ago Up About an hour k8s_POD_testds-2mvvn_default_47a7ac1a-709e-11e8-b8cf-0800274fedc6_0
26f6f9b98f5a cilium/cilium "cilium-agent --debu…" About an hour ago Up About an hour k8s_cilium-agent_cilium-db2mx_kube-system_baa559da-709d-11e8-b8cf-0800274fedc6_0
33db50b78b93 8c811b4aec35 "sh -c 'if [ \"${CLEA…" About an hour ago Exited (0) About an hour ago k8s_clean-cilium-state_cilium-db2mx_kube-system_baa559da-709d-11e8-b8cf-0800274fedc6_0
2f38b0d512df k8s.gcr.io/pause-amd64:3.1 "/pause" About an hour ago Up About an hour k8s_POD_cilium-db2mx_kube-system_baa559da-709d-11e8-b8cf-0800274fedc6_0
8738c05c7b2f k8s.gcr.io/pause-amd64:3.1 "/pause" 2 hours ago Up 2 hours k8s_POD_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_1
ee091faeaca0 4261d315109d "/usr/local/bin/kube…" 2 hours ago Up 2 hours k8s_kube-proxy_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_1
bf7736dcf6d9 k8s.gcr.io/pause-amd64:3.1 "/pause" 2 hours ago Up 2 hours k8s_POD_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_1
e439b65e1f1a k8s.gcr.io/kube-proxy-amd64 "/usr/local/bin/kube…" 2 hours ago Exited (255) 2 hours ago k8s_kube-proxy_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_0
a4965ede214f k8s.gcr.io/pause-amd64:3.1 "/pause" 2 hours ago Exited (255) 2 hours ago k8s_POD_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_0
c6fb55808bad registry:2 "/entrypoint.sh /etc…" 9 days ago Up 2 hours 0.0.0.0:5000->5000/tcp registry
vagrant@k8s2:~$ docker inspect 55ffe31ac578
[
{
"Id": "sha256:55ffe31ac5789896865300074e914ff3b699050663a096f757f8ecc1d71c8aad",
"RepoTags": [
"k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
],
"RepoDigests": [
"k8s.gcr.io/k8s-dns-kube-dns-amd64@sha256:b99fc3eee2a9f052f7eb4cc00f15eb12fc405fa41019baa2d6b79847ae7284a8"
],
"Parent": "",
"Comment": "",
"Created": "2018-04-16T21:58:45.624829023Z",
"Container": "afa9187fd1abd28f56f4b399251f7154bfb82ec30957bb42d6e8ff5ac7f1b612",
"ContainerConfig": {
"Hostname": "afa9187fd1ab",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"/bin/sh",
"-c",
"#(nop) ",
"ENTRYPOINT [\"/kube-dns\"]"
],
"ArgsEscaped": true,
"Image": "sha256:f2b4439632672cabd98dfb861fb1ee420568876ec7ccf23c63928462329f8f1e",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": [
"/kube-dns"
],
"OnBuild": null,
"Labels": {}
},
"DockerVersion": "17.12.0-ce",
"Author": "Bowei Du <[email protected]>",
"Config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": null,
"ArgsEscaped": true,
"Image": "sha256:f2b4439632672cabd98dfb861fb1ee420568876ec7ccf23c63928462329f8f1e",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": [
"/kube-dns"
],
"OnBuild": null,
"Labels": null
},
"Architecture": "amd64",
"Os": "linux",
"Size": 49549457,
"VirtualSize": 49549457,
"GraphDriver": {
"Data": {
"LowerDir": "/var/lib/docker/overlay2/05f015ed83f0947b896a8865466c9d84969b0054e2d71b1820375cf1818061da/diff",
"MergedDir": "/var/lib/docker/overlay2/120258e0074e6dc222397c7de56dc01f5c89662d3ed4061c4b79b5e4fd97fc18/merged",
"UpperDir": "/var/lib/docker/overlay2/120258e0074e6dc222397c7de56dc01f5c89662d3ed4061c4b79b5e4fd97fc18/diff",
"WorkDir": "/var/lib/docker/overlay2/120258e0074e6dc222397c7de56dc01f5c89662d3ed4061c4b79b5e4fd97fc18/work"
},
"Name": "overlay2"
},
"RootFS": {
"Type": "layers",
"Layers": [
"sha256:cd7100a72410606589a54b932cabd804a17f9ae5b42a1882bd56d263e02b6215",
"sha256:1a0adecd63853941c0436bb4e1299c2ebebfd89f0b25c2e93c1cd7d518d12b89"
]
},
"Metadata": {
"LastTagTime": "0001-01-01T00:00:00Z"
}
}
]
Similar issue at https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/4188/testReport/junit/k8s-1/10/K8sValidatedUpdates_Updating_Cilium_stable_to_master/
Timeout reached: DNS \"app1-service.default.svc.cluster.local\" is not ready after timeout
@jrajahalme I noticed you rebase after the failure. When it failed, did your branch include https://github.com/cilium/cilium/pull/4512 ?
Similarly, hit the same message as @jrajahalme on PR #4587, rebased against master today:
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/4358/testReport/junit/k8s-1/7/K8sValidatedUpdates_Updating_Cilium_stable_to_master/
f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master.zip