Cilium: CI: 1.9.K8sValidatedPolicyTest - DNS \"redis-master.default.svc.cluster.local\" is not ready after timeout

Created on 24 Apr 2018  Â·  27Comments  Â·  Source: cilium/cilium

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-K8s/139/testReport/junit/k8s-1/9/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-K8s/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:312
DNS entry is not ready after timeout
Expected
    <*errors.errorString | 0xc420049cb0>: {
        s: "Timeout reached: DNS \"redis-master.default.svc.cluster.local\" is not ready after timeout",
    }
to be nil
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-K8s/src/github.com/cilium/cilium/test/k8sT/Policies.go:723

PR https://github.com/cilium/cilium/pull/3874

test_results_Cilium-PR-Ginkgo-Tests-K8s_139.zip

areCI kinbuCI priorithigh

All 27 comments

Looking at this a little bit. It looks like kubedns just didn't have the service in it. This is really odd and doesn't make any sense :/ We recently added some extra digs on fail and they show that the connection to the DNS server is running:

time="2018-04-24T15:07:41Z" level=debug msg="running command: dig +short redis-master.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
cmd: "dig +short redis-master.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'" exitCode: 1 

time="2018-04-24T15:07:41Z" level=debug msg="running command: dig +tcp redis-master.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp redis-master.default.svc.cluster.local @10.96.0.10" exitCode: 0 

; <<>> DiG 9.10.3-P4-Ubuntu <<>> +tcp redis-master.default.svc.cluster.local @10.96.0.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5477
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;redis-master.default.svc.cluster.local.    IN A

;; AUTHORITY SECTION:
cluster.local.      60  IN  SOA ns.dns.cluster.local. hostmaster.cluster.local. 1524582000 28800 7200 604800 60

;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Apr 24 15:07:41 UTC 2018
;; MSG SIZE  rcvd: 149

both ciliums have the correct service IP and backend IP in the bfp lb map (10.97.187.12 -> 10.10.0.86) so somewhere a service object was created and we saw it.

kubedns show some errors but the latter one is during tear-down because of the failure. The former is odd but it's for kafka not redis :/

I0424 14:45:42.338095       1 dns.go:576] Could not find endpoints for service "kafka-service" in namespace "default". DNS records will be created once endpoints show up.
E0424 15:08:01.323029       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.10.0.66:49742->10.96.0.1:443: read: connection timed out

Note: I noticed that our cilium bpf lb list always has a 0.0.0.0 (0) IP listed. What is it?

@raybejjani the "0.0.0.0 (0)" entry in cilium bpf lb list is normal, it is used in the BPF to hold the number of other service backends that there are (I investigated briefly in https://github.com/cilium/cilium/issues/3905).

Questions about cause / things we need:

  • Loss of connectivity between kube-dns and kube-apiserver?
  • Get more logs from kube-dns (verbose option, most verbose possible?)
  • Potentially fallout from drop-all? Is this from Cilium behavior?
  • Do monitor logs show drops from kube-dns to kube-apiserver?

1) I saw the failures, and no policies and no log that say that. (It should have message)https://stackoverflow.com/questions/48264217/kube-dns-cannot-find-api-server
3) I review few cases, and no policies was applied to Kube-dns (Added PR for more debug)
4) It shows that connects and Kube-dns reply nothing.

2) PR is coming.

Any update on this issue? It's been hit again in the CI: https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/920/testReport/k8s-1/7/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/

Failure output:

/home/jenkins/workspace/cilium-ginkgo_cilium_master-QVKDPVJ2A6E272IRUX7X6DCB5FBLCQDKRTOMHPZS7ZECX3RM5C3A/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:373
DNS entry is not ready after timeout
Expected
    <*errors.errorString | 0xc420a8c0d0>: {
        s: "Timeout reached: DNS \"redis-master.default.svc.cluster.local\" is not ready after timeout",
    }
to be nil
/home/jenkins/workspace/cilium-ginkgo_cilium_master-QVKDPVJ2A6E272IRUX7X6DCB5FBLCQDKRTOMHPZS7ZECX3RM5C3A/src/github.com/cilium/cilium/vendor/github.com/onsi/gomega/internal/assertion/assertion.go:35

Stdout:

STEP: Apply policy to web
STEP: Apply policy to Redis
===================== TEST FAILED =====================
cmd: kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                           READY     STATUS    RESTARTS   AGE       IP              NODE
default       guestbook-5ckt5                1/1       Running   0          5m        10.15.55.223    k8s2
default       redis-master-xqmwg             1/1       Running   0          5m        10.10.137.195   k8s1
default       redis-slave-bm06s              1/1       Running   0          5m        10.10.120.29    k8s1
kube-system   cilium-5jm29                   1/1       Running   0          14m       192.168.36.11   k8s1
kube-system   cilium-8qm0c                   1/1       Running   0          14m       192.168.36.12   k8s2
kube-system   etcd-k8s1                      1/1       Running   0          20m       192.168.36.11   k8s1
kube-system   kube-apiserver-k8s1            1/1       Running   0          21m       192.168.36.11   k8s1
kube-system   kube-controller-manager-k8s1   1/1       Running   0          20m       192.168.36.11   k8s1
kube-system   kube-dns-1992960195-zfx3q      3/3       Running   0          21m       10.10.197.21    k8s1
kube-system   kube-proxy-bf0v2               1/1       Running   0          21m       192.168.36.11   k8s1
kube-system   kube-proxy-p2191               1/1       Running   0          14m       192.168.36.12   k8s2
kube-system   kube-scheduler-k8s1            1/1       Running   0          21m       192.168.36.11   k8s1

cmd: kubectl exec -n kube-system cilium-5jm29 -- cilium service list
ID   Frontend              Backend                   
1    10.96.0.1:443         1 => 192.168.36.11:6443   
2    10.96.0.10:53         1 => 10.10.197.21:53      
6    10.110.168.131:6379   1 => 10.10.137.195:6379   
7    10.99.70.197:6379     1 => 10.10.120.29:6379    

cmd: kubectl exec -n kube-system cilium-5jm29 -- cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                   IPv6                 IPv4            STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                                     
9149       Enabled            Disabled          53730      k8s:io.kubernetes.pod.namespace=default       f00d::a0a:0:0:23bd   10.10.137.195   ready   
                                                           k8s:k8s-app.guestbook=redis                                                                
                                                           k8s:role=master                                                                            
                                                           k8s:zgroup=guestbook                                                                       
43334      Enabled            Disabled          47349      k8s:io.kubernetes.pod.namespace=default       f00d::a0a:0:0:a946   10.10.120.29    ready   
                                                           k8s:k8s-app.guestbook=redis                                                                
                                                           k8s:role=slave                                                                             
                                                           k8s:zgroup=guestbook                                                                       
55159      Disabled           Disabled          52387      k8s:io.kubernetes.pod.namespace=kube-system   f00d::a0a:0:0:d777   10.10.197.21    ready   
                                                           k8s:k8s-app=kube-dns                                                                       
61892      Disabled           Disabled          4          reserved:health                               f00d::a0a:0:0:f1c4   10.10.131.135   ready   

cmd: kubectl exec -n kube-system cilium-8qm0c -- cilium service list
ID   Frontend              Backend                   
1    10.96.0.1:443         1 => 192.168.36.11:6443   
2    10.96.0.10:53         1 => 10.10.197.21:53      
6    10.110.168.131:6379   1 => 10.10.137.195:6379   
7    10.99.70.197:6379     1 => 10.10.120.29:6379    

cmd: kubectl exec -n kube-system cilium-8qm0c -- cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])               IPv6                 IPv4           STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                                
55704      Enabled            Disabled          35471      k8s:io.kubernetes.pod.namespace=default   f00d::a0f:0:0:d998   10.15.55.223   ready   
                                                           k8s:k8s-app.guestbook=web                                                             
                                                           k8s:zgroup=guestbook                                                                  
55943      Disabled           Disabled          4          reserved:health                           f00d::a0f:0:0:da87   10.15.30.136   ready   

===================== EXITING REPORT GENERATION =====================

Logs:
9acb0442_K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example.zip

None, I'm afraid. I was hoping it was a CPU load issue, and would go away once we stopped running monitor. Alas, I am wrong.
Worse, the kube-dns "verbose" output seems no more verbose than before...

yay! another! https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/3732/testReport/junit/k8s-1/7/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/

Ironically, we thought misconfigured podCIDR might have caused this. clearly not! (the PR fixed the podCIDR)

tl;dr kubedns sees a 5 minute no-data period that only ends after resyncPeriod ends (it's hardocded to 5 minutes). I'll make a PR to paper over this by upping our timeout but it would be better to fix the real issue :/

@aanm found another failure but this time it had DNS logs! The problem is related to kubedns timing-out on events from kube-apiserver. We still don't know why it happened.
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/3781/testReport/junit/k8s-1/10/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/

18d2fe90_K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example.zip

It happened on K8s 1.10 so it isn't restricted to 1.7.

The kubedns pod is on k8s1 with cilium-qk6bq:

ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                   IPv6                 IPv4          STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                                   
19190      Disabled           Disabled          13221      k8s:io.kubernetes.pod.namespace=kube-system   f00d::a0a:0:0:4af6   10.10.0.165   ready   
                                                           k8s:k8s-app=kube-dns

Looking at the logs (mostly kubedns and cilium-qk6bq):

  • the redis-master service has a creationtimestamp of 2018-06-04T11:49:10Z
  • we apply the web policy
STEP: Apply policy to web
time="2018-06-04T11:49:31Z" level=info msg="Performing apply action on resource '/home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/guestbook-policy-web.yaml'" testName=K8sPolicyTest
time="2018-06-04T11:49:31Z" level=debug msg="running command: kubectl exec -n kube-system cilium-qk6bq -- cilium policy get -o json"
  • We apply the redis policy
STEP: Apply policy to Redis
time="2018-06-04T11:49:40Z" level=info msg="Performing apply action on resource '/home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/guestbook-policy-redis.json'" testName=K8sPolicyTest
time="2018-06-04T11:49:40Z" level=debug msg="running command: kubectl exec -n kube-system cilium-qk6bq -- cilium policy get -o json"
  • EP 19190 skips a regenerate that isn't related to it (probably the redis policy):
2018-06-04T11:49:41.167430422Z level=debug msg="Endpoint policy update skipped because no changes were needed" code=OK containerID=5ea1eaa11f endpointID=19190 endpointState=ready ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=46 type=100
  • EP 19190 sees its final regenerate at
2018-06-04T11:53:36.07765431Z level=debug msg="BPF header file hashed (was: \"17fc49a46d4c1ce84cbb323336feef6c\")" bpfHeaderfileHash=17fc49a46d4c1ce84cbb323336feef6c containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077723112Z level=debug msg="Waiting for proxy updates to complete..." containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077767388Z level=debug msg="Wait time for proxy updates: 59.356µs" containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077812572Z level=debug msg="BPF header file unchanged, skipping BPF compilation and installation" bpfHeaderfileHash=17fc49a46d4c1ce84cbb323336feef6c containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077854667Z level=debug msg="Waiting for proxy updates to complete..." containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077897554Z level=debug msg="Wait time for proxy updates: 54.908µs" containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.077961236Z level=info msg="Regeneration of BPF program has completed" buildDuration=17.821486ms containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.078055779Z level=info msg="Endpoint policy recalculated" containerID=5ea1eaa11f endpointID=19190 ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49
2018-06-04T11:53:36.078106029Z level=debug msg="Completed endpoint regeneration with no pending regeneration requests" code=OK containerID=5ea1eaa11f endpointID=19190 endpointState=ready ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49 type=0
2018-06-04T11:53:36.078166286Z level=debug msg="Successfully regenerated endpoint program due to endpoint policy updated & changes were needed" code=OK containerID=5ea1eaa11f endpointID=19190 endpointState=ready ipv4=10.10.0.165 ipv6="f00d::a0a:0:0:4af6" k8sPodName="kube-system:kube-dns-59d8c5f9b5-mn4cb" policyRevision=49 type=200
2018-06-04T11:53:36.080540049Z level=debug msg="Shutting down controller for stopForUpdate; skipping stop function" name=sync-policymap-19190 subsys=controller uuid=e9b3f7d6-67ed-11e8-a0eb-080027aa233b
  • we fail the test at 2018-06-04T11:54:48Z
  • The kubedns timeout happens at (see https://github.com/kubernetes/dns/blob/80fdd88276adba36a87c4f424b66fdf37cd7c9a8/pkg/dns/dns.go#L53)
E0604 11:55:04.992153       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.10.0.165:47362->10.96.0.1:443: read: connection timed out
E0604 11:55:04.992530       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.10.0.165:47362->10.96.0.1:443: read: connection timed out
  • Right after the timeout, the DNS server finally receives the redis-master service information:
I0604 11:55:05.025262       1 dns.go:304] New service: redis-master
I0604 11:55:05.025266       1 dns.go:305] Service details: &Service{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:redis-master,GenerateName:,Namespace:default,SelfLink:/api/v1/namespaces/default/services/redis-master,UID:4b9bc3d6-67ed-11e8-9d23-080027aa233b,ResourceVersion:7500,Generation:0,CreationTimestamp:2018-06-04 11:49:10 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{k8s-app.guestbook: redis,role: master,zgroup: guestbook,},Annotations:map[string]string{kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"k8s-app.guestbook":"redis","role":"master","zgroup":"guestbook"},"name":"redis-master","namespace":"default"},"spec":{"ports":[{"port":6379,"targetPort":"redis-server"}],"selector":{"k8s-app.guestbook":"redis","role":"master"}}}
,},OwnerReferences:[],Finalizers:[],ClusterName:,},Spec:ServiceSpec{Ports:[{ TCP 6379 {1 0 redis-server} 0}],Selector:map[string]string{k8s-app.guestbook: redis,role: master,},ClusterIP:10.104.17.194,Type:ClusterIP,ExternalIPs:[],DeprecatedPublicIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[],},},}
I0604 11:55:05.025299       1 util.go:68] Constructed new DNS record: &{10.104.17.194 0 10 10  false 30 0  }, hash:aa58de7b
I0604 11:55:05.025310       1 util.go:68] Constructed new DNS record: &{redis-master.default.svc.cluster.local. 0 10 10  false 30 0  }, hash:606d0bd5

:`( I guess I was wrong then. Failure 2717 waited 8 minutes for DNS to update... so #4370 has only slowed down the failing.

Sigh... this is so bizarre. What if we try kicking / restarting kube-dns up to 3 times until it gets into a good state? We can see how long it usually takes for DNS to resolve when kube-dns is in a good state, and then only wait a bit longer than that so we don't wait for say 15 minutes before failing tests.

I gave up trying to figure this out directly. I'm trying to fixup the monitor cpu usage so we can turn it back on in tests (that was why we turned it off, right?) Perhaps we'll see something there that makes sense :/

We could kick kubedns repeatedly. We could do it before each test even... @ianvernon Do you know why the dns manifest has cpu/memory request limits? They are quite small (there are no absolute limits so I think this just slows down how quickly it can burst, but I'm not sure).

I've just checked on the update one:

https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2732/testReport/junit/k8s-1/10/K8sValidatedUpdates_Updating_Cilium_stable_to_master/

And looks like that after the update cilium service didn't is correct at all.

time="2018-06-07T06:43:00Z" level=debug msg="running command: dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'"
cmd: "dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'" exitCode: 1 

time="2018-06-07T06:43:15Z" level=debug msg="running command: dig +tcp app1-service.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp app1-service.default.svc.cluster.local @10.96.0.10" exitCode: 9 
 ;; Connection to 10.96.0.10#53(10.96.0.10) for app1-service.default.svc.cluster.local failed: connection refused.

time="2018-06-07T06:43:16Z" level=debug msg="running command: dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'"
cmd: "dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'" exitCode: 1 

time="2018-06-07T06:43:31Z" level=debug msg="running command: dig +tcp app1-service.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp app1-service.default.svc.cluster.local @10.96.0.10" exitCode: 9 
 ;; Connection to 10.96.0.10#53(10.96.0.10) for app1-service.default.svc.cluster.local failed: connection refused.

Hmm in #4405 I saw a failure that's wierd.

f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master (1).zip

dig returns 9, which is no reply from server. Oddly there are 15s between the dig commands. In our code they happen back to back. This is really fishy. Edit: nevermind, this was just 3 DNS attempts with 5s timeouts

time="2018-06-07T12:11:39Z" level=debug msg="running command: dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'"
cmd: "dig +short app1-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^;'" exitCode: 1 stdout:

time="2018-06-07T12:11:54Z" level=debug msg="running command: dig +tcp app1-service.default.svc.cluster.local @10.96.0.10"
cmd: "dig +tcp app1-service.default.svc.cluster.local @10.96.0.10" exitCode: 9 stdout:

but kubedns saw that specific request, actually! (probably)

I0607 12:11:39.095446       1 dns.go:637] Query for "app1-service.default.svc.cluster.local.", exact: false
I0607 12:11:39.095473       1 dns.go:866] Not a federation query: len(["app1-service" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:11:39.095486       1 dns.go:757] Found 1 records for [local cluster svc default app1-service] in the cache
I0607 12:11:39.095490       1 dns.go:764] getRecordsForPath retval=[{Host:10.108.31.44 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/app1-service/3561393439653439}], path=[local cluster svc default app1-service]
I0607 12:11:39.095500       1 dns.go:666] Records for app1-service.default.svc.cluster.local.: [{10.108.31.44 0 10 10  false 30 0  /skydns/local/cluster/svc/default/app1-service/3561393439653439}]

and had the DNS data already

I0607 12:11:23.712068       1 dns.go:304] New service: app1-service

Edit:
In fact, it only logs UDP DNS queries. The TCP lookups aren't showing up but the next UDP lookup from above does:

I0607 12:12:24.075910       1 dns.go:637] Query for "kubernetes.default.svc.cluster.local.", exact: false
I0607 12:12:24.075934       1 dns.go:866] Not a federation query: len(["kubernetes" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:12:24.075952       1 dns.go:757] Found 1 records for [local cluster svc default kubernetes] in the cache
I0607 12:12:24.075960       1 dns.go:764] getRecordsForPath retval=[{Host:10.96.0.1 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/kubernetes/6266366134663036}], path=[local cluster svc default kubernetes]
I0607 12:12:24.075974       1 dns.go:666] Records for kubernetes.default.svc.cluster.local.: [{10.96.0.1 0 10 10  false 30 0  /skydns/local/cluster/svc/default/kubernetes/6266366134663036}]
I0607 12:12:24.117509       1 dns.go:637] Query for "app1-service.default.svc.cluster.local.", exact: false
I0607 12:12:24.117532       1 dns.go:866] Not a federation query: len(["app1-service" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:12:24.117545       1 dns.go:757] Found 1 records for [local cluster svc default app1-service] in the cache
I0607 12:12:24.117550       1 dns.go:764] getRecordsForPath retval=[{Host:10.108.31.44 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/app1-service/3561393439653439}], path=[local cluster svc default app1-service]
I0607 12:12:24.117560       1 dns.go:666] Records for app1-service.default.svc.cluster.local.: [{10.108.31.44 0 10 10  false 30 0  /skydns/local/cluster/svc/default/app1-service/3561393439653439}]

@eloycoto we have CPU/memory metrics for the jenkins servers, right? Could they be overloaded? it's really odd that the logs just freeze :/ Did the 5GB RAM per VM change apply to the CI ones too? and could we be swapping on the host?

Just a note about this last failure:

There is a cilium agent that it is in a weird state:

Cilium rkcdv status:

KVStore:                Ok   etcd: 1/1 connected: http://k8s1:9732 - 3.1.0 (Leader)
ContainerRuntime:       Ok
Kubernetes:             Ok   OK
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint", "core/v1::Node"]
Cilium:                 Ok   OK
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
IPv4 address pool:      6/255 allocated
IPv6 address pool:      6/4294967295 allocated
Controller Status:      30/30 healthy
  Name                                        Last success   Last error   Count   Message
  sync-to-k8s-ciliumendpoint (64189)          9s ago         never        0       no error
  sync-identity-to-k8s-pod (38061)            53s ago        never        0       no error
  sync-IPv4-identity-mapping (13949)          3m55s ago      never        0       no error
  sync-identity-to-k8s-pod (29381)            51s ago        never        0       no error
  sync-to-k8s-ciliumendpoint (29898)          5s ago         never        0       no error
  sync-IPv6-identity-mapping (29898)          4m6s ago       never        0       no error
  resolve-identity-29381                      3m55s ago      never        0       no error
  sync-cnp-policy-status (v2 default/rule1)   8m41s ago      never        0       no error
  sync-IPv6-identity-mapping (13949)          3m55s ago      never        0       no error
  sync-IPv4-identity-mapping (29381)          3m54s ago      never        0       no error
  sync-IPv4-identity-mapping (64189)          3m59s ago      never        0       no error
  sync-IPv6-identity-mapping (38061)          3m56s ago      never        0       no error
  kvstore-etcd-session-renew                  never          never        0       no error
  resolve-identity-29898                      4m6s ago       never        0       no error
  sync-identity-to-k8s-pod (64189)            58s ago        never        0       no error
  sync-IPv6-identity-mapping (64189)          3m59s ago      never        0       no error
  kvstore-lease-keepalive                     4m10s ago      never        0       no error
  sync-IPv4-identity-mapping (29898)          4m6s ago       never        0       no error
  resolve-identity-38061                      3m56s ago      never        0       no error
  sync-to-k8s-ciliumendpoint (29381)          8s ago         never        0       no error
  sync-to-k8s-ciliumendpoint (13949)          7s ago         never        0       no error
  sync-to-k8s-ciliumendpoint (38061)          8s ago         never        0       no error
  sync-IPv4-identity-mapping (38061)          3m56s ago      never        0       no error
  sync-to-k8s-ciliumendpoint-gc (k8s1)        4s ago         never        0       no error
  resolve-identity-64189                      3m59s ago      never        0       no error
  resolve-identity-13949                      3m55s ago      never        0       no error
  sync-identity-to-k8s-pod (13949)            49s ago        8m53s ago    0       no error
  sync-IPv6-identity-mapping (29381)          3m54s ago      never        0       no error
  ipcache-bpf-garbage-collection              4m8s ago       never        0       no error
  sync-identity-to-k8s-pod (29898)            6s ago         never        0       no error
Proxy Status:   OK, ip 10.10.0.44, port-range 10000-20000
Cluster health:      0/2 reachable   (2018-06-07T12:14:03Z)
  Name               IP              Reachable   Endpoints reachable
  k8s1 (localhost)   192.168.36.11   true        false
  k8s2               192.168.36.12   true        false

Can't connect to the local endpoints? And if you see the DNS is on node k8s2, so
looks like that the service is not working accross the tunnel at all.

The dig is from k8s1 to the service(that it's localted in k8s2.)

If I see the trace, on restore the endpoint I saw this message, why?

2018-06-07T12:05:07.531354464Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=8f18b7c888 endpointID=64189 ingress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [k8s:zgroup=testapp k8s:io.kubernetes.pod.namespace=default k8s:id=app1] []}" policyID=8333 policyRevision=0
2018-06-07T12:05:07.531396295Z level=debug msg="Evaluating egress context for source PolicyID" containerID=8f18b7c888 egress_context="{0 0 <nil> [k8s:zgroup=testapp k8s:io.kubernetes.pod.namespace=default k8s:id=app1] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" endpointID=64189 policyID=8333 policyRevision=0
2018-06-07T12:05:07.531428783Z level=debug msg="egress verdict: denied" endpointID=64189 labels="[k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system]" policyID=8333

Plus, this endpoint has a lot of logs:

Endpoint Get 16935

[
  {
    "id": 16935,
    "spec": {
      "label-configuration": {
        "user": []
      },
      "options": {
        "Conntrack": "Enabled",
        "ConntrackAccounting": "Enabled",
        "ConntrackLocal": "Disabled",
        "Debug": "Enabled",
        "DebugLB": "Disabled",
        "DropNotification": "Enabled",
        "EgressPolicy": "Disabled",
        "IngressPolicy": "Disabled",
        "NAT46": "Disabled",
        "TraceNotification": "Enabled"
      }
    },
    "status": {
      "controllers": [
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "sync-IPv4-identity-mapping (16935)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T12:09:51.761Z",
            "success-count": 2
          },
          "uuid": "fbad37e4-6a4a-11e8-853b-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "sync-IPv6-identity-mapping (16935)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T12:09:51.996Z",
            "success-count": 2
          },
          "uuid": "fbad3884-6a4a-11e8-853b-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "1m0s"
          },
          "name": "sync-identity-to-k8s-pod (16935)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T12:13:52.093Z",
            "success-count": 10
          },
          "uuid": "fbad3761-6a4a-11e8-853b-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "10s"
          },
          "name": "sync-to-k8s-ciliumendpoint (16935)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T12:13:52.644Z",
            "success-count": 55
          },
          "uuid": "fbac7a27-6a4a-11e8-853b-080027aa233b"
        }
      ],
      "external-identifiers": {
        "container-id": "e3179b830e0db851551f778b7b8973be1210a58e4ff6cc8ee0d6a9610691d298",
        "container-name": "k8s_POD_kube-dns-66cc9bd784-wrl8s_kube-system_b0a24982-6a46-11e8-8cfe-080027aa233b_0",
        "pod-name": "kube-system:kube-dns-66cc9bd784-wrl8s"
      },
      "health": {
        "bpf": "OK",
        "connected": true,
        "overallHealth": "OK",
        "policy": "OK"
      },
      "identity": {
        "id": 8333,
        "labels": [
          "k8s:io.kubernetes.pod.namespace=kube-system",
          "k8s:k8s-app=kube-dns"
        ],
        "labelsSHA256": "1b2bafe403e0d433f8307836b07ade2cf802511501f880917e128b7f663ac313"
      },
      "labels": {
        "derived": [
          "container:annotation.kubernetes.io/config.seen=2018-06-07T11:41:20.811938219Z",
          "container:annotation.kubernetes.io/config.source=api",
          "container:annotation.scheduler.alpha.kubernetes.io/critical-pod",
          "container:io.kubernetes.container.name=POD",
          "container:io.kubernetes.docker.type=podsandbox",
          "container:io.kubernetes.pod.name=kube-dns-66cc9bd784-wrl8s",
          "container:io.kubernetes.pod.uid=b0a24982-6a46-11e8-8cfe-080027aa233b",
          "k8s:pod-template-hash=2277568340"
        ],
        "disabled": [],
        "realized": {
          "user": []
        },
        "security-relevant": [
          "k8s:io.kubernetes.pod.namespace=kube-system",
          "k8s:k8s-app=kube-dns"
        ]
      },
      "log": [
        {
          "code": "OK",
          "message": "Successfully regenerated endpoint program due to endpoint policy updated & changes were needed",
          "state": "ready",
          "timestamp": "2018-06-07T12:09:58Z"
        }
      ],
      "networking": {
        "addressing": [
          {
            "ipv4": "10.10.1.131",
            "ipv6": "f00d::a0a:100:0:4227"
          }
        ],
        "host-mac": "4a:9e:c7:f9:d2:ab",
        "interface-index": 14,
        "interface-name": "lxce3179",
        "mac": "ae:54:f4:d0:31:43"
      },
      "policy": {
        "proxy-statistics": [],
        "realized": {
          "allowed-egress-identities": [],
          "allowed-ingress-identities": [
            1
          ],
          "build": 30,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "id": 8333,
          "l4": {
            "egress": [],
            "ingress": []
          },
          "policy-enabled": "none",
          "policy-revision": 30
        },
        "spec": {
          "allowed-egress-identities": [],
          "allowed-ingress-identities": [
            1
          ],
          "build": 30,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "id": 8333,
          "l4": {
            "egress": [],
            "ingress": []
          },
          "policy-enabled": "none",
          "policy-revision": 30
        }
      },
      "realized": {
        "label-configuration": {
          "user": []
        },
        "options": {
          "Conntrack": "Enabled",
          "ConntrackAccounting": "Enabled",
          "ConntrackLocal": "Disabled",
          "Debug": "Enabled",
          "DebugLB": "Disabled",
          "DropNotification": "Enabled",
          "EgressPolicy": "Disabled",
          "IngressPolicy": "Disabled",
          "NAT46": "Disabled",
          "TraceNotification": "Enabled"
        }
      },
      "state": "ready"
    }
  }
]


Endpoint Log 16935

Timestamp              Status    State                   Message
2018-06-07T12:09:58Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:09:58Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:09:56Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:09:56Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T12:05:22Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:05:22Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:05:21Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:05:21Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T12:04:58Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:04:58Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:04:56Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:04:56Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T12:04:56Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:04:56Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:04:54Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:04:54Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T12:04:53Z   OK        ready                   Successfully regenerated endpoint program due to syncing state to host
2018-06-07T12:04:53Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:04:51Z   OK        regenerating            Regenerating Endpoint BPF: syncing state to host
2018-06-07T12:04:51Z   OK        waiting-to-regenerate   Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T12:04:51Z   OK        restoring               Synchronizing endpoint labels with KVStore
2018-06-07T12:04:51Z   OK        restoring               Restoring endpoint from previous cilium instance
2018-06-07T12:01:24Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:01:24Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T12:01:17Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T12:01:08Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T12:01:08Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T12:01:08Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T12:01:05Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T12:01:04Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:59:51Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:33Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:26Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:26Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:08Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:59:00Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:58:51Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:51Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:51Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:51Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:50Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:50Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:50Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:50Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:49Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:49Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:49Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:49Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:48Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:48Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:48Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:48Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:47Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:47Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:47Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:47Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:58:44Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:58:44Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:58:44Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:58:44Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:58Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:58Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:45Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:45Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:45Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:45Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:45Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:43Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:42Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:42Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:42Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:42Z   OK        waiting-to-regenerate   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:42Z   Warning   waiting-to-regenerate   Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:42Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:42Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:42Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:42Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:57:42Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:57:42Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:57:42Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:57:35Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:20Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:57:12Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:56:53Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:56:53Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:56:53Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:56:53Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:56:52Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:56:52Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:56:52Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:56:52Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:56:52Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:56:52Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:56:52Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:56:52Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:56:00Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:55:42Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:55:25Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:55:17Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:54:59Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:54:01Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:53:43Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:53:25Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:53:06Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:52:45Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:52:45Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:52:45Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:52:45Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:52:45Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:52:03Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:51:45Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:50:53Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:50:35Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:50:12Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:46Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:28Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:20Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:49:06Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-07T11:48:02Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:48:02Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:48:02Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:48:02Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:48:01Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:48:01Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:48:01Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:48:01Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:48:01Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:48:01Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:48:01Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:48:01Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:45Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:47:45Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:47:45Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:47:45Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:44Z   OK        ready                   Successfully regenerated endpoint program due to syncing state to host
2018-06-07T11:47:44Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:47:43Z   OK        regenerating            Regenerating Endpoint BPF: syncing state to host
2018-06-07T11:47:43Z   OK        waiting-to-regenerate   Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T11:47:43Z   OK        ready                   Synchronizing endpoint labels with KVStore
2018-06-07T11:47:43Z   Failure   ready                   Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:47:43Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:47:43Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:47:42Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:42Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:42Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:47:42Z   OK        restoring               Restoring endpoint from previous cilium instance
2018-06-07T11:45:48Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:48Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:46Z   OK        ready                   Successfully regenerated endpoint program due to syncing state to host
2018-06-07T11:45:46Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:44Z   OK        regenerating            Regenerating Endpoint BPF: syncing state to host
2018-06-07T11:45:44Z   OK        waiting-to-regenerate   Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T11:45:44Z   OK        ready                   Synchronizing endpoint labels with KVStore
2018-06-07T11:45:43Z   Failure   ready                   Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:45:43Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:43Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:43Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:43Z   Failure   ready                   Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:45:43Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:43Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:43Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:43Z   Failure   ready                   Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:45:43Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:45:43Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:45:43Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:45:43Z   OK        restoring               Restoring endpoint from previous cilium instance
2018-06-07T11:43:54Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:43:53Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:53Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:43:53Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:43:53Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:43:53Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:50Z   OK        ready                   Successfully regenerated endpoint program due to syncing state to host
2018-06-07T11:43:50Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:43:49Z   OK        regenerating            Regenerating Endpoint BPF: syncing state to host
2018-06-07T11:43:49Z   OK        waiting-to-regenerate   Triggering synchronous endpoint regeneration while syncing state to host
2018-06-07T11:43:49Z   OK        ready                   Synchronizing endpoint labels with KVStore
2018-06-07T11:43:49Z   Failure   ready                   Error regenerating endpoint: can't update network policy, proxy disabled
2018-06-07T11:43:49Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:43:49Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:43:48Z   OK        restoring               Restoring endpoint from previous cilium instance
2018-06-07T11:41:42Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:41:41Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:41Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:25Z   OK        ready                   Successfully regenerated endpoint program due to updated security labels
2018-06-07T11:41:25Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:41:24Z   OK        regenerating            Regenerating Endpoint BPF: updated security labels
2018-06-07T11:41:24Z   OK        waiting-to-regenerate   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-07T11:41:24Z   Warning   waiting-to-regenerate   Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:41:23Z   OK        waiting-to-regenerate   Triggering regeneration due to new identity
2018-06-07T11:41:23Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-07T11:41:23Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-07T11:41:22Z   OK        ready                   Successfully regenerated endpoint program due to updated security labels
2018-06-07T11:41:22Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-07T11:41:21Z   OK        regenerating            Regenerating Endpoint BPF: updated security labels
2018-06-07T11:41:21Z   OK        waiting-to-regenerate   Triggering regeneration due to new identity
2018-06-07T11:41:21Z   OK        ready                   Set identity for this endpoint

Identity get 8333

ID     LABELS
8333   [k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns]

Is the init policy again in place? Maybe is that?

The kube-dns has the request in there:

app1-service
I0607 12:05:14.942276       1 dns.go:305] Service details: &Service{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app1-service,GenerateName:,Namespace:default,SelfLink:/api/v1/namespaces/default/services/app1-service,UID:0982d6b8-6a4b-11e8-8cfe-080027aa233b,ResourceVersion:5548,Generation:0,CreationTimestamp:2018-06-07 12:05:14 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"app1-service","namespace":"default"},"spec":{"ports":[{"port":80}],"selector":{"id":"app1"}}}
,},OwnerReferences:[],Finalizers:[],ClusterName:,},Spec:ServiceSpec{Ports:[{ TCP 80 {0 80 } 0}],Selector:map[string]string{id: app1,},ClusterIP:10.108.31.44,Type:ClusterIP,ExternalIPs:[],DeprecatedPublicIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[],},},}
I0607 12:05:14.942339       1 util.go:68] Constructed new DNS record: &{10.108.31.44 0 10 10  false 30 0  }, hash:5a949e49
I0607 12:05:14.942353       1 util.go:68] Constructed new DNS record: &{app1-service.default.svc.cluster.local. 0 10 10  false 30 0  }, hash:e4edfa2d
I0607 12:05:16.047579       1 dns.go:482] No service for
I0607 12:05:38.604554       1 dns.go:637] Query for "app1-service.default.svc.cluster.local.", exact: false
I0607 12:05:38.604575       1 dns.go:866] Not a federation query: len(["app1-service" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0607 12:05:38.604588       1 dns.go:757] Found 1 records for [local cluster svc default app1-service] in the cache
I0607 12:05:38.604594       1 dns.go:764] getRecordsForPath retval=[{Host:10.108.31.44 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/app1-service/3561393439653439}], path=[local cluster svc default app1-service]
I0607 12:05:38.604617       1 dns.go:666] Records for app1-service.default.svc.cluster.local.: [{10.108.31.44 0 10 10  false 30 0  /skydns/local/cluster/svc/default/app1-service/3561393439653439}]
I0607 12:05:38.829950       1 dns.go:637] Query for "kubernetes.default.svc.cluster.local.", exact: false

I've just review&compare this one:

https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2730/

f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master.1-1.zip

And:

BPF Policy Get 57415

DIRECTION   LABELS (source:key[=value])   PORT/PROTO   BYTES   PACKETS   
Ingress     reserved:host                 ANY          0       0         

Endpoint Get 57415

[
  {
    "id": 57415,
    "spec": {
      "label-configuration": {
        "user": []
      },
      "options": {
        "Conntrack": "Enabled",
        "ConntrackAccounting": "Enabled",
        "ConntrackLocal": "Disabled",
        "Debug": "Enabled",
        "DebugLB": "Disabled",
        "DropNotification": "Enabled",
        "EgressPolicy": "Disabled",
        "IngressPolicy": "Disabled",
        "NAT46": "Disabled",
        "TraceNotification": "Enabled"
      }
    },
    "status": {
      "controllers": [
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "sync-IPv4-identity-mapping (57415)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T02:33:21.130Z",
            "success-count": 2
          },
          "uuid": "72110962-69fa-11e8-a7f4-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "sync-IPv6-identity-mapping (57415)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T02:33:21.130Z",
            "success-count": 2
          },
          "uuid": "72110cc4-69fa-11e8-a7f4-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "1m0s"
          },
          "name": "sync-identity-to-k8s-pod (57415)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T02:36:21.297Z",
            "success-count": 9
          },
          "uuid": "7211037f-69fa-11e8-a7f4-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "10s"
          },
          "name": "sync-to-k8s-ciliumendpoint (57415)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-06-07T02:36:43.965Z",
            "success-count": 51
          },
          "uuid": "72109305-69fa-11e8-a7f4-080027aa233b"
        }
      ],
      "external-identifiers": {
        "container-id": "b774fb41d748cae3d1f454765aa16af647efe3cb66b3576b49b039dc16e24f82",
        "container-name": "k8s_POD_kube-dns-1992960195-39zrf_kube-system_ffe1c5e3-69f8-11e8-bc1d-080027aa233b_0",
        "pod-name": "kube-system:kube-dns-1992960195-39zrf"
      },
      "health": {
        "bpf": "OK",
        "connected": true,
        "overallHealth": "OK",
        "policy": "OK"
      },
      "identity": {
        "id": 57084,
        "labels": [
          "k8s:io.kubernetes.pod.namespace=kube-system",
          "k8s:k8s-app=kube-dns"
        ],
        "labelsSHA256": "1b2bafe403e0d433f8307836b07ade2cf802511501f880917e128b7f663ac313"
      },
      "labels": {
        "derived": [
          "container:annotation.kubernetes.io/config.seen=2018-06-07T02:26:12.326781187Z",
          "container:annotation.kubernetes.io/config.source=api",
          container:annotation.kubernetes.io/created-by={
            "apiVersion": "v1",
            "kind": "SerializedReference",
            "reference": {
              "apiVersion": "extensions",
              "kind": "ReplicaSet",
              "name": "kube-dns-1992960195",
              "namespace": "kube-system",
              "resourceVersion": "370",
              "uid": "ffe02ef6-69f8-11e8-bc1d-080027aa233b"
            }
          }
,
          "container:annotation.scheduler.alpha.kubernetes.io/critical-pod",
          "container:io.kubernetes.container.name=POD",
          "container:io.kubernetes.docker.type=podsandbox",
          "container:io.kubernetes.pod.name=kube-dns-1992960195-39zrf",
          "container:io.kubernetes.pod.uid=ffe1c5e3-69f8-11e8-bc1d-080027aa233b",
          "k8s:pod-template-hash=1992960195"
        ],
        "disabled": [],
        "realized": {
          "user": []
        },
        "security-relevant": [
          "k8s:io.kubernetes.pod.namespace=kube-system",
          "k8s:k8s-app=kube-dns"
        ]
      },
      "log": [
        {
          "code": "OK",
          "message": "Successfully regenerated endpoint program due to endpoint policy updated & changes were needed",
          "state": "ready",
          "timestamp": "2018-06-07T02:33:27Z"
        }
      ],
      "networking": {
        "addressing": [
          {
            "ipv4": "10.10.1.179",
            "ipv6": "f00d::a0a:100:0:e047"
          }
        ],
        "host-mac": "66:ec:48:9e:67:8d",
        "interface-index": 14,
        "interface-name": "lxcb774f",
        "mac": "ca:f3:53:c2:88:2b"
      },
      "policy": {
        "proxy-statistics": [],
        "realized": {
          "allowed-egress-identities": [],
          "allowed-ingress-identities": [
            1
          ],
          "build": 7,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "id": 57084,
          "l4": {
            "egress": [],
            "ingress": []
          },
          "policy-enabled": "none",
          "policy-revision": 7
        },
        "spec": {
          "allowed-egress-identities": [],
          "allowed-ingress-identities": [
            1
          ],
          "build": 7,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "id": 57084,
          "l4": {
            "egress": [],
            "ingress": []
          },
          "policy-enabled": "none",
          "policy-revision": 7
        }
      },
      "realized": {
        "label-configuration": {
          "user": []
        },
        "options": {
          "Conntrack": "Enabled",
          "ConntrackAccounting": "Enabled",
          "ConntrackLocal": "Disabled",
          "Debug": "Enabled",
          "DebugLB": "Disabled",
          "DropNotification": "Enabled",
          "EgressPolicy": "Disabled",
          "IngressPolicy": "Disabled",
          "NAT46": "Disabled",
          "TraceNotification": "Enabled"
        }
      },
      "state": "ready"
    }
  }
]

BPF Policy enabled: Reserved:host
Policy get: none.

Logs:

2018-06-07T02:33:26.019881734Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[k8s:id=app3 k8s:io.kubernetes.pod.namespace=default k8s:zgroup=testapp]" policyID=35564
2018-06-07T02:33:26.019883807Z level=debug msg="Evaluating context for destination PolicyID" containerID=cilium-loc ctx="{0 0 <nil> [reserved:health] [reserved:host] []}" endpointID=29898 policyID=host policyRevision=6
2018-06-07T02:33:26.01988918Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=b774fb41d7 endpointID=57415 ingress_context="{0 0 <nil> [k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" policyID=57084 policyRevision=6
2018-06-07T02:33:26.019891684Z level=debug msg="Evaluating egress context for source PolicyID" containerID=b774fb41d7 egress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns] []}" endpointID=57415 policyID=57084 policyRevision=6
2018-06-07T02:33:26.019893985Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns]" policyID=57084
2018-06-07T02:33:26.019896457Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=b774fb41d7 endpointID=57415 ingress_context="{0 0 <nil> [reserved:host] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" policyID=host policyRevision=6
2018-06-07T02:33:26.019912356Z level=debug msg="Evaluating egress context for source PolicyID" containerID=b774fb41d7 egress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [reserved:host] []}" endpointID=57415 policyID=host policyRevision=6
2018-06-07T02:33:26.019914571Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[reserved:host]" policyID=host
2018-06-07T02:33:26.019916556Z level=debug msg="Evaluating ingress context for source PolicyID" containerID=b774fb41d7 endpointID=57415 ingress_context="{0 0 <nil> [reserved:world] [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] []}" policyID=world policyRevision=6
2018-06-07T02:33:26.019930435Z level=debug msg="Evaluating egress context for source PolicyID" containerID=b774fb41d7 egress_context="{0 0 <nil> [k8s:k8s-app=kube-dns k8s:io.kubernetes.pod.namespace=kube-system] [reserved:world] []}" endpointID=57415 policyID=world policyRevision=6
2018-06-07T02:33:26.019932762Z level=debug msg="Evaluating context for destination PolicyID" containerID=cilium-loc ctx="{0 0 <nil> [reserved:health] [reserved:world] []}" endpointID=29898 policyID=world policyRevision=6
2018-06-07T02:33:26.019934872Z level=debug msg="egress verdict: undecided" endpointID=57415 labels="[reserved:world]" policyID=world

And this can explain why not always fail, because if the kube-dns is on k8s2, the k8s1 dig cannot get access to that.

So, I finally saw a test fail on https://github.com/cilium/cilium/pull/4406 (this tested undoing the MTU changes). The errors are a little different, and the branch doesn't include adde5de7053849bf0475c9047f40b6d20a04316c but that isn't terribly relevant.

tl;dr we seem to be missing routes to clusterIPs, so kubedns can't connect to the kubeapiserver. Cilium can, however, so I'm not sure I interpreted this correctly.

f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master.zip

The most interesting error is from kubedns.log
E0612 19:23:34.370631 143 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:192: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.96.0.1:443: getsockopt: no route to host

Note that kubedns is running on k8s2 but the apiserver is on k8s1. This is what @eloycoto pointed out above. I had hoped it was because of the MTU stuff but it isn't. It's also worth noting that kubedns is the only pod with restartCount >0. kubedns: 5, sidecar:4 and dnsmasq: 4.
The cilium pod on k8s2 is cilium-490kg.
Neither cilium has the kubedns in the lb map.

The cilium-health report is also bad, like what @eloycoto saw:

cilium-health status

Probe time:   2018-06-12T19:23:25Z
Nodes:
  k8s2 (localhost):
    Host connectivity to 192.168.36.12:
      ICMP:          OK, RTT=409.49µs
      HTTP via L3:   OK, RTT=329.567µs
    Endpoint connectivity to 10.10.1.172:
      ICMP:          Connection timed out
      HTTP via L3:   Connection timed out
  k8s1:
    Host connectivity to 192.168.36.11:
      ICMP:          OK, RTT=662.562µs
      HTTP via L3:   OK, RTT=1.031337ms
    Endpoint connectivity to 10.10.0.172:
      ICMP:          Connection timed out
      HTTP via L3:   Connection timed out

Oddly, cilium doesn't know about kubedns

ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                   IPv4          STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                     
29898      Disabled           Disabled          25244      reserved:health               f00d::a0a:100:0:74ca   10.10.1.172   ready   

We seem to be missing the kube-apiserver cluster IP route (clusterIP: 10.96.0.1, backend: 192.168.36.11). I'm not sure why. The podCIDRs for the nodes are 10.10.0.0/24 and 10.10.1.0/24.

k8s2 ip r

default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100 
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15 
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100 
10.10.0.0/24 via 10.10.1.44 dev cilium_host src 10.10.1.44 
10.10.1.0/24 via 10.10.1.44 dev cilium_host src 10.10.1.44 
10.10.1.44 dev cilium_host scope link 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.28.128.0/24 dev enp0s9 proto kernel scope link src 172.28.128.5 
192.168.36.0/24 dev enp0s8 proto kernel scope link src 192.168.36.12 

k8s1 ip r

default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100 
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15 
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100 
10.10.0.0/24 via 10.10.0.44 dev cilium_host src 10.10.0.44 
10.10.0.44 dev cilium_host scope link 
10.10.1.0/24 via 10.10.0.44 dev cilium_host src 10.10.0.44 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.28.128.0/24 dev enp0s9 proto kernel scope link src 172.28.128.4 
192.168.36.0/24 dev enp0s8 proto kernel scope link src 192.168.36.11 

Looking through the logs for 172.17 we see

$ grep -E '172.17' cilium-490kg-logs.log
2018-06-12T19:15:20.565200637Z level=debug msg="Considering route" route="{Ifindex: 5 Dst: 172.17.0.0/16 Src: 172.17.0.1 Gw: <nil> Flags: [] Table: 254}"
2018-06-12T19:15:20.568913369Z level=debug msg="Considering removing iptables rule" obj="-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE"
2018-06-12T19:15:20.568923164Z level=debug msg="Considering removing iptables rule" obj="-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 5000 -j MASQUERADE"
2018-06-12T19:15:20.576692598Z level=debug msg="Considering removing iptables rule" obj="-A DOCKER ! -i docker0 -p tcp -m tcp --dport 5000 -j DNAT --to-destination 172.17.0.2:5000"
2018-06-12T19:15:20.594023875Z level=debug msg="Considering removing iptables rule" obj="-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5000 -j ACCEPT"
2018-06-12T19:15:21.990182549Z level=debug msg="Skipping network because of gateway mismatch" endpointID=e78e9f05f07d266375ad9fc29fb94197bc492bed7dd83cda5c1a8c2362d4414e obj="&{<nil> [] [] ae6644e1a2b3b1c607dc92bd29bdc96e5196b32414d1784d383063693cc4548b e78e9f05f07d266375ad9fc29fb94197bc492bed7dd83cda5c1a8c2362d4414e 172.17.0.1 172.17.0.2 16   0 02:42:ac:11:00:02}" subsys=containerd-watcher

and looking for that message
pkg/workloads/docker.go:380: scopedLog.WithField(logfields.Object, contNetwork).Debug("Skipping network because of gateway mismatch")
Oddly, that happens in getCiliumIPv6 and that is used when deciding to ignore workloads in pkg/workloads/docker.go.

10.96 is mentioned in a similar context:

2018-06-12T19:15:20.577310354Z level=debug msg="Considering removing iptables rule" obj="-A KUBE-SERVICES ! -s 10.10.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment \"default/kubernetes:https cluster IP\" -m tcp --dport 443 -j KUBE-MARK-MASQ"
2018-06-12T19:15:20.577394862Z level=debug msg="Considering removing iptables rule" obj="-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment \"default/kubernetes:https cluster IP\" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y"

but has no skip/remove log

@eloycoto could this be a conflict between a hardcoded podCIDR CLI option to cilium and a derived value from k8s?

Today I got this one in my laptop:

Traces:
K8s1:k8s1-monitor.log
k8s2:k8s2-monitor.log

First Sysdymp:
cilium-sysdump-20180615-131816.zip
Second Sysdump:
cilium-sysdump-20180615-134008.zip

If you see the list of pods, the 10.10.1.66 IP is duplicated:

Every 2.0s: kubectl get pods --all-namespaces -o wide                                                                 

NAMESPACE     NAME                           READY     STATUS             RESTARTS   AGE       IP              NODE   
default       testclient-gddnl               1/1       Running            0          20m       10.10.0.32      k8s1   
default       testclient-zz5p6               1/1       Running            0          20m       10.10.1.32      k8s2   
default       testds-2mvvn                   1/1       Running            0          20m       10.10.1.66      k8s2   
default       testds-4fsvg                   1/1       Running            0          20m       10.10.0.66      k8s1   
kube-system   cilium-5zhqx                   1/1       Running            0          24m       192.168.36.11   k8s1   
kube-system   cilium-db2mx                   1/1       Running            0          24m       192.168.36.12   k8s2   
kube-system   etcd-k8s1                      1/1       Running            1          1h        192.168.36.11   k8s1   
kube-system   kube-apiserver-k8s1            1/1       Running            1          1h        192.168.36.11   k8s1   
kube-system   kube-controller-manager-k8s1   1/1       Running            1          1h        192.168.36.11   k8s1   
kube-system   kube-dns-59d8c5f9b5-7m225      2/3       CrashLoopBackOff   39         1h        10.10.1.66      k8s2   
kube-system   kube-proxy-5tfmw               1/1       Running            1          1h        192.168.36.12   k8s2   
kube-system   kube-proxy-zclzl               1/1       Running            1          1h        192.168.36.11   k8s1   
kube-system   kube-scheduler-k8s1            1/1       Running            1          1h        192.168.36.11   k8s1   

1) Clean the CT and it didn't work.
2) 2/3 kube-dns are created, but is not in the endpoint list.
3) Testing connectivity from k8s1 to k8s2 and it works (It's not a vbox problem)
4) Test connectivity from k8s2 to k8s1 api and it's works:

vagrant@k8s1:~$ kubectl exec -ti testclient-zz5p6 -- curl -v https://10.96.0.1:443 --insecure            
* Rebuilt URL to: https://10.96.0.1:443/                                                                 
*   Trying 10.96.0.1...                                                                                  
* TCP_NODELAY set                                                                                        
* Connected to 10.96.0.1 (10.96.0.1) port 443 (#0)                                                       
* ALPN, offering http/1.1                                                                                
* ignoring certificate verify locations due to disabled peer verification                                
* TLSv1.2 (OUT), TLS handshake, Client hello (1):                                                        
* TLSv1.2 (IN), TLS handshake, Server hello (2):                                                         
* TLSv1.2 (IN), TLS handshake, Certificate (11):                                                         
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):                                                 
* TLSv1.2 (IN), TLS handshake, Request CERT (13):                                                        
* TLSv1.2 (IN), TLS handshake, Server finished (14):                                                     
* TLSv1.2 (OUT), TLS handshake, Certificate (11):                                                        
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):                                                
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):                                                    
* TLSv1.2 (OUT), TLS handshake, Finished (20):                                                           
* TLSv1.2 (IN), TLS change cipher, Client hello (1):                                                     
* TLSv1.2 (IN), TLS handshake, Finished (20):                                                            
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256                                             
* ALPN, server accepted to use http/1.1                                                                  
* Server certificate:                                                                                    
*  subject: CN=kube-apiserver                                                                            
*  start date: Jun 15 11:54:08 2018 GMT                                                                  
*  expire date: Jun 15 11:54:08 2019 GMT                                                                 
*  issuer: CN=kubernetes                                                                                 
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.        
> GET / HTTP/1.1                                                                                         
> Host: 10.96.0.1                                                                                        
> User-Agent: curl/7.59.0                                                                                
> Accept: */*                                                                                            
>                                                                                                        
< HTTP/1.1 403 Forbidden                                                                                 
< Content-Type: application/json                                                                         
< X-Content-Type-Options: nosniff                                                                        
< Date: Fri, 15 Jun 2018 13:38:58 GMT                                                                    
< Content-Length: 233                                                                                    
<                                                                                                        
{                                                                                                        
  "kind": "Status",                                                                                      
  "apiVersion": "v1",                                                                                    
  "metadata": {                                                                                          

  },                                                                                                     
  "status": "Failure",                                                                                   
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",                               
  "reason": "Forbidden",                                                                                 
  "details": {                                                                                           

  },                                                                                                     
  "code": 403                                                                                            
* Connection #0 to host 10.96.0.1 left intact                                                            
}vagrant@k8s1:~$                                                                                         

The message from kube-dns is still that can't reach 10.96.0.1:443

After seen this one in my laptop, I think that it's not a datapath issue at all. The endpoint trying to regenerate and always fails, the lookup to endpoint manager is always nil, and ep=nil:

https://github.com/cilium/cilium/blob/d64c17727218f5a938497c251753cfd96685f884/pkg/workloads/docker.go#L438-L448

More logs:


vagrant@k8s2:~$ docker ps -a
CONTAINER ID        IMAGE                         COMMAND                   CREATED              STATUS                         PORTS                    NAMES
bcd18fdf5852        8a7739f672b4                  "/sidecar --v=2 --lo…"    About a minute ago   Exited (2) 25 seconds ago                               k8s_sidecar_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_22
cf262dd02909        55ffe31ac578                  "/kube-dns --domain=…"    4 minutes ago        Exited (255) 3 minutes ago                              k8s_kubedns_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_17
7dde69c74df2        6816817d9dce                  "/dnsmasq-nanny -v=5…"    4 minutes ago        Exited (137) 2 minutes ago                              k8s_dnsmasq_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_20
77433b91049c        cilium/demo-client            "sleep 1000h"             About an hour ago    Up About an hour                                        k8s_web_testclient-zz5p6_default_47b0a741-709e-11e8-b8cf-0800274fedc6_0
d6207c6a6cb1        cilium/demo-httpd             "httpd-foreground"        About an hour ago    Up About an hour                                        k8s_web_testds-2mvvn_default_47a7ac1a-709e-11e8-b8cf-0800274fedc6_0
ba7749c5d5c2        k8s.gcr.io/pause-amd64:3.1    "/pause"                  About an hour ago    Up About an hour                                        k8s_POD_testclient-zz5p6_default_47b0a741-709e-11e8-b8cf-0800274fedc6_0
ede995f28e11        k8s.gcr.io/pause-amd64:3.1    "/pause"                  About an hour ago    Up About an hour                                        k8s_POD_testds-2mvvn_default_47a7ac1a-709e-11e8-b8cf-0800274fedc6_0
26f6f9b98f5a        cilium/cilium                 "cilium-agent --debu…"    About an hour ago    Up About an hour                                        k8s_cilium-agent_cilium-db2mx_kube-system_baa559da-709d-11e8-b8cf-0800274fedc6_0
33db50b78b93        8c811b4aec35                  "sh -c 'if [ \"${CLEA…"   About an hour ago    Exited (0) About an hour ago                            k8s_clean-cilium-state_cilium-db2mx_kube-system_baa559da-709d-11e8-b8cf-0800274fedc6_0
2f38b0d512df        k8s.gcr.io/pause-amd64:3.1    "/pause"                  About an hour ago    Up About an hour                                        k8s_POD_cilium-db2mx_kube-system_baa559da-709d-11e8-b8cf-0800274fedc6_0
8738c05c7b2f        k8s.gcr.io/pause-amd64:3.1    "/pause"                  2 hours ago          Up 2 hours                                              k8s_POD_kube-dns-59d8c5f9b5-7m225_kube-system_f5ca6dd2-7092-11e8-8380-0800274fedc6_1
ee091faeaca0        4261d315109d                  "/usr/local/bin/kube…"    2 hours ago          Up 2 hours                                              k8s_kube-proxy_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_1
bf7736dcf6d9        k8s.gcr.io/pause-amd64:3.1    "/pause"                  2 hours ago          Up 2 hours                                              k8s_POD_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_1
e439b65e1f1a        k8s.gcr.io/kube-proxy-amd64   "/usr/local/bin/kube…"    2 hours ago          Exited (255) 2 hours ago                                k8s_kube-proxy_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_0
a4965ede214f        k8s.gcr.io/pause-amd64:3.1    "/pause"                  2 hours ago          Exited (255) 2 hours ago                                k8s_POD_kube-proxy-5tfmw_kube-system_f56a5b30-7096-11e8-8380-0800274fedc6_0
c6fb55808bad        registry:2                    "/entrypoint.sh /etc…"    9 days ago           Up 2 hours                     0.0.0.0:5000->5000/tcp   registry
vagrant@k8s2:~$ docker inspect 55ffe31ac578
[
    {
        "Id": "sha256:55ffe31ac5789896865300074e914ff3b699050663a096f757f8ecc1d71c8aad",
        "RepoTags": [
            "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
        ],
        "RepoDigests": [
            "k8s.gcr.io/k8s-dns-kube-dns-amd64@sha256:b99fc3eee2a9f052f7eb4cc00f15eb12fc405fa41019baa2d6b79847ae7284a8"
        ],
        "Parent": "",
        "Comment": "",
        "Created": "2018-04-16T21:58:45.624829023Z",
        "Container": "afa9187fd1abd28f56f4b399251f7154bfb82ec30957bb42d6e8ff5ac7f1b612",
        "ContainerConfig": {
            "Hostname": "afa9187fd1ab",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": [
                "/bin/sh",
                "-c",
                "#(nop) ",
                "ENTRYPOINT [\"/kube-dns\"]"
            ],
            "ArgsEscaped": true,
            "Image": "sha256:f2b4439632672cabd98dfb861fb1ee420568876ec7ccf23c63928462329f8f1e",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": [
                "/kube-dns"
            ],
            "OnBuild": null,
            "Labels": {}
        },
        "DockerVersion": "17.12.0-ce",
        "Author": "Bowei Du <[email protected]>",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": null,
            "ArgsEscaped": true,
            "Image": "sha256:f2b4439632672cabd98dfb861fb1ee420568876ec7ccf23c63928462329f8f1e",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": [
                "/kube-dns"
            ],
            "OnBuild": null,
            "Labels": null
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 49549457,
        "VirtualSize": 49549457,
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/05f015ed83f0947b896a8865466c9d84969b0054e2d71b1820375cf1818061da/diff",
                "MergedDir": "/var/lib/docker/overlay2/120258e0074e6dc222397c7de56dc01f5c89662d3ed4061c4b79b5e4fd97fc18/merged",
                "UpperDir": "/var/lib/docker/overlay2/120258e0074e6dc222397c7de56dc01f5c89662d3ed4061c4b79b5e4fd97fc18/diff",
                "WorkDir": "/var/lib/docker/overlay2/120258e0074e6dc222397c7de56dc01f5c89662d3ed4061c4b79b5e4fd97fc18/work"
            },
            "Name": "overlay2"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:cd7100a72410606589a54b932cabd804a17f9ae5b42a1882bd56d263e02b6215",
                "sha256:1a0adecd63853941c0436bb4e1299c2ebebfd89f0b25c2e93c1cd7d518d12b89"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]

@jrajahalme I noticed you rebase after the failure. When it failed, did your branch include https://github.com/cilium/cilium/pull/4512 ?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tklauser picture tklauser  Â·  3Comments

brb picture brb  Â·  4Comments

christarazi picture christarazi  Â·  4Comments

aanm picture aanm  Â·  3Comments

danwent picture danwent  Â·  4Comments