Cilium: CI: ip-identity watcher stopped receiving events, causing egress policy test failures

Created on 1 May 2018  路  20Comments  路  Source: cilium/cilium

Looking at the CI run out put here:
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/2839/testReport/junit/(root)/runtime/RuntimeValidatedPolicies_L3_L4_Checks/

monitor.log shows (showing only relevant lines):

[2018-04-30 22:10:30] CPU 01: MARK 0xd943dbd1 FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:31] CPU 01: MARK 0xd1f1055e FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:33] CPU 01: MARK 0x6def8c8 FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:37] CPU 01: MARK 0x7243a868 FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:38] CPU 01: MARK 0xb1e679db FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:40] CPU 01: MARK 0xbb1aa16e FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:42] CPU 01: MARK 0xc601188 FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:43] CPU 01: MARK 0x68a2275c FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:45] CPU 01: MARK 0xca8dbcd9 FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:48] CPU 01: MARK 0xe9251566 FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:49] CPU 01: MARK 0x5daf9c85 FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:51] CPU 01: MARK 0x9a98f2ba FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity

Both the IPv4 (10.15.83.26) and IPv6 (::0:f426 should match f00d::a0f:0:0:f426?) belong to the container:id.httpd2 (from test-output.log):

17607      Disabled           Enabled           49868      container:id.app3             f00d::a0f:0:0:44c7   10.15.195.104   ready   
17990      Disabled           Enabled           10211      container:id.app2             f00d::a0f:0:0:4646   10.15.106.144   ready   
18544      Enabled            Disabled          37841      container:id.httpd1           f00d::a0f:0:0:4870   10.15.157.163   ready   
                                                           container:id.service1                                                      
28441      Disabled           Disabled          61150      container:id.app1             f00d::a0f:0:0:6f19   10.15.77.25     ready   
29898      Disabled           Disabled          4319       reserved:health               f00d::a0f:0:0:74ca   10.15.242.54    ready   
49896      Disabled           Disabled          37971      container:id.httpd3           f00d::a0f:0:0:c2e8   10.15.152.195   ready   
                                                           container:id.service1                                                      
62502      Enabled            Disabled          25083      container:id.httpd2           f00d::a0f:0:0:f426   10.15.83.26     ready   
                                                           container:id.service1                                                      

And indeed, bpf ipcache does not have entry for these IPs (from cilium-bpf-ipcache-list.md):

IP PREFIX/ADDRESS    IDENTITY   
f00d::a0f:0:0:1150   44408      
10.15.202.84         44408      
fc00::10ca:1         1          
10.15.28.238         1          
f00d::a0f:0:0:74ca   4319       
10.0.2.15            1          
10.15.242.54         4319       
f00d::a0f:0:0:8ad6   1          

But then the bpf ipcache seems to be missing all other endpoints as well.

I did not know how to interpret kvstore_get.txt, but it contains this one line matching the IPv4 address of httpd2 that can't be resolved to an ID:

cilium/state/ip/v1/default/10.15.83.26 => {"IP":"10.15.83.26","ID":25083,"Metadata":"cilium-global:default:runtime:62502"}
arepolicy kinbug priorithigh

Most helpful comment

@joestringer can we debug this together tomorrow?

All 20 comments

@joestringer can we debug this together tomorrow?

Logs from when test failed:

22:10:43 STEP: Client "app2" attempting to curl public URL on httpd2
22:10:43 time="2018-04-30T22:10:42Z" level=debug msg="running command: docker exec -i  app2 curl -s --fail --connect-timeout 5 --max-time 5 http://[f00d::a0f:0:0:f426]:80/public"
22:10:47 time="2018-04-30T22:10:47Z" level=debug msg="running command: sudo docker inspect app3"
22:10:47 time="2018-04-30T22:10:47Z" level=debug msg="running command: sudo docker inspect httpd2"
22:10:48 STEP: Client "app3" attempting to curl public URL on httpd2
22:10:48 time="2018-04-30T22:10:47Z" level=debug msg="running command: docker exec -i  app3 curl -s --fail --connect-timeout 5 --max-time 5 http://10.15.83.26:80/public"
22:10:54 time="2018-04-30T22:10:53Z" level=debug msg="running command: sudo journalctl -au cilium --since '123.008720135 seconds ago'"
22:10:54 ===================== TEST FAILED =====================

Policy was imported around this time:

22:10:10 time="2018-04-30T22:10:10Z" level=info msg="validating policy before importing" file-path=/home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l3-policy.json test=RunPolicies
22:10:10 time="2018-04-30T22:10:10Z" level=debug msg="running command: sudo cilium policy validate /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l3-policy.json"
22:10:10 time="2018-04-30T22:10:10Z" level=debug msg="running command: sudo cilium policy import /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l3-policy.json"
22:10:16 time="2018-04-30T22:10:15Z" level=debug msg="running command: sudo cilium policy get -o json | jq '.revision'"
22:10:16 time="2018-04-30T22:10:16Z" level=debug msg="running command: sudo cilium policy wait 64"
22:10:21 time="2018-04-30T22:10:20Z" level=debug msg="running command: sudo cilium policy get -o json | jq '.revision'"
22:10:21 time="2018-04-30T22:10:20Z" level=info msg="policy import finished and revision increased" file-

httpd2 row in cilium endpoint list:

62502      Enabled            Disabled          25083      container:id.httpd2           f00d::a0f:0:0:f426   10.15.83.26     ready   
                                                           container:id.service1  

There are no more kvstore events for the ipcache received after 22:03:17:

$ grep "received event" cilium.log
Apr 30 22:02:22 runtime cilium-agent[17849]: level=debug msg="received event" key=cilium/state/ip/v1/default/10.15.129.219 kvstore-event=delete
Apr 30 22:02:35 runtime cilium-agent[17849]: level=debug msg="received event" key=cilium/state/ip/v1/default/10.15.202.84 kvstore-event=create
Apr 30 22:02:35 runtime cilium-agent[17849]: level=debug msg="received event" key="cilium/state/ip/v1/default/f00d::a0f:0:0:1150" kvstore-event=create
Apr 30 22:03:17 runtime cilium-agent[17849]: level=debug msg="received event" key=cilium/state/ip/v1/default/10.15.242.54 kvstore-event=modify
Apr 30 22:03:17 runtime cilium-agent[17849]: level=debug msg="received event" key="cilium/state/ip/v1/default/f00d::a0f:0:0:74ca" kvstore-event=modify

@joestringer and I wanted to see what the state of the key-value store watcher goroutine was, but unfortunately, the bugtool output for gops doesn't work:

Error while running 'gops stack $(pidof cilium-agent)': exit status 1

gops stack $(pidof cilium-agent)

Couldn't resolve addr or pid $(pidof to TCPAddress: couldn't parse PID: strconv.Atoi: parsing "$(pidof": invalid syntax

The addition of gops in bugtool seems to have not been tested. Fix for this coming soon, which will also require a backport.

We can see why the ipcache wasn't updated (events weren't received by the watcher), but we don't know what state the watcher was in at the time of failure. So, once we get the bugtool fix in, we can triage again accordingly with the stack trace. @jrajahalme does this reproduce consistently for you?

On May 1, 2018, at 3:18 PM, Ian Vernon notifications@github.com wrote:

httpd2 row in cilium endpoint list:

62502 Enabled Disabled 25083 container:id.httpd2 f00d::a0f:0:0:f426 10.15.83.26 ready
container:id.service1

app3 has an label-based L3/L4 egress policy, I think the drop happens there, before the packet gets to the ingress at all. Destination IP can not be mapped to the proper securiry ID because the destination IP is missing from the bpf ip map.

On May 1, 2018, at 3:44 PM, Ian Vernon notifications@github.com wrote:

We can see why the ipcache wasn't updated (events weren't received by the watcher), but we don't know what state the watcher was in at the time of failure. So, once we get the bugtool fix in, we can triage again accordingly with the stack trace. @jrajahalme https://github.com/jrajahalme does this reproduce consistently for you?

I have not seen this failure before and have not seen it reproduce either.

However, the same PR failed for the next build as well: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/2861/ https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/2861/

But the failure is in a different test (egress to world (google.com http://google.com/). From build artifacts it seems the L3L4 test was not run due to test order randomization.

@ianvernon Was the kvstore_get.txt line I quoted in the end of the description something else and not relevant for this?

Do we have evidence that further events should have been received by the watcher?

@joestringer @ianvernon Could we apply controller or some other method to keep an eye on watcher health?

@jrajahalme the kvstore_get.txt line looks reasonable to me, it states that the relevant IP is being mapped to a particular Identity.

However we also noticed that the IP/identity mapping was reused after an earlier test.

I'm not 100% sure that further events should have been received by the watcher, it's plausible that the entry remains in the kvstore after the delete for a while due to underlying kvstore implementation. Not confident on this part though. (@tgraf)

@jrajahalme "Who watches the watcher", eh? ;) Sounds like a reasonable idea to me.

Hit this in #3835 too. There is evidence that the kvstore is updated with the relevant IP->ID mapping, however the IP->ID kvstore watcher does not appear to have received events for about five minutes. Monitor shows packets being dropped due to the identity mapping to the cluster ID rather than the endpoint IP's ID. Suspecting that in some cases, consul is not properly triggering create events.

Note that the PR #3835 adds CIDR-dependent L4, so the ipcache entries are extended with masks. One of the implications is that the "failure to map daddr" in the monitor becomes a success to map the daddr, to ID 3 (cluster). This is still wrong, it should map to an actual endpoint identity.

Link:
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/3029/

test_results_Cilium-PR-Ginkgo-Tests-Validated_3029.zip

Failure:

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:312
"app3" expects curl public URL on httpd2 (10.15.77.25) to succeed
Expected
    <bool>: false
to be true
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/src/github.com/cilium/cilium/test/runtime/Policies.go:377

Logs at time:

锟絒1mSTEP锟絒0m: Client "app3" attempting to curl public URL on httpd2
time="2018-05-09T00:42:35Z" level=debug msg="running command: docker exec -i  app3 curl -s --fail --connect-timeout 5 --max-time 5 http://10.15.77.25:80/public"

Monitor drop:

[2018-05-09 00:42:37] ------------------------------------------------------------------------------
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 from-endpoint: 74 bytes, state new, identity 18020->0
[2018-05-09 00:42:37] Ethernet>.{Contents=[..14..] Payload=[..62..] SrcMAC=1e:5c:cb:a3:8e:3d DstMAC=a2:56:61:50:2e:74 EthernetType=IPv4 Length=0}
[2018-05-09 00:42:37] IPv4>.....{Contents=[..20..] Payload=[..40..] Version=4 IHL=5 TOS=0 Length=60 Id=32336 Flags=DF FragOffset=0 TTL=64 Protocol=TCP Checksum=2075 SrcIP=10.15.83.26 DstIP=10.15.77.25 Options=[] Padding=[]}
[2018-05-09 00:42:37] TCP>......{Contents=[..40..] Payload=[] SrcPort=54632 DstPort=80(http) Seq=1746970306 Ack=0 DataOffset=10 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=28200 Checksum=46207 Urgent=0 Options=[..5..] Padding=[]}
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Conntrack lookup 1/2: src=10.15.83.26:54632 dst=10.15.77.25:80
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: CT verdict: New, revnat=0
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Successfully mapped daddr=10.15.77.25 to identity=3
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Policy evaluation would deny packet from 18020 to 3
[2018-05-09 00:42:37] ------------------------------------------------------------------------------
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DROP: 74 bytes, reason Policy denied (CIDR), to ifindex 0, identity 18020->0
[2018-05-09 00:42:37] Ethernet>.{Contents=[..14..] Payload=[..62..] SrcMAC=1e:5c:cb:a3:8e:3d DstMAC=a2:56:61:50:2e:74 EthernetType=IPv4 Length=0}
[2018-05-09 00:42:37] IPv4>.....{Contents=[..20..] Payload=[..40..] Version=4 IHL=5 TOS=0 Length=60 Id=32336 Flags=DF FragOffset=0 TTL=64 Protocol=TCP Checksum=2075 SrcIP=10.15.83.26 DstIP=10.15.77.25 Options=[] Padding=[]}
[2018-05-09 00:42:37] TCP>......{Contents=[..40..] Payload=[] SrcPort=54632 DstPort=80(http) Seq=1746970306 Ack=0 DataOffset=10 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=28200 Checksum=46207 Urgent=0 Options=[..5..] Padding=[]} 

Cilium log for upsert:

$ grep 10.15.77.25 cilium.log | grep upsert
May 09 00:41:47 runtime cilium-agent[11337]: level=debug msg="upserting IP->ID mapping to kvstore" identity=26401 ipAddr=10.15.77.25 ipMask="<nil>" modification=Upsert

KVstore:

$ grep 10.15.77.25 kvstore_get.txt 
cilium/state/ip/v1/default/10.15.77.25 => {"IP":"10.15.77.25","Mask":null,"ID":26401,"Metadata":"cilium-global:default:runtime:28441"}

Logs are missing from the daemon for about 12 seconds at a relevant time

May 09 00:41:50 runtime cilium-agent[11337]: level=debug msg="Evaluating ingress context for source PolicyID" containerID=c510876817 endpointID=28441 ingress_context="{0 0 <nil> [cidr:9.9.0.0/19 cidr:0.0.0.0/1 cidr:0.0.0.0/3 cidr:9.8.0.0/15 cidr:9.9.0.0/17 cidr:9.9.9.0/28 reserved:world cidr:9.9.9.8/31 cidr:9.9.9.9/3#
May 09 00:41:50 runtime cilium-agent[11337]: level=debug msg="Evaluating egress context for source PolicyID" containerID=c510876817 egress_context="{0 0 <nil> [container:id.service1 container:id.httpd2] [cidr:9.9.0.0/19 cidr:0.0.0.0/1 cidr:0.0.0.0/3 cidr:9.8.0.0/15 cidr:9.9.0.0/17 cidr:9.9.9.0/28 reserved:world cidr:#
May 09 00:42:02 runtime cilium-agent[11337]: level=debug msg="GET /endpoint request" params="{HTTPRequest:0xc42178c900 Labels:[]}"
May 09 00:42:02 runtime cilium-agent[11337]: level=info msg="New monitor connected." count.listener=1

gops suggests that there have been no updates from the kvstore:

goroutine 345 [chan receive, 5 minutes]:                                        
github.com/cilium/cilium/pkg/ipcache.ipIdentityWatcher(0xc4216cd440, 0x2, 0x2)  
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:508 +0x139
created by github.com/cilium/cilium/pkg/ipcache.InitIPIdentityWatcher.func1     
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:643 +0x50 

Also interesting, another goroutine is blocking on kvstore operation:

goroutine 24148 [select, 3 minutes]:                                            
net/http.(*persistConn).roundTrip(0xc423f6cc60, 0xc42188c4b0, 0x0, 0x0, 0x0)    
        /usr/local/go/src/net/http/transport.go:1970 +0x608                     
net/http.(*Transport).RoundTrip(0xc4204e22d0, 0xc423a29b00, 0xc4204e22d0, 0x0, 0x0)
        /usr/local/go/src/net/http/transport.go:413 +0x992                      
net/http.send(0xc423a29b00, 0x2def520, 0xc4204e22d0, 0x0, 0x0, 0x0, 0xc420088ba0, 0x7ffff7fa8000, 0xc42407f890, 0x1)
        /usr/local/go/src/net/http/client.go:249 +0x1a9                         
net/http.(*Client).send(0xc422187b90, 0xc423a29b00, 0x0, 0x0, 0x0, 0xc420088ba0, 0x0, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:173 +0xfd                          
net/http.(*Client).Do(0xc422187b90, 0xc423a29b00, 0x2eb9400, 0x0, 0x0)          
        /usr/local/go/src/net/http/client.go:602 +0x28d                         
github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api.(*Client).doRequest(0xc42217cea0, 0xc42333fbd0, 0xc42407fbc0, 0xc421ed80f0, 0x2e, 0xc42333fbd0)
        /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api/api.go:368 +0xbe
github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api.(*KV).deleteInternal(0xc42407fcc0, 0xc421ed80c0, 0x27, 0x0, 0x0, 0xc4204a0e00, 0x0, 0x0, 0x0)
        /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api/kv.go:225 +0x296
github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api.(*KV).Delete(0xc42407fcc0, 0xc421ed80c0, 0x27, 0x0, 0xc421ed80c0, 0x6331000000000027, 0x26)
        /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api/kv.go:200 +0x52
github.com/cilium/cilium/pkg/kvstore.(*consulClient).Delete(0xc420088f88, 0xc421ed80c0, 0x27, 0x0, 0x0)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/kvstore/consul.go:483 +0x55
github.com/cilium/cilium/pkg/kvstore.Delete(0xc421ed80c0, 0x27, 0x3, 0xc421ed80c0)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/kvstore/kvstore.go:89 +0x56
github.com/cilium/cilium/pkg/ipcache.DeleteIPFromKVStore(0xc421f1beb4, 0xc, 0xc, 0xc42407fe50)                                                                                                                                                                                                                                 
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:227 +0xbc
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runIPIdentitySync.func2(0xc42333fb30, 0xc42407ff28)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:930 +0x63
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc423c9b200)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/controller/controller.go:233 +0x398
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController  
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/controller/manager.go:81 +0x2c8

Unfortunately the consul log appears to be clean, no indication of errors in it.

@ianvernon to prepare a commit to wait for IPs to exist in the BPF datapath as a way to:

  • Catch eventual consistency issues
  • Cause the test to fail in this wait logic if the kvstore sync is misbehaving

Ticket tracking waiting for IPs to exist in BPF ipcache: https://github.com/cilium/cilium/issues/4055

Logs are missing from the daemon for about 12 seconds at a relevant time

This should hopefully be fixed by #4054 .

@tgraf do you have any recommendations on further triage we can do of this from the key-value store side? in the logs provided by @joestringer above, cilium status appears to be fine:

Cilium status

KVStore:                Ok         Consul: 172.17.0.3:8300
ContainerRuntime:       Ok         
Kubernetes:             Disabled   
Cilium:                 Ok         OK
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok   
IPv4 address pool:      9/65535 allocated
  10.15.28.238
  10.15.77.25
  10.15.83.26
  10.15.106.144
  10.15.152.195
  10.15.157.163
  10.15.195.104
  10.15.242.54
  10.15.247.232
IPv6 address pool:   8/65535 allocated
  f00d::a0f:0:0:44c7
  f00d::a0f:0:0:4646
  f00d::a0f:0:0:4870
  f00d::a0f:0:0:6f19
  f00d::a0f:0:0:74ca
  f00d::a0f:0:0:8ad6
  f00d::a0f:0:0:c2e8
  f00d::a0f:0:0:f426
Controller Status:   31/31 healthy
  Name                                 Last success   Last error   Count   Message
  ipcache-bpf-garbage-collection       3m23s ago      never        0       no error   
  kvstore-lease-keepalive              3m24s ago      never        0       no error   
  lxcmap-bpf-host-sync                 2s ago         never        0       no error   
  resolve-identity-17607               51s ago        never        0       no error   
  resolve-identity-17990               52s ago        never        0       no error   
  resolve-identity-18544               59s ago        never        0       no error   
  resolve-identity-28441               54s ago        never        0       no error   
  resolve-identity-29898               3m19s ago      never        0       no error   
  resolve-identity-49896               55s ago        never        0       no error   
  resolve-identity-62502               57s ago        never        0       no error   
  sync-IPv4-identity-mapping (17607)   51s ago        never        0       no error   
  sync-IPv4-identity-mapping (17990)   52s ago        never        0       no error   
  sync-IPv4-identity-mapping (18544)   59s ago        never        0       no error   
  sync-IPv4-identity-mapping (28441)   54s ago        never        0       no error   
  sync-IPv4-identity-mapping (29898)   3m19s ago      never        0       no error   
  sync-IPv4-identity-mapping (49896)   55s ago        never        0       no error   
  sync-IPv4-identity-mapping (62502)   57s ago        never        0       no error   
  sync-IPv6-identity-mapping (17607)   51s ago        never        0       no error   
  sync-IPv6-identity-mapping (17990)   52s ago        never        0       no error   
  sync-IPv6-identity-mapping (18544)   59s ago        never        0       no error   
  sync-IPv6-identity-mapping (28441)   54s ago        never        0       no error   
  sync-IPv6-identity-mapping (29898)   3m19s ago      never        0       no error   
  sync-IPv6-identity-mapping (49896)   55s ago        never        0       no error   
  sync-IPv6-identity-mapping (62502)   57s ago        never        0       no error   
  sync-identity-to-k8s-pod (17607)     51s ago        never        0       no error   
  sync-identity-to-k8s-pod (17990)     52s ago        never        0       no error   
  sync-identity-to-k8s-pod (18544)     59s ago        never        0       no error   
  sync-identity-to-k8s-pod (28441)     54s ago        never        0       no error   
  sync-identity-to-k8s-pod (29898)     19s ago        never        0       no error   
  sync-identity-to-k8s-pod (49896)     55s ago        never        0       no error   
  sync-identity-to-k8s-pod (62502)     57s ago        never        0       no error   
Proxy Status:   OK, ip 10.15.28.238, port-range 10000-20000

What are our options for checking Consul's health? Could we add a controller which does this? From the stacktrace above provided by Joe, Consul is stuck in the roundTrip function :/

Unassigning myself from this as we have RCA'd this to be an issue with Consul (or perhaps our code which uses the Consul library).

We haven't seen this failure in quite some time. Given that we have updated the version of Consul used in the CI, we will see if we hit this again. We can reopen if needed.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

lrouter picture lrouter  路  3Comments

tgraf picture tgraf  路  3Comments

aanm picture aanm  路  3Comments

twpayne picture twpayne  路  3Comments

tklauser picture tklauser  路  3Comments