Cilium: CI: xDS hosts cache has duplicate IP/policy entries for the host IP. (L7Policies test failed due to error configuring proxy redirects)

Created on 29 May 2018  路  40Comments  路  Source: cilium/cilium

Build link: https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2628/testReport/junit/(root)/runtime/RuntimeValidatedPolicies_L7_Checks/

Stacktrace:

/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:373
Expected
    <*errors.errorString | 0xc4201985a0>: {
        s: "Timeout reached: could not import policy",
    }
to be nil
/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/vendor/github.com/onsi/gomega/internal/assertion/assertion.go:27

Stdout:

STEP: Setting PolicyEnforcement=default
锟絒1mSTEP锟絒0m: Setting up policy: /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l7-simple.json
STEP: Simple Ingress
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Simple Egress
STEP: Client "app2" attempting to "curl public URL on" "httpd2"
STEP: Client "app2" attempting to "curl public URL on" "httpd2"
STEP: Client "app2" attempting to "curl private URL on" "httpd2"
STEP: Client "app2" attempting to "curl private URL on" "httpd2"
STEP: Disabling all the policies. All should work
STEP: Client "app1" attempting to "ping" "httpd1"
STEP: Client "app1" attempting to "ping" "httpd1"
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app2" attempting to "ping" "httpd1"
STEP: Client "app2" attempting to "ping" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Client "app2" attempting to "curl private URL on" "httpd1"
STEP: Client "app2" attempting to "curl private URL on" "httpd1"
STEP: Multiple Ingress
锟絒1mSTEP锟絒0m: Setting up policy: /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l7-multiple.json
===================== TEST FAILED =====================
cmd: sudo cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                     
10572      Disabled           Disabled          4          reserved:health               f00d::a0f:0:0:294c   10.15.91.32     ready       
16275      Enabled            Disabled          48404      container:id.httpd2           f00d::a0f:0:0:3f93   10.15.30.223    ready       
                                                           container:id.service1                                                          
31770      Disabled           Disabled          7322       container:id.app1             f00d::a0f:0:0:7c1a   10.15.25.236    ready       
40645      Disabled           Disabled          47704      container:id.app3             f00d::a0f:0:0:9ec5   10.15.169.53    ready       
42753      Disabled           Enabled           300        container:id.app2             f00d::a0f:0:0:a701   10.15.128.220   not-ready   
46043      Enabled            Disabled          5899       container:id.httpd1           f00d::a0f:0:0:b3db   10.15.200.94    not-ready   
                                                           container:id.service1                                                          
51876      Disabled           Disabled          49623      container:id.httpd3           f00d::a0f:0:0:caa4   10.15.244.249   ready       
                                                           container:id.service1                                                          

===================== EXITING REPORT GENERATION =====================

Logs:
9acb0442_RuntimeValidatedPolicies_L7_Checks.zip

Endpoints in not-ready state due to proxy-redirect configuration failure:

Endpoint Log 42753

Timestamp              Status    State                   Message
2018-05-29T14:47:55Z   Failure   ready                   Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-05-29T14:47:55Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-29T14:47:45Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
...
areCI kinbug priorithigh prioritrelease-blocker

All 40 comments

cc: @rlenglet @jrajahalme

May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-upstream thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-config thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="received request from xDS stream" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"cbd93a042e08d55c286949d7282eace2cba2836a/1.7.0-dev/Modified/RELEASE\" " xdsNonce=6 xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.api.v2.Listener xdsVersionInfo=6
May 29 14:47:45 runtime cilium-agent[8636]: level=warning msg="NACK received for version 10; waiting for a version update before sending again" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"cbd93a042e08d55c286949d7282eace2cba2836a/1.7.0-dev/Modified/RELEASE\" " xdsNonce=6 xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.api.v2.Listener xdsVersionInfo=6

4056 should have fixed this?

Grepping for the port number:

May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="Created new http proxy instance" id="42753:egress:TCP:8080" obj="&{ProxyPort:15714 endpointID:42753 id:42753:egress:TCP:8080 ingress:false localEndpoint:0xc420217680 parserType:http created:{wall:13743710510708103467 ext:831416539703 loc:0x3c788a0} implementation:0xc4243563e0 mutex:{internalRWMutex:{RWMutex:{mu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0}}}} lastUpdated:{wall:13743710510708103560 ext:831416539787 loc:0x3c788a0} rules:map[{LabelSelector:0xc420376c00}:{HTTP:[{Path:/public Method:GET Host: Headers:[]}] Kafka:[]}]}"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-upstream thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-config thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-upstream thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-config thread=20977"

Seems like this was the first allocation of that port, at least in the logs available.

Bugtool newtstat--a.md shows nothing for port 15714.

This is a master build on which the last commit updated the kernel version via a new packer box version (77). Could it be possible that there is a race between the socket.Close() we call from cilium-agent and the socket actually being available for new bind() call from Envoy right after, even when both call sites set the SO_REUSEADDR socket option?

Has the OS been updated with the Vagrant box, besides the kernel? I would check that the range of local ports allocated by Linux hasn't changed.

sysctl net.ipv4.ip_local_port_range

If that overlaps now with the proxy redirect port range configured in Cilium, we need to change that range configuration.

I hit this issue locally and have the environment for it preserved.

@ianvernon I need to restart with the new box version.
In the meantime, could you please run sysctl net.ipv4.ip_local_port_range and post the output?

@rlenglet :

vagrant@runtime:~$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    60999

Ok, that looks fine.

This was hit in a CI test run with debug logs enabled.

Link: https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2653/testReport/junit/(root)/runtime/RuntimeValidatedLB_Services_Policies_tests_with_service_performing_L4_port_mapping/

Stacktrace:

/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:373
Expected
    <*errors.errorString | 0xc42037c3f0>: {
        s: "Timeout reached: could not import policy",
    }
to be nil
/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/vendor/github.com/onsi/gomega/internal/assertion/assertion.go:27

Endpoint was in not-ready state:

Setting up policy: /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l7-multiple.json
===================== TEST FAILED =====================
cmd: sudo cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                     
5194       Disabled           Disabled          33846      container:id.app1             f00d::a0f:0:0:144a   10.15.236.185   ready       
19041      Disabled           Enabled           15862      container:id.app2             f00d::a0f:0:0:4a61   10.15.253.22    ready       
33652      Disabled           Disabled          4          reserved:health               f00d::a0f:0:0:8374   10.15.175.138   ready       
39028      Enabled            Disabled          9971       container:id.httpd1           f00d::a0f:0:0:9874   10.15.11.153    not-ready   
                                                           container:id.service1                                                          
56104      Disabled           Disabled          20799      container:id.httpd3           f00d::a0f:0:0:db28   10.15.254.46    ready       
                                                           container:id.service1                                                          
62489      Enabled            Disabled          2546       container:id.httpd2           f00d::a0f:0:0:f419   10.15.191.26    ready       
                                                           container:id.service1                                                          
63771      Disabled           Disabled          61329      container:id.app3             f00d::a0f:0:0:f91b   10.15.90.89     ready    

Logs: 9acb0442_RuntimeValidatedLB_Services_Policies_tests_with_service_performing_L4_port_mapping.zip

From the bugtool output in the attached logs, Debug is enabled for the daemon:

cilium config

Conntrack                Enabled
ConntrackAccounting      Enabled
ConntrackLocal           Disabled
Debug                    Enabled
DropNotification         Enabled
PolicyTracing            Disabled
TraceNotification        Enabled
k8s-configuration        
k8s-endpoint             
PolicyEnforcement        default
MonitorNumPages          64

From the endpoint log for the endpoint which was in not-ready state, proxy redirects were unable to be configured:

Endpoint Log 39028

Timestamp              Status    State                   Message
2018-05-31T17:08:42Z   Failure   ready                   Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-05-31T17:08:42Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:32Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
...

Found another issue:

https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2653/

Endpoint Log 39028

Timestamp              Status    State                   Message
2018-05-31T17:08:42Z   Failure   ready                   Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-05-31T17:08:42Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:32Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:32Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:08:31Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:08:31Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:29Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:29Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:08:20Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:08:20Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:17Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:17Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:08:10Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:08:10Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:08Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:08Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:59Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:59Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:57Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:57Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:55Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:55Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:53Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:53Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:44Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:44Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:42Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:42Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:35Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:35Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:35Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:35Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:33Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:33Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:32Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:32Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:22Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:22Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:20Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:20Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:18Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:18Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:16Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:16Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:07Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:07Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:05Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:05Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:58Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:06:58Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:58Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:06:58Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:58Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:06:58Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:58Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:06:57Z   OK        waiting-to-regenerate   Successfully regenerated endpoint program due to updated security labels
2018-05-31T17:06:57Z   Warning   waiting-to-regenerate   Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:56Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:56Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:55Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:54Z   OK        regenerating            Regenerating Endpoint BPF: updated security labels
2018-05-31T17:06:54Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:54Z   OK        waiting-to-regenerate   Successfully regenerated endpoint program due to updated security labels
2018-05-31T17:06:54Z   Warning   waiting-to-regenerate   Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:53Z   OK        waiting-to-regenerate   Triggering regeneration due to new identity
2018-05-31T17:06:52Z   OK        regenerating            Regenerating Endpoint BPF: updated security labels
2018-05-31T17:06:52Z   OK        waiting-to-regenerate   Triggering regeneration due to new identity
2018-05-31T17:06:52Z   OK        ready                   Set identity for this endpoint

Endpoint Get 39028

[
  {
    "id": 39028,
    "spec": {
      "label-configuration": {
        "user": []
      },
      "options": {
        "Conntrack": "Enabled",
        "ConntrackAccounting": "Enabled",
        "ConntrackLocal": "Disabled",
        "Debug": "Enabled",
        "DebugLB": "Disabled",
        "DropNotification": "Enabled",
        "EgressPolicy": "Disabled",
        "IngressPolicy": "Enabled",
        "NAT46": "Disabled",
        "TraceNotification": "Enabled"
      }
    },
    "status": {
      "controllers": [
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "resolve-identity-39028",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-05-31T17:11:53.049Z",
            "success-count": 2
          },
          "uuid": "03ed0715-64f5-11e8-b6bd-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "sync-IPv4-identity-mapping (39028)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-05-31T17:11:53.061Z",
            "success-count": 2
          },
          "uuid": "03ed9611-64f5-11e8-b6bd-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "sync-IPv6-identity-mapping (39028)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-05-31T17:11:53.062Z",
            "success-count": 2
          },
          "uuid": "03ed9724-64f5-11e8-b6bd-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "1m0s"
          },
          "name": "sync-identity-to-k8s-pod (39028)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-05-31T17:13:53.049Z",
            "success-count": 8
          },
          "uuid": "03ed94e5-64f5-11e8-b6bd-080027aa233b"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "1m0s"
          },
          "name": "sync-policymap-39028",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2018-05-31T17:14:32.859Z",
            "success-count": 7
          },
          "uuid": "3f660389-64f5-11e8-b6bd-080027aa233b"
        }
      ],
      "external-identifiers": {
        "container-id": "c4da6d8033619a83448bcc97cc935df4c1b86936008153d502b0ac28c6200148",
        "container-name": "httpd1",
        "docker-endpoint-id": "32531d626fc09c96cb85a7cf6eb86b6d897d52b4a78a3ddb527a3cc0c3ecc40f",
        "docker-network-id": "30712b8c887fef9688f10ba73d00081a32b19ead87d5feaad282559314289c78",
        "pod-name": ":"
      },
      "health": {
        "bpf": "Warning",
        "connected": true,
        "overallHealth": "Warning",
        "policy": "Warning"
      },
      "identity": {
        "id": 9971,
        "labels": [
          "container:id.httpd1",
          "container:id.service1"
        ],
        "labelsSHA256": "66a311c75290026d8ed6035cdb9dabddbc9336152aa6aadbdbbbbbf5e0cca9f5"
      },
      "labels": {
        "derived": [],
        "disabled": [],
        "realized": {
          "user": []
        },
        "security-relevant": [
          "container:id.httpd1",
          "container:id.service1"
        ]
      },
      "log": [
        {
          "code": "Failure",
          "message": "Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded",
          "state": "ready",
          "timestamp": "2018-05-31T17:08:42Z"
        }
      ],
      "networking": {
        "addressing": [
          {
            "ipv4": "10.15.11.153",
            "ipv6": "f00d::a0f:0:0:9874"
          }
        ],
        "host-mac": "02:c5:2c:41:4d:e8",
        "interface-index": 65,
        "interface-name": "lxc32531",
        "mac": "b2:ea:f0:8d:66:03"
      },
      "policy": {
        "proxy-statistics": [
          {
            "allocated-proxy-port": 15054,
            "location": "ingress",
            "port": 80,
            "protocol": "http",
            "statistics": {
              "requests": {
                "denied": 2,
                "forwarded": 5,
                "received": 7
              },
              "responses": {
                "forwarded": 5,
                "received": 5
              }
            }
          },
          {
            "allocated-proxy-port": 11766,
            "location": "ingress",
            "port": 8080,
            "protocol": "http",
            "statistics": {
              "requests": {},
              "responses": {}
            }
          }
        ],
        "realized": {
          "allowed-egress-identities": [
            2,
            2546,
            33846,
            5,
            20799,
            3,
            15862,
            61329,
            43088,
            1,
            4,
            9971
          ],
          "allowed-ingress-identities": [
            1
          ],
          "build": 24,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "id": 9971,
          "l4": {
            "egress": [],
            "ingress": []
          },
          "policy-enabled": "ingress",
          "policy-revision": 24
        },
        "spec": {
          "allowed-egress-identities": [
            43088,
            9971,
            33846,
            2546,
            15862,
            1,
            4,
            5,
            2,
            3,
            61329,
            20799
          ],
          "allowed-ingress-identities": [
            1
          ],
          "build": 24,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "id": 9971,
          "l4": {
            "egress": [],
            "ingress": [
              {
                "derived-from-rules": [
                  []
                ],
                "rule": {
            "l7-rules": [
              {
                "any.id.app1=": {
                  "http": [
                    {
                      "method": "GET",
                      "path": "/public"
                    }
                  ]
                }
              },
              {
                "reserved.host=": {
                  "http": [
                    {
                      "method": "GET",
                      "path": "/public"
                    }
                  ]
                }
              }
            ],
            "port": 80,
            "protocol": "TCP"
          }
              },
              {
                "derived-from-rules": [
                  []
                ],
                "rule": {
            "l7-rules": [
              {
                "any.id.app1=": {
                  "http": [
                    {
                      "method": "GET",
                      "path": "/public"
                    }
                  ]
                }
              },
              {
                "reserved.host=": {
                  "http": [
                    {
                      "method": "GET",
                      "path": "/public"
                    }
                  ]
                }
              }
            ],
            "port": 8080,
            "protocol": "TCP"
          }
              }
            ]
          },
          "policy-enabled": "ingress",
          "policy-revision": 24
        }
      },
      "realized": {
        "label-configuration": {
          "user": []
        },
        "options": {
          "Conntrack": "Enabled",
          "ConntrackAccounting": "Enabled",
          "ConntrackLocal": "Disabled",
          "Debug": "Enabled",
          "DebugLB": "Disabled",
          "DropNotification": "Enabled",
          "EgressPolicy": "Disabled",
          "IngressPolicy": "Enabled",
          "NAT46": "Disabled",
          "TraceNotification": "Enabled"
        }
      },
      "state": "not-ready"
    }
  }
]

BPF Policy Get 39028

DIRECTION   LABELS (source:key[=value])   PORT/PROTO   BYTES   PACKETS   
Ingress     container:id.app1             80/TCP       0       0         
Egress      reserved:init                 ANY          0       0         
Ingress     container:id.app1             8080/TCP     0       0         
Egress      container:id.httpd3           ANY          0       0         
            container:id.service1                                             
Egress      container:id.httpd2           ANY          0       0         
            container:id.service1                                             
Egress      container:id.client           ANY          0       0         
Egress      reserved:world                ANY          0       0         
Egress      container:id.app2             ANY          0       0         
Egress      container:id.httpd1           ANY          0       0         
            container:id.service1                                             
Egress      reserved:host                 ANY          0       0         
Ingress     reserved:host                 ANY          0       0         
Egress      container:id.app3             ANY          0       0         
Egress      reserved:cluster              ANY          0       0         
Egress      container:id.app1             ANY          0       0         
Egress      reserved:health               ANY          0       0         

All logs attached.
9acb0442_RuntimeValidatedLB_Services_Policies_tests_with_service_performing_L4_port_mapping.zip

Another issue:

Endpoint Health 17455

Overall Health:   Warning
BPF Health:       Warning
Policy Health:    Warning
Connected:        yes

Endpoint Log 17455

Timestamp              Status    State                   Message
2018-06-01T13:24:17Z   Failure   ready                   Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-06-01T13:24:17Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:24:07Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:24:07Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-01T13:24:01Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:24:01Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:24:01Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:24:01Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:57Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-01T13:23:46Z   OK        ready                   Endpoint policy update skipped because no changes were needed
2018-06-01T13:23:44Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:44Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:42Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:42Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:34Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:34Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:31Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:31Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:25Z   OK        ready                   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:25Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:25Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:25Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:24Z   OK        waiting-to-regenerate   Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:24Z   Warning   waiting-to-regenerate   Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:22Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:22Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:21Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:21Z   OK        regenerating            Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:21Z   OK        waiting-to-regenerate   Successfully regenerated endpoint program due to updated security labels
2018-06-01T13:23:21Z   Warning   waiting-to-regenerate   Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:19Z   Warning   waiting-to-regenerate   Skipped invalid state transition to waiting-to-regenerate due to: Triggering regeneration due to new identity
2018-06-01T13:23:19Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:19Z   OK        regenerating            Regenerating Endpoint BPF: updated security labels
2018-06-01T13:23:19Z   OK        waiting-to-regenerate   Triggering regeneration due to new identity
2018-06-01T13:23:19Z   OK        ready                   Set identity for this endpoint

https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2663/
5c6211b5_RuntimeValidatedLB_Services_Policies_tests_with_conntrack_disabled.zip

 2018-06-01T16:38:38.319642482Z level=warning msg="NACK received for version 48; waiting for a version update before sending again" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=47

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-K8s/291/testReport/k8s-1/9/K8sValidatedUpdates_Updating_Cilium_stable_to_master/

One significant piece of info here is that the error happens with NetworkPolicyHosts i.e., CIDR/policyID map update:

    2018-06-01T16:38:38.319546387Z level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/cilium.NetworkPolicyHosts rejected: NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1\" subsys=envoy-config thread=374"

NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1

cilium bpf ipcache list shows this:

10.10.0.44/32              1          

So the mapping to 4 is an inconsistency in how we update xDS cache, apparently the address was previously mapped to 4 and then to 1, but that update was not properly done on the xDS cache.

This IP is the node IP:

    2018-06-01T16:38:32.10583474Z level=info msg="  Internal-Node IPv4: 10.10.0.44"

Which first has the ID 1 (HOST):

    2018-06-01T16:38:33.305888014Z level=debug msg="daemon notified of IP-Identity cache state change" identity=host ipAddr=10.10.0.44 ipMask="<nil>" modification=Upsert

Then etcd tells us the ID is 4:

2018-06-01T16:38:33.314993027Z level=debug msg="List response from etcd len=36: &{Header:cluster_id:3373127551666285087 member_id:13506963981489885289 revision:1206 raft_term:3  Kvs:[
...
{\\\"IP\\\":\\\"10.10.0.44\\\",\\\"Mask\\\":null,\\\"ID\\\":4,
    2018-06-01T16:38:33.322006363Z level=debug msg="received event" key=cilium/state/ip/v1/default/10.10.0.44 kvstore-event=create
    2018-06-01T16:38:33.322050482Z level=debug msg="Removing from ipcache layer" ipAddr=10.10.0.44
    2018-06-01T16:38:33.322069008Z level=debug msg="Upserting into ipcache layer" identity=health ipAddr=10.10.0.44
    2018-06-01T16:38:33.322511441Z level=debug msg="endpoint IP cache state change" cached-identity=host identity=health ipAddr=10.10.0.44 ipMask="<nil>" modification=Upsert
    2018-06-01T16:38:33.322525425Z level=debug msg="1 resource names requested, filtering resources" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
    2018-06-01T16:38:33.322529765Z level=debug msg="resource found, last modified in version 4" xdsClientNode="<nil>" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
    2018-06-01T16:38:33.322533424Z level=debug msg="returning 1 resources out of 1 requested" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
    2018-06-01T16:38:33.322547617Z level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=16
    2018-06-01T16:38:33.322577082Z level=debug msg="updating resource in cache" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=16
    2018-06-01T16:38:33.322583364Z level=debug msg="committing cache transaction and notifying of new version" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=16

Next we use this address for cilium-health:

    2018-06-01T16:38:34.84656439Z level=info msg="Adding route" command="ip route add 10.10.0.44/32 dev cilium" netns=cilium-health
    2018-06-01T16:38:34.84663536Z level=info msg="Adding route" command="ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium" netns=cilium-health
    2018-06-01T16:38:34.846640304Z level=info msg="Adding route" command="ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium" netns=cilium-health
    2018-06-01T16:38:34.846668631Z level=info msg="Adding route" command="ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium" netns=cilium-health
    2018-06-01T16:38:34.846706547Z level=debug msg="Running \"ip [netns exec cilium-health bash -c ip route add 10.10.0.44/32 dev cilium && ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium && ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium && ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium]\""

And then this happens:

    2018-06-01T16:38:38.316456561Z level=debug msg="Adding special identity to ipcache" ipAddr=10.10.0.44
    2018-06-01T16:38:38.316834134Z level=debug msg="Removing from ipcache layer" ipAddr=10.10.0.44
    2018-06-01T16:38:38.316931843Z level=debug msg="Upserting into ipcache layer" identity=host ipAddr=10.10.0.44
    2018-06-01T16:38:38.316984338Z level=debug msg="1 resource names requested, filtering resources" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
    2018-06-01T16:38:38.31703281Z level=debug msg="resource found, last modified in version 16" xdsClientNode="<nil>" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
    2018-06-01T16:38:38.317053351Z level=debug msg="returning 1 resources out of 1 requested" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
    2018-06-01T16:38:38.317105577Z level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
    2018-06-01T16:38:38.317145543Z level=debug msg="updating resource in cache" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
    2018-06-01T16:38:38.317181418Z level=debug msg="committing cache transaction and notifying of new version" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
    2018-06-01T16:38:38.317224729Z level=debug msg="daemon notified of IP-Identity cache state change" identity=host ipAddr=10.10.0.44 ipMask="<nil>" modification=Upsert
    2018-06-01T16:38:38.317368896Z level=debug msg="getting 0 resources from set" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=0xc421dfd158
    2018-06-01T16:38:38.317429352Z level=debug msg="no resource names requested, returning all 11 resources" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=0xc421dfd158
    2018-06-01T16:38:38.317608525Z level=debug msg="sending xDS response with 11 resources" xdsCanary=false xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
    2018-06-01T16:38:38.319478263Z level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/cilium.NetworkPolicyHosts update rejected: NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1\" subsys=envoy-upstream thread=374"
    2018-06-01T16:38:38.319546387Z level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/cilium.NetworkPolicyHosts rejected: NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1\" subsys=envoy-config thread=374"
    2018-06-01T16:38:38.31955499Z level=debug msg="level=warning msg=\"cilium_host_map.cc:186] Bad NetworkPolicyHosts Configuration\" subsys=envoy-config thread=374"
    2018-06-01T16:38:38.319567388Z level=debug msg="received request from xDS stream" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=47
    2018-06-01T16:38:38.319642482Z level=warning msg="NACK received for version 48; waiting for a version update before sending again" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=47

@joestringer Need your view on this, may relate to how "health" is reached and/or to other recent host/health reserved ID changes (what does '4' stand for?).

@jrajahalme 4 is the reserved identity for reserved:health.

So we try to associate the host IP to both 1 (reserved:host) and 4 (reserved:health). Which kind of makes sense, since they both use the same IP.

There's no easy solution here. :/

One question - do we need to send the health IP to XDS? When will a user configure policy for health vs. the host?

It鈥檒l be needed to enforce policies to/from health daemon.

Or we can decide that it should get identity 1 / host instead of a custom id.

Note that bpf ipcache also can only have one mapping for an IP address. In this case it is 1 (see comment above).

It鈥檒l be needed to enforce policies to/from health daemon.

Is this a valid use case? What I'm saying is that since health is a 'special' endpoint, maybe we can have special checks for it by not allowing policy to apply to it, and not to send it to the XDS cache either.

Yes it鈥檚 a valid use case. We鈥檒l want to exercise policy enforcement to check the health of the Cilium agent.

But my point is that we don鈥檛 necessarily need to give the health daemon ID 4.
We could give it ID 1.

The health endpoint has its own IP, why do we associate it with the host IP?

Note that in the section above, I quote:

Next we use this address for cilium-health:
2018-06-01T16:38:34.84656439Z level=info msg="Adding route" command="ip route add 10.10.0.44/32 dev cilium" netns=cilium-health 2018-06-01T16:38:34.84663536Z level=info msg="Adding route" command="ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium" netns=cilium-health 2018-06-01T16:38:34.846640304Z level=info msg="Adding route" command="ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium" netns=cilium-health 2018-06-01T16:38:34.846668631Z level=info msg="Adding route" command="ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium" netns=cilium-health 2018-06-01T16:38:34.846706547Z level=debug msg="Running \"ip [netns exec cilium-health bash -c ip route add 10.10.0.44/32 dev cilium && ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium && ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium && ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium]\""

This is configuring the routes inside the cilium-health endpoint, not the IP address inside.

From here: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-K8s/291/testReport/k8s-1/9/K8sValidatedUpdates_Updating_Cilium_stable_to_master/

I see different health EP IPs:

    2018-06-01T16:38:32.105826332Z level=info msg="Addressing information:"
    2018-06-01T16:38:32.1058285Z level=info msg="  Local node-name: k8s1"
    2018-06-01T16:38:32.105830645Z level=info msg="  Node-IPv6: fd04::11"
    2018-06-01T16:38:32.105832676Z level=info msg="  External-Node IPv4: 192.168.36.11"
    2018-06-01T16:38:32.10583474Z level=info msg="  Internal-Node IPv4: 10.10.0.44"
    2018-06-01T16:38:32.105836805Z level=info msg="  Cluster IPv6 prefix: f00d::/64"
    2018-06-01T16:38:32.105838889Z level=info msg="  Cluster IPv4 prefix: 10.0.0.0/8"
    2018-06-01T16:38:32.105849872Z level=info msg="  IPv6 node prefix: f00d::a0a:0:0:0/96"
    2018-06-01T16:38:32.105852251Z level=info msg="  IPv6 allocation prefix: f00d::a0a:0:0:0/112"
    2018-06-01T16:38:32.105854345Z level=info msg="  IPv4 allocation prefix: 10.10.0.0/24"
    2018-06-01T16:38:32.105856358Z level=info msg="  IPv6 router address: f00d::a0a:0:0:8ad6"
    2018-06-01T16:38:32.105858374Z level=info msg="  Loopback IPv4: 10.10.0.55"
...
 2018-06-01T16:38:33.314993027Z level=debug msg="List response from etcd len=36: &{Header:cluster_id:3373127551666285087 member_id:13506963981489885289 revision:1206 raft_term:3  Kvs:[key:\"cilium/state/ip/v1/default/10.10.0.104\" create_revision:175 mod_revision:800 version:5 value:\"{\\\"IP\\\":\\\"10.10.0.104\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:53091\\\"}\" lease:5794272054943966973  key:\"cilium/state/ip/v1/default/10.10.0.142\" create_revision:1158 mod_revision:1158 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.142\\\",\\\"ID\\\":573,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48896\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/10.10.0.143\" create_revision:966 mod_revision:966 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.143\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:56210\\\"}\" lease:5794272054943968127  key:\"cilium/state/ip/v1/default/10.10.0.172\" create_revision:1120 mod_revision:1120 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.172\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29898\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/10.10.0.207\" create_revision:838 mod_revision:853 version:2 value:\"{\\\"IP\\\":\\\"10.10.0.207\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:61372\\\"}\" lease:5794272054943966973  key:\"cilium/state/ip/v1/default/10.10.0.30\" create_revision:833 mod_revision:845 version:2 value:\"{\\\"IP\\\":\\\"10.10.0.30\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48388\\\"}\" lease:5794272054943966973  key:\"cilium/state/ip/v1/default/10.10.0.32\" create_revision:1146 mod_revision:1146 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.32\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:13949\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/10.10.0.44\" create_revision:867 mod_revision:867 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.44\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:60355\\\"}\" lease:5794272054943967976  key:\"cilium/state/ip/v1/default/10.10.0.59\" create_revision:1153 mod_revision:1153 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.59\\\",\\\"ID\\\":2377,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29381\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/10.10.0.66\" create_revision:1144 mod_revision:1144 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.66\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:173\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/10.10.1.104\" create_revision:1186 mod_revision:1186 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.104\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:19262\\\"}\" lease:5794272054943968492  key:\"cilium/state/ip/v1/default/10.10.1.127\" create_revision:1002 mod_revision:1002 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.127\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:63102\\\"}\" lease:5794272054943968142  key:\"cilium/state/ip/v1/default/10.10.1.130\" create_revision:835 mod_revision:847 version:2 value:\"{\\\"IP\\\":\\\"10.10.1.130\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:4057\\\"}\" lease:5794272054943966988  key:\"cilium/state/ip/v1/default/10.10.1.150\" create_revision:915 mod_revision:915 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.150\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22375\\\"}\" lease:5794272054943968050  key:\"cilium/state/ip/v1/default/10.10.1.172\" create_revision:1117 mod_revision:1117 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.172\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:29898\\\"}\" lease:5794272054943968322  key:\"cilium/state/ip/v1/default/10.10.1.221\" create_revision:1072 mod_revision:1072 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.221\\\",\\\"ID\\\":36464,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22464\\\"}\" lease:5794272054943968322  key:\"cilium/state/ip/v1/default/10.10.1.3\" create_revision:209 mod_revision:802 version:5 value:\"{\\\"IP\\\":\\\"10.10.1.3\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:54633\\\"}\" lease:5794272054943966988  key:\"cilium/state/ip/v1/default/10.10.1.81\" create_revision:837 mod_revision:857 version:2 value:\"{\\\"IP\\\":\\\"10.10.1.81\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:45509\\\"}\" lease:5794272054943966988  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:367d\" create_revision:1147 mod_revision:1147 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:367d\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:13949\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:72c5\" create_revision:1152 mod_revision:1152 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:72c5\\\",\\\"ID\\\":2377,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29381\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:74ca\" create_revision:1121 mod_revision:1121 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:74ca\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29898\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:ad\" create_revision:1143 mod_revision:1143 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:ad\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:173\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:bd04\" create_revision:832 mod_revision:846 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:bd04\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48388\\\"}\" lease:5794272054943966973  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:bf00\" create_revision:1159 mod_revision:1159 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:bf00\\\",\\\"ID\\\":573,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48896\\\"}\" lease:5794272054943968302  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:cf63\" create_revision:179 mod_revision:801 version:5 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:cf63\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:53091\\\"}\" lease:5794272054943966973  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:db92\" create_revision:967 mod_revision:967 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:db92\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:56210\\\"}\" lease:5794272054943968127  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:ebc3\" create_revision:868 mod_revision:868 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:ebc3\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:60355\\\"}\" lease:5794272054943967976  key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:efbc\" create_revision:839 mod_revision:854 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:efbc\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:61372\\\"}\" lease:5794272054943966973  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:4b3e\" create_revision:1187 mod_revision:1187 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:4b3e\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:19262\\\"}\" lease:5794272054943968492  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:5767\" create_revision:916 mod_revision:916 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:5767\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22375\\\"}\" lease:5794272054943968050  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:57c0\" create_revision:1071 mod_revision:1071 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:57c0\\\",\\\"ID\\\":36464,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22464\\\"}\" lease:5794272054943968322  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:74ca\" create_revision:1118 mod_revision:1118 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:74ca\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:29898\\\"}\" lease:5794272054943968322  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:b1c5\" create_revision:836 mod_revision:856 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:b1c5\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:45509\\\"}\" lease:5794272054943966988  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:d569\" create_revision:208 mod_revision:803 version:5 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:d569\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:54633\\\"}\" lease:5794272054943966988  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:f67e\" create_revision:1003 mod_revision:1003 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:f67e\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:63102\\\"}\" lease:5794272054943968142  key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:fd9\" create_revision:834 mod_revision:848 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:fd9\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:4057\\\"}\" lease:5794272054943966988 ] More:false Count:36}" prefix=cilium/state/ip/v1 revision=1206 watcher=endpointIPWatcher
...
    2018-06-01T16:38:33.556239149Z level=debug msg="IPv4 health endpoint address: 10.10.0.214"
    2018-06-01T16:38:33.556307165Z level=debug msg="IPv6 health endpoint address: f00d::a0a:0:0:b60e"

Looks to me like Cilium was restarted, and upon restore, we are getting the old health endpoint with an IP that overlaps with the new daemon run's internal IP allocation, Cilium restores that, which conflicts, then subsequently it actually allocates the IP for the new cilium-health endpoint and proceeds to stand it up.

Specifically of interest in the middle part of the above output:

key:\"cilium/state/ip/v1/default/10.10.0.44\" create_revision:867 mod_revision:867 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.44\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:60355\\\"}\" lease:5794272054943967976  

It doesn't seem like any endpoint is restored with IP 10.10.0.44:

$ grep -r 'Restoring endpoint from' cilium-*-logs.log 
cilium-cq5kn-logs.log:2018-06-01T16:38:32.097187496Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=893d59bbb5 endpointID=13949 endpointState=restoring ipv4=10.10.0.32 ipv6="f00d::a0a:0:0:367d" k8sPodName=":" policyRevision=0 type=0
cilium-cq5kn-logs.log:2018-06-01T16:38:32.099647488Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=730a320779 endpointID=173 endpointState=restoring ipv4=10.10.0.66 ipv6="f00d::a0a:0:0:ad" k8sPodName=":" policyRevision=0 type=0
cilium-cq5kn-logs.log:2018-06-01T16:38:32.101237381Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=ad1542df90 endpointID=29381 endpointState=restoring ipv4=10.10.0.59 ipv6="f00d::a0a:0:0:72c5" k8sPodName=":" policyRevision=0 type=0
cilium-cq5kn-logs.log:2018-06-01T16:38:32.103018867Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=3c7ac10336 endpointID=48896 endpointState=restoring ipv4=10.10.0.142 ipv6="f00d::a0a:0:0:bf00" k8sPodName=":" policyRevision=0 type=0
cilium-x7gfd-logs.log:2018-06-01T16:38:31.234602654Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=d6a3684407 endpointID=22464 endpointState=restoring ipv4=10.10.1.221 ipv6="f00d::a0a:100:0:57c0" k8sPodName=":" policyRevision=0 type=0

@rlenglet Perhaps even earlier? I believe that the output I posted above is the output from etcd. Something inserted it there, my guess is that it was a previous iteration of the daemon that allocated that IP to the health EP and pushed it in.

EDIT: Oh, it won't successfully restore that IP because it was already allocated for the internal IP before the restore happens.

Nope, we don't see any conflict about reallocation during endpoint restoration:

$ grep -r 'unable to reallocate IPv4 address' * | wc -l
       0
$ grep -r 'Failed to re-allocate IP of endpoint' * | wc -l
       0

The cilium_health is ignored during restoration because its interface is gone:

$ grep 'could not be found for endpoint being restored' cilium-cq5kn-logs.log  
cilium-cq5kn-logs.log:2018-06-01T16:38:32.10207021Z level=info msg="Interface cilium_health could not be found for endpoint being restored, ignoring" endpointID=29898

When this message is logged, the IP addresses of the endpoint are _not_ re-allocated:

$ grep 'Restoring endpoint' cilium-cq5kn-logs.log | grep 29898 | wc -l
       0

So there is no issue with endpoint restoration, because the health daemon is never restored.
The problem must be somewhere else.

OK, the bug is in syncLXCMap().
In case the mapping for a reserved identity is changed, we call OnIPIdentityCacheChange with oldIPIDPair set to nil, so the old mapping is never deleted, which results in duplicate entries:
https://github.com/cilium/cilium/blob/master/daemon/daemon.go#L1014

listener.OnIPIdentityCacheChange(ipcache.Upsert, nil, pair)

There are only 2 places where we call OnIPIdentityCacheChange, and the other one is OK.

The culprit PR is https://github.com/cilium/cilium/pull/4172 / https://github.com/cilium/cilium/commit/8b64ca827b78a9787015f29cd075d04906068f0c.

I'll send a PR shortly.

Note that my PR will only solve the duplicate IDs for the host IP.
It won't solve the problem with conflicting redirect ports, which was the original issue.

Was this page helpful?
0 / 5 - 0 ratings