Build link: https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2628/testReport/junit/(root)/runtime/RuntimeValidatedPolicies_L7_Checks/
Stacktrace:
/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:373
Expected
<*errors.errorString | 0xc4201985a0>: {
s: "Timeout reached: could not import policy",
}
to be nil
/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/vendor/github.com/onsi/gomega/internal/assertion/assertion.go:27
Stdout:
STEP: Setting PolicyEnforcement=default
锟絒1mSTEP锟絒0m: Setting up policy: /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l7-simple.json
STEP: Simple Ingress
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Simple Egress
STEP: Client "app2" attempting to "curl public URL on" "httpd2"
STEP: Client "app2" attempting to "curl public URL on" "httpd2"
STEP: Client "app2" attempting to "curl private URL on" "httpd2"
STEP: Client "app2" attempting to "curl private URL on" "httpd2"
STEP: Disabling all the policies. All should work
STEP: Client "app1" attempting to "ping" "httpd1"
STEP: Client "app1" attempting to "ping" "httpd1"
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl public URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app1" attempting to "curl private URL on" "httpd1"
STEP: Client "app2" attempting to "ping" "httpd1"
STEP: Client "app2" attempting to "ping" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Client "app2" attempting to "curl public URL on" "httpd1"
STEP: Client "app2" attempting to "curl private URL on" "httpd1"
STEP: Client "app2" attempting to "curl private URL on" "httpd1"
STEP: Multiple Ingress
锟絒1mSTEP锟絒0m: Setting up policy: /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l7-multiple.json
===================== TEST FAILED =====================
cmd: sudo cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
10572 Disabled Disabled 4 reserved:health f00d::a0f:0:0:294c 10.15.91.32 ready
16275 Enabled Disabled 48404 container:id.httpd2 f00d::a0f:0:0:3f93 10.15.30.223 ready
container:id.service1
31770 Disabled Disabled 7322 container:id.app1 f00d::a0f:0:0:7c1a 10.15.25.236 ready
40645 Disabled Disabled 47704 container:id.app3 f00d::a0f:0:0:9ec5 10.15.169.53 ready
42753 Disabled Enabled 300 container:id.app2 f00d::a0f:0:0:a701 10.15.128.220 not-ready
46043 Enabled Disabled 5899 container:id.httpd1 f00d::a0f:0:0:b3db 10.15.200.94 not-ready
container:id.service1
51876 Disabled Disabled 49623 container:id.httpd3 f00d::a0f:0:0:caa4 10.15.244.249 ready
container:id.service1
===================== EXITING REPORT GENERATION =====================
Logs:
9acb0442_RuntimeValidatedPolicies_L7_Checks.zip
Endpoints in not-ready state due to proxy-redirect configuration failure:
Timestamp Status State Message
2018-05-29T14:47:55Z Failure ready Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-05-29T14:47:55Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-29T14:47:45Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
...
cc: @rlenglet @jrajahalme
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-upstream thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-config thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="received request from xDS stream" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"cbd93a042e08d55c286949d7282eace2cba2836a/1.7.0-dev/Modified/RELEASE\" " xdsNonce=6 xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.api.v2.Listener xdsVersionInfo=6
May 29 14:47:45 runtime cilium-agent[8636]: level=warning msg="NACK received for version 10; waiting for a version update before sending again" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"cbd93a042e08d55c286949d7282eace2cba2836a/1.7.0-dev/Modified/RELEASE\" " xdsNonce=6 xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.api.v2.Listener xdsVersionInfo=6
Grepping for the port number:
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="Created new http proxy instance" id="42753:egress:TCP:8080" obj="&{ProxyPort:15714 endpointID:42753 id:42753:egress:TCP:8080 ingress:false localEndpoint:0xc420217680 parserType:http created:{wall:13743710510708103467 ext:831416539703 loc:0x3c788a0} implementation:0xc4243563e0 mutex:{internalRWMutex:{RWMutex:{mu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0}}}} lastUpdated:{wall:13743710510708103560 ext:831416539787 loc:0x3c788a0} rules:map[{LabelSelector:0xc420376c00}:{HTTP:[{Path:/public Method:GET Host: Headers:[]}] Kafka:[]}]}"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-upstream thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-config thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-upstream thread=20977"
May 29 14:47:45 runtime cilium-agent[8636]: level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener 42753:egress:TCP:8080: cannot bind '[::]:15714': Address already in use\" subsys=envoy-config thread=20977"
Seems like this was the first allocation of that port, at least in the logs available.
Bugtool newtstat--a.md shows nothing for port 15714.
This is a master build on which the last commit updated the kernel version via a new packer box version (77). Could it be possible that there is a race between the socket.Close() we call from cilium-agent and the socket actually being available for new bind() call from Envoy right after, even when both call sites set the SO_REUSEADDR socket option?
Has the OS been updated with the Vagrant box, besides the kernel? I would check that the range of local ports allocated by Linux hasn't changed.
sysctl net.ipv4.ip_local_port_range
If that overlaps now with the proxy redirect port range configured in Cilium, we need to change that range configuration.
I hit this issue locally and have the environment for it preserved.
@ianvernon I need to restart with the new box version.
In the meantime, could you please run sysctl net.ipv4.ip_local_port_range and post the output?
@rlenglet :
vagrant@runtime:~$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
Ok, that looks fine.
This was hit in a CI test run with debug logs enabled.
Link: https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2653/testReport/junit/(root)/runtime/RuntimeValidatedLB_Services_Policies_tests_with_service_performing_L4_port_mapping/
Stacktrace:
/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:373
Expected
<*errors.errorString | 0xc42037c3f0>: {
s: "Timeout reached: could not import policy",
}
to be nil
/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/vendor/github.com/onsi/gomega/internal/assertion/assertion.go:27
Endpoint was in not-ready state:
Setting up policy: /home/vagrant/go/src/github.com/cilium/cilium/test/runtime/manifests/Policies-l7-multiple.json
===================== TEST FAILED =====================
cmd: sudo cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
5194 Disabled Disabled 33846 container:id.app1 f00d::a0f:0:0:144a 10.15.236.185 ready
19041 Disabled Enabled 15862 container:id.app2 f00d::a0f:0:0:4a61 10.15.253.22 ready
33652 Disabled Disabled 4 reserved:health f00d::a0f:0:0:8374 10.15.175.138 ready
39028 Enabled Disabled 9971 container:id.httpd1 f00d::a0f:0:0:9874 10.15.11.153 not-ready
container:id.service1
56104 Disabled Disabled 20799 container:id.httpd3 f00d::a0f:0:0:db28 10.15.254.46 ready
container:id.service1
62489 Enabled Disabled 2546 container:id.httpd2 f00d::a0f:0:0:f419 10.15.191.26 ready
container:id.service1
63771 Disabled Disabled 61329 container:id.app3 f00d::a0f:0:0:f91b 10.15.90.89 ready
Logs: 9acb0442_RuntimeValidatedLB_Services_Policies_tests_with_service_performing_L4_port_mapping.zip
From the bugtool output in the attached logs, Debug is enabled for the daemon:
Conntrack Enabled
ConntrackAccounting Enabled
ConntrackLocal Disabled
Debug Enabled
DropNotification Enabled
PolicyTracing Disabled
TraceNotification Enabled
k8s-configuration
k8s-endpoint
PolicyEnforcement default
MonitorNumPages 64
From the endpoint log for the endpoint which was in not-ready state, proxy redirects were unable to be configured:
Timestamp Status State Message
2018-05-31T17:08:42Z Failure ready Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-05-31T17:08:42Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:32Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
...
Found another issue:
https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2653/
Timestamp Status State Message
2018-05-31T17:08:42Z Failure ready Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-05-31T17:08:42Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:32Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:32Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:08:31Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:08:31Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:29Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:29Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:08:20Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:08:20Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:17Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:17Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:08:10Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:08:10Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:08:08Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:08:08Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:59Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:59Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:57Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:57Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:55Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:55Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:53Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:53Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:44Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:44Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:42Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:42Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:35Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:35Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:35Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:35Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:33Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:33Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:32Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:32Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:22Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:22Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:20Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:20Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:18Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:18Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:16Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:16Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:07:07Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:07:07Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:07:05Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:07:05Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:58Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:06:58Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:58Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:06:58Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:58Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-05-31T17:06:58Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:58Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-05-31T17:06:57Z OK waiting-to-regenerate Successfully regenerated endpoint program due to updated security labels
2018-05-31T17:06:57Z Warning waiting-to-regenerate Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:56Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:56Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:55Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:54Z OK regenerating Regenerating Endpoint BPF: updated security labels
2018-05-31T17:06:54Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-05-31T17:06:54Z OK waiting-to-regenerate Successfully regenerated endpoint program due to updated security labels
2018-05-31T17:06:54Z Warning waiting-to-regenerate Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-05-31T17:06:53Z OK waiting-to-regenerate Triggering regeneration due to new identity
2018-05-31T17:06:52Z OK regenerating Regenerating Endpoint BPF: updated security labels
2018-05-31T17:06:52Z OK waiting-to-regenerate Triggering regeneration due to new identity
2018-05-31T17:06:52Z OK ready Set identity for this endpoint
[
{
"id": 39028,
"spec": {
"label-configuration": {
"user": []
},
"options": {
"Conntrack": "Enabled",
"ConntrackAccounting": "Enabled",
"ConntrackLocal": "Disabled",
"Debug": "Enabled",
"DebugLB": "Disabled",
"DropNotification": "Enabled",
"EgressPolicy": "Disabled",
"IngressPolicy": "Enabled",
"NAT46": "Disabled",
"TraceNotification": "Enabled"
}
},
"status": {
"controllers": [
{
"configuration": {
"error-retry": true,
"interval": "5m0s"
},
"name": "resolve-identity-39028",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-05-31T17:11:53.049Z",
"success-count": 2
},
"uuid": "03ed0715-64f5-11e8-b6bd-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "5m0s"
},
"name": "sync-IPv4-identity-mapping (39028)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-05-31T17:11:53.061Z",
"success-count": 2
},
"uuid": "03ed9611-64f5-11e8-b6bd-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "5m0s"
},
"name": "sync-IPv6-identity-mapping (39028)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-05-31T17:11:53.062Z",
"success-count": 2
},
"uuid": "03ed9724-64f5-11e8-b6bd-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "1m0s"
},
"name": "sync-identity-to-k8s-pod (39028)",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-05-31T17:13:53.049Z",
"success-count": 8
},
"uuid": "03ed94e5-64f5-11e8-b6bd-080027aa233b"
},
{
"configuration": {
"error-retry": true,
"interval": "1m0s"
},
"name": "sync-policymap-39028",
"status": {
"last-failure-timestamp": "0001-01-01T00:00:00.000Z",
"last-success-timestamp": "2018-05-31T17:14:32.859Z",
"success-count": 7
},
"uuid": "3f660389-64f5-11e8-b6bd-080027aa233b"
}
],
"external-identifiers": {
"container-id": "c4da6d8033619a83448bcc97cc935df4c1b86936008153d502b0ac28c6200148",
"container-name": "httpd1",
"docker-endpoint-id": "32531d626fc09c96cb85a7cf6eb86b6d897d52b4a78a3ddb527a3cc0c3ecc40f",
"docker-network-id": "30712b8c887fef9688f10ba73d00081a32b19ead87d5feaad282559314289c78",
"pod-name": ":"
},
"health": {
"bpf": "Warning",
"connected": true,
"overallHealth": "Warning",
"policy": "Warning"
},
"identity": {
"id": 9971,
"labels": [
"container:id.httpd1",
"container:id.service1"
],
"labelsSHA256": "66a311c75290026d8ed6035cdb9dabddbc9336152aa6aadbdbbbbbf5e0cca9f5"
},
"labels": {
"derived": [],
"disabled": [],
"realized": {
"user": []
},
"security-relevant": [
"container:id.httpd1",
"container:id.service1"
]
},
"log": [
{
"code": "Failure",
"message": "Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded",
"state": "ready",
"timestamp": "2018-05-31T17:08:42Z"
}
],
"networking": {
"addressing": [
{
"ipv4": "10.15.11.153",
"ipv6": "f00d::a0f:0:0:9874"
}
],
"host-mac": "02:c5:2c:41:4d:e8",
"interface-index": 65,
"interface-name": "lxc32531",
"mac": "b2:ea:f0:8d:66:03"
},
"policy": {
"proxy-statistics": [
{
"allocated-proxy-port": 15054,
"location": "ingress",
"port": 80,
"protocol": "http",
"statistics": {
"requests": {
"denied": 2,
"forwarded": 5,
"received": 7
},
"responses": {
"forwarded": 5,
"received": 5
}
}
},
{
"allocated-proxy-port": 11766,
"location": "ingress",
"port": 8080,
"protocol": "http",
"statistics": {
"requests": {},
"responses": {}
}
}
],
"realized": {
"allowed-egress-identities": [
2,
2546,
33846,
5,
20799,
3,
15862,
61329,
43088,
1,
4,
9971
],
"allowed-ingress-identities": [
1
],
"build": 24,
"cidr-policy": {
"egress": [],
"ingress": []
},
"id": 9971,
"l4": {
"egress": [],
"ingress": []
},
"policy-enabled": "ingress",
"policy-revision": 24
},
"spec": {
"allowed-egress-identities": [
43088,
9971,
33846,
2546,
15862,
1,
4,
5,
2,
3,
61329,
20799
],
"allowed-ingress-identities": [
1
],
"build": 24,
"cidr-policy": {
"egress": [],
"ingress": []
},
"id": 9971,
"l4": {
"egress": [],
"ingress": [
{
"derived-from-rules": [
[]
],
"rule": {
"l7-rules": [
{
"any.id.app1=": {
"http": [
{
"method": "GET",
"path": "/public"
}
]
}
},
{
"reserved.host=": {
"http": [
{
"method": "GET",
"path": "/public"
}
]
}
}
],
"port": 80,
"protocol": "TCP"
}
},
{
"derived-from-rules": [
[]
],
"rule": {
"l7-rules": [
{
"any.id.app1=": {
"http": [
{
"method": "GET",
"path": "/public"
}
]
}
},
{
"reserved.host=": {
"http": [
{
"method": "GET",
"path": "/public"
}
]
}
}
],
"port": 8080,
"protocol": "TCP"
}
}
]
},
"policy-enabled": "ingress",
"policy-revision": 24
}
},
"realized": {
"label-configuration": {
"user": []
},
"options": {
"Conntrack": "Enabled",
"ConntrackAccounting": "Enabled",
"ConntrackLocal": "Disabled",
"Debug": "Enabled",
"DebugLB": "Disabled",
"DropNotification": "Enabled",
"EgressPolicy": "Disabled",
"IngressPolicy": "Enabled",
"NAT46": "Disabled",
"TraceNotification": "Enabled"
}
},
"state": "not-ready"
}
}
]
DIRECTION LABELS (source:key[=value]) PORT/PROTO BYTES PACKETS
Ingress container:id.app1 80/TCP 0 0
Egress reserved:init ANY 0 0
Ingress container:id.app1 8080/TCP 0 0
Egress container:id.httpd3 ANY 0 0
container:id.service1
Egress container:id.httpd2 ANY 0 0
container:id.service1
Egress container:id.client ANY 0 0
Egress reserved:world ANY 0 0
Egress container:id.app2 ANY 0 0
Egress container:id.httpd1 ANY 0 0
container:id.service1
Egress reserved:host ANY 0 0
Ingress reserved:host ANY 0 0
Egress container:id.app3 ANY 0 0
Egress reserved:cluster ANY 0 0
Egress container:id.app1 ANY 0 0
Egress reserved:health ANY 0 0
All logs attached.
9acb0442_RuntimeValidatedLB_Services_Policies_tests_with_service_performing_L4_port_mapping.zip
Another issue:
Overall Health: Warning
BPF Health: Warning
Policy Health: Warning
Connected: yes
Timestamp Status State Message
2018-06-01T13:24:17Z Failure ready Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded
2018-06-01T13:24:17Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:24:07Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:24:07Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-01T13:24:01Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:24:01Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:24:01Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:24:01Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:57Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-01T13:23:46Z OK ready Endpoint policy update skipped because no changes were needed
2018-06-01T13:23:44Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:44Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:42Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:42Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:34Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:34Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:31Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:31Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:25Z OK ready Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:25Z OK ready Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:25Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:25Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:24Z OK waiting-to-regenerate Successfully regenerated endpoint program due to endpoint policy updated & changes were needed
2018-06-01T13:23:24Z Warning waiting-to-regenerate Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:22Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:22Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:21Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:21Z OK regenerating Regenerating Endpoint BPF: endpoint policy updated & changes were needed
2018-06-01T13:23:21Z OK waiting-to-regenerate Successfully regenerated endpoint program due to updated security labels
2018-06-01T13:23:21Z Warning waiting-to-regenerate Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests
2018-06-01T13:23:19Z Warning waiting-to-regenerate Skipped invalid state transition to waiting-to-regenerate due to: Triggering regeneration due to new identity
2018-06-01T13:23:19Z OK waiting-to-regenerate Triggering endpoint regeneration due to policy updates
2018-06-01T13:23:19Z OK regenerating Regenerating Endpoint BPF: updated security labels
2018-06-01T13:23:19Z OK waiting-to-regenerate Triggering regeneration due to new identity
2018-06-01T13:23:19Z OK ready Set identity for this endpoint
https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2663/
5c6211b5_RuntimeValidatedLB_Services_Policies_tests_with_conntrack_disabled.zip
2018-06-01T16:38:38.319642482Z level=warning msg="NACK received for version 48; waiting for a version update before sending again" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=47
One significant piece of info here is that the error happens with NetworkPolicyHosts i.e., CIDR/policyID map update:
2018-06-01T16:38:38.319546387Z level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/cilium.NetworkPolicyHosts rejected: NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1\" subsys=envoy-config thread=374"
NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1
cilium bpf ipcache list shows this:
10.10.0.44/32 1
So the mapping to 4 is an inconsistency in how we update xDS cache, apparently the address was previously mapped to 4 and then to 1, but that update was not properly done on the xDS cache.
This IP is the node IP:
2018-06-01T16:38:32.10583474Z level=info msg=" Internal-Node IPv4: 10.10.0.44"
Which first has the ID 1 (HOST):
2018-06-01T16:38:33.305888014Z level=debug msg="daemon notified of IP-Identity cache state change" identity=host ipAddr=10.10.0.44 ipMask="<nil>" modification=Upsert
Then etcd tells us the ID is 4:
2018-06-01T16:38:33.314993027Z level=debug msg="List response from etcd len=36: &{Header:cluster_id:3373127551666285087 member_id:13506963981489885289 revision:1206 raft_term:3 Kvs:[
...
{\\\"IP\\\":\\\"10.10.0.44\\\",\\\"Mask\\\":null,\\\"ID\\\":4,
2018-06-01T16:38:33.322006363Z level=debug msg="received event" key=cilium/state/ip/v1/default/10.10.0.44 kvstore-event=create
2018-06-01T16:38:33.322050482Z level=debug msg="Removing from ipcache layer" ipAddr=10.10.0.44
2018-06-01T16:38:33.322069008Z level=debug msg="Upserting into ipcache layer" identity=health ipAddr=10.10.0.44
2018-06-01T16:38:33.322511441Z level=debug msg="endpoint IP cache state change" cached-identity=host identity=health ipAddr=10.10.0.44 ipMask="<nil>" modification=Upsert
2018-06-01T16:38:33.322525425Z level=debug msg="1 resource names requested, filtering resources" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
2018-06-01T16:38:33.322529765Z level=debug msg="resource found, last modified in version 4" xdsClientNode="<nil>" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
2018-06-01T16:38:33.322533424Z level=debug msg="returning 1 resources out of 1 requested" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
2018-06-01T16:38:33.322547617Z level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=16
2018-06-01T16:38:33.322577082Z level=debug msg="updating resource in cache" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=16
2018-06-01T16:38:33.322583364Z level=debug msg="committing cache transaction and notifying of new version" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=16
Next we use this address for cilium-health:
2018-06-01T16:38:34.84656439Z level=info msg="Adding route" command="ip route add 10.10.0.44/32 dev cilium" netns=cilium-health
2018-06-01T16:38:34.84663536Z level=info msg="Adding route" command="ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium" netns=cilium-health
2018-06-01T16:38:34.846640304Z level=info msg="Adding route" command="ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium" netns=cilium-health
2018-06-01T16:38:34.846668631Z level=info msg="Adding route" command="ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium" netns=cilium-health
2018-06-01T16:38:34.846706547Z level=debug msg="Running \"ip [netns exec cilium-health bash -c ip route add 10.10.0.44/32 dev cilium && ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium && ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium && ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium]\""
And then this happens:
2018-06-01T16:38:38.316456561Z level=debug msg="Adding special identity to ipcache" ipAddr=10.10.0.44
2018-06-01T16:38:38.316834134Z level=debug msg="Removing from ipcache layer" ipAddr=10.10.0.44
2018-06-01T16:38:38.316931843Z level=debug msg="Upserting into ipcache layer" identity=host ipAddr=10.10.0.44
2018-06-01T16:38:38.316984338Z level=debug msg="1 resource names requested, filtering resources" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
2018-06-01T16:38:38.31703281Z level=debug msg="resource found, last modified in version 16" xdsClientNode="<nil>" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
2018-06-01T16:38:38.317053351Z level=debug msg="returning 1 resources out of 1 requested" xdsClientNode="<nil>" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo="<nil>"
2018-06-01T16:38:38.317105577Z level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
2018-06-01T16:38:38.317145543Z level=debug msg="updating resource in cache" xdsResourceName=1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
2018-06-01T16:38:38.317181418Z level=debug msg="committing cache transaction and notifying of new version" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
2018-06-01T16:38:38.317224729Z level=debug msg="daemon notified of IP-Identity cache state change" identity=host ipAddr=10.10.0.44 ipMask="<nil>" modification=Upsert
2018-06-01T16:38:38.317368896Z level=debug msg="getting 0 resources from set" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=0xc421dfd158
2018-06-01T16:38:38.317429352Z level=debug msg="no resource names requested, returning all 11 resources" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=0xc421dfd158
2018-06-01T16:38:38.317608525Z level=debug msg="sending xDS response with 11 resources" xdsCanary=false xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=48
2018-06-01T16:38:38.319478263Z level=debug msg="level=warning msg=\"external/envoy/source/common/config/grpc_mux_impl.cc:218] gRPC config for type.googleapis.com/cilium.NetworkPolicyHosts update rejected: NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1\" subsys=envoy-upstream thread=374"
2018-06-01T16:38:38.319546387Z level=debug msg="level=warning msg=\"bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/cilium.NetworkPolicyHosts rejected: NetworkPolicyHosts: Duplicate host entry '10.10.0.44' for policy 4, already mapped to 1\" subsys=envoy-config thread=374"
2018-06-01T16:38:38.31955499Z level=debug msg="level=warning msg=\"cilium_host_map.cc:186] Bad NetworkPolicyHosts Configuration\" subsys=envoy-config thread=374"
2018-06-01T16:38:38.319567388Z level=debug msg="received request from xDS stream" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=47
2018-06-01T16:38:38.319642482Z level=warning msg="NACK received for version 48; waiting for a version update before sending again" xdsClientNode="id:\"host~127.0.0.1~no-id~localdomain\" cluster:\"cluster1\" build_version:\"d9c3421c784e0874297301d77308a2bd784e6e7b/1.7.0-dev/Distribution/RELEASE\" " xdsNonce=2 xdsStreamID=2 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicyHosts xdsVersionInfo=47
@joestringer Need your view on this, may relate to how "health" is reached and/or to other recent host/health reserved ID changes (what does '4' stand for?).
@jrajahalme 4 is the reserved identity for reserved:health.
So we try to associate the host IP to both 1 (reserved:host) and 4 (reserved:health). Which kind of makes sense, since they both use the same IP.
There's no easy solution here. :/
One question - do we need to send the health IP to XDS? When will a user configure policy for health vs. the host?
It鈥檒l be needed to enforce policies to/from health daemon.
Or we can decide that it should get identity 1 / host instead of a custom id.
Note that bpf ipcache also can only have one mapping for an IP address. In this case it is 1 (see comment above).
It鈥檒l be needed to enforce policies to/from health daemon.
Is this a valid use case? What I'm saying is that since health is a 'special' endpoint, maybe we can have special checks for it by not allowing policy to apply to it, and not to send it to the XDS cache either.
Yes it鈥檚 a valid use case. We鈥檒l want to exercise policy enforcement to check the health of the Cilium agent.
But my point is that we don鈥檛 necessarily need to give the health daemon ID 4.
We could give it ID 1.
The health endpoint has its own IP, why do we associate it with the host IP?
Note that in the section above, I quote:
Next we use this address for cilium-health:
2018-06-01T16:38:34.84656439Z level=info msg="Adding route" command="ip route add 10.10.0.44/32 dev cilium" netns=cilium-health 2018-06-01T16:38:34.84663536Z level=info msg="Adding route" command="ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium" netns=cilium-health 2018-06-01T16:38:34.846640304Z level=info msg="Adding route" command="ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium" netns=cilium-health 2018-06-01T16:38:34.846668631Z level=info msg="Adding route" command="ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium" netns=cilium-health 2018-06-01T16:38:34.846706547Z level=debug msg="Running \"ip [netns exec cilium-health bash -c ip route add 10.10.0.44/32 dev cilium && ip route add 0.0.0.0/0 via 10.10.0.44 dev cilium && ip -6 route add f00d::a0a:0:0:8ad6/128 dev cilium && ip -6 route add ::/0 via f00d::a0a:0:0:8ad6 dev cilium]\""
This is configuring the routes inside the cilium-health endpoint, not the IP address inside.
I see different health EP IPs:
2018-06-01T16:38:32.105826332Z level=info msg="Addressing information:"
2018-06-01T16:38:32.1058285Z level=info msg=" Local node-name: k8s1"
2018-06-01T16:38:32.105830645Z level=info msg=" Node-IPv6: fd04::11"
2018-06-01T16:38:32.105832676Z level=info msg=" External-Node IPv4: 192.168.36.11"
2018-06-01T16:38:32.10583474Z level=info msg=" Internal-Node IPv4: 10.10.0.44"
2018-06-01T16:38:32.105836805Z level=info msg=" Cluster IPv6 prefix: f00d::/64"
2018-06-01T16:38:32.105838889Z level=info msg=" Cluster IPv4 prefix: 10.0.0.0/8"
2018-06-01T16:38:32.105849872Z level=info msg=" IPv6 node prefix: f00d::a0a:0:0:0/96"
2018-06-01T16:38:32.105852251Z level=info msg=" IPv6 allocation prefix: f00d::a0a:0:0:0/112"
2018-06-01T16:38:32.105854345Z level=info msg=" IPv4 allocation prefix: 10.10.0.0/24"
2018-06-01T16:38:32.105856358Z level=info msg=" IPv6 router address: f00d::a0a:0:0:8ad6"
2018-06-01T16:38:32.105858374Z level=info msg=" Loopback IPv4: 10.10.0.55"
...
2018-06-01T16:38:33.314993027Z level=debug msg="List response from etcd len=36: &{Header:cluster_id:3373127551666285087 member_id:13506963981489885289 revision:1206 raft_term:3 Kvs:[key:\"cilium/state/ip/v1/default/10.10.0.104\" create_revision:175 mod_revision:800 version:5 value:\"{\\\"IP\\\":\\\"10.10.0.104\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:53091\\\"}\" lease:5794272054943966973 key:\"cilium/state/ip/v1/default/10.10.0.142\" create_revision:1158 mod_revision:1158 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.142\\\",\\\"ID\\\":573,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48896\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/10.10.0.143\" create_revision:966 mod_revision:966 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.143\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:56210\\\"}\" lease:5794272054943968127 key:\"cilium/state/ip/v1/default/10.10.0.172\" create_revision:1120 mod_revision:1120 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.172\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29898\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/10.10.0.207\" create_revision:838 mod_revision:853 version:2 value:\"{\\\"IP\\\":\\\"10.10.0.207\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:61372\\\"}\" lease:5794272054943966973 key:\"cilium/state/ip/v1/default/10.10.0.30\" create_revision:833 mod_revision:845 version:2 value:\"{\\\"IP\\\":\\\"10.10.0.30\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48388\\\"}\" lease:5794272054943966973 key:\"cilium/state/ip/v1/default/10.10.0.32\" create_revision:1146 mod_revision:1146 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.32\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:13949\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/10.10.0.44\" create_revision:867 mod_revision:867 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.44\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:60355\\\"}\" lease:5794272054943967976 key:\"cilium/state/ip/v1/default/10.10.0.59\" create_revision:1153 mod_revision:1153 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.59\\\",\\\"ID\\\":2377,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29381\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/10.10.0.66\" create_revision:1144 mod_revision:1144 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.66\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:173\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/10.10.1.104\" create_revision:1186 mod_revision:1186 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.104\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:19262\\\"}\" lease:5794272054943968492 key:\"cilium/state/ip/v1/default/10.10.1.127\" create_revision:1002 mod_revision:1002 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.127\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:63102\\\"}\" lease:5794272054943968142 key:\"cilium/state/ip/v1/default/10.10.1.130\" create_revision:835 mod_revision:847 version:2 value:\"{\\\"IP\\\":\\\"10.10.1.130\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:4057\\\"}\" lease:5794272054943966988 key:\"cilium/state/ip/v1/default/10.10.1.150\" create_revision:915 mod_revision:915 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.150\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22375\\\"}\" lease:5794272054943968050 key:\"cilium/state/ip/v1/default/10.10.1.172\" create_revision:1117 mod_revision:1117 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.172\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:29898\\\"}\" lease:5794272054943968322 key:\"cilium/state/ip/v1/default/10.10.1.221\" create_revision:1072 mod_revision:1072 version:1 value:\"{\\\"IP\\\":\\\"10.10.1.221\\\",\\\"ID\\\":36464,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22464\\\"}\" lease:5794272054943968322 key:\"cilium/state/ip/v1/default/10.10.1.3\" create_revision:209 mod_revision:802 version:5 value:\"{\\\"IP\\\":\\\"10.10.1.3\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:54633\\\"}\" lease:5794272054943966988 key:\"cilium/state/ip/v1/default/10.10.1.81\" create_revision:837 mod_revision:857 version:2 value:\"{\\\"IP\\\":\\\"10.10.1.81\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:45509\\\"}\" lease:5794272054943966988 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:367d\" create_revision:1147 mod_revision:1147 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:367d\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:13949\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:72c5\" create_revision:1152 mod_revision:1152 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:72c5\\\",\\\"ID\\\":2377,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29381\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:74ca\" create_revision:1121 mod_revision:1121 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:74ca\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:29898\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:ad\" create_revision:1143 mod_revision:1143 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:ad\\\",\\\"ID\\\":21489,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:173\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:bd04\" create_revision:832 mod_revision:846 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:bd04\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48388\\\"}\" lease:5794272054943966973 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:bf00\" create_revision:1159 mod_revision:1159 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:bf00\\\",\\\"ID\\\":573,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:48896\\\"}\" lease:5794272054943968302 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:cf63\" create_revision:179 mod_revision:801 version:5 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:cf63\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:53091\\\"}\" lease:5794272054943966973 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:db92\" create_revision:967 mod_revision:967 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:db92\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:56210\\\"}\" lease:5794272054943968127 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:ebc3\" create_revision:868 mod_revision:868 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:ebc3\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:60355\\\"}\" lease:5794272054943967976 key:\"cilium/state/ip/v1/default/f00d::a0a:0:0:efbc\" create_revision:839 mod_revision:854 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:0:0:efbc\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:61372\\\"}\" lease:5794272054943966973 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:4b3e\" create_revision:1187 mod_revision:1187 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:4b3e\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:19262\\\"}\" lease:5794272054943968492 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:5767\" create_revision:916 mod_revision:916 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:5767\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22375\\\"}\" lease:5794272054943968050 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:57c0\" create_revision:1071 mod_revision:1071 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:57c0\\\",\\\"ID\\\":36464,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:22464\\\"}\" lease:5794272054943968322 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:74ca\" create_revision:1118 mod_revision:1118 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:74ca\\\",\\\"ID\\\":59974,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:29898\\\"}\" lease:5794272054943968322 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:b1c5\" create_revision:836 mod_revision:856 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:b1c5\\\",\\\"Mask\\\":null,\\\"ID\\\":57421,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:45509\\\"}\" lease:5794272054943966988 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:d569\" create_revision:208 mod_revision:803 version:5 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:d569\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:54633\\\"}\" lease:5794272054943966988 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:f67e\" create_revision:1003 mod_revision:1003 version:1 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:f67e\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:63102\\\"}\" lease:5794272054943968142 key:\"cilium/state/ip/v1/default/f00d::a0a:100:0:fd9\" create_revision:834 mod_revision:848 version:2 value:\"{\\\"IP\\\":\\\"f00d::a0a:100:0:fd9\\\",\\\"Mask\\\":null,\\\"ID\\\":45373,\\\"Metadata\\\":\\\"cilium-global:default:k8s2:4057\\\"}\" lease:5794272054943966988 ] More:false Count:36}" prefix=cilium/state/ip/v1 revision=1206 watcher=endpointIPWatcher
...
2018-06-01T16:38:33.556239149Z level=debug msg="IPv4 health endpoint address: 10.10.0.214"
2018-06-01T16:38:33.556307165Z level=debug msg="IPv6 health endpoint address: f00d::a0a:0:0:b60e"
Looks to me like Cilium was restarted, and upon restore, we are getting the old health endpoint with an IP that overlaps with the new daemon run's internal IP allocation, Cilium restores that, which conflicts, then subsequently it actually allocates the IP for the new cilium-health endpoint and proceeds to stand it up.
Specifically of interest in the middle part of the above output:
key:\"cilium/state/ip/v1/default/10.10.0.44\" create_revision:867 mod_revision:867 version:1 value:\"{\\\"IP\\\":\\\"10.10.0.44\\\",\\\"Mask\\\":null,\\\"ID\\\":4,\\\"Metadata\\\":\\\"cilium-global:default:k8s1:60355\\\"}\" lease:5794272054943967976
It doesn't seem like any endpoint is restored with IP 10.10.0.44:
$ grep -r 'Restoring endpoint from' cilium-*-logs.log
cilium-cq5kn-logs.log:2018-06-01T16:38:32.097187496Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=893d59bbb5 endpointID=13949 endpointState=restoring ipv4=10.10.0.32 ipv6="f00d::a0a:0:0:367d" k8sPodName=":" policyRevision=0 type=0
cilium-cq5kn-logs.log:2018-06-01T16:38:32.099647488Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=730a320779 endpointID=173 endpointState=restoring ipv4=10.10.0.66 ipv6="f00d::a0a:0:0:ad" k8sPodName=":" policyRevision=0 type=0
cilium-cq5kn-logs.log:2018-06-01T16:38:32.101237381Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=ad1542df90 endpointID=29381 endpointState=restoring ipv4=10.10.0.59 ipv6="f00d::a0a:0:0:72c5" k8sPodName=":" policyRevision=0 type=0
cilium-cq5kn-logs.log:2018-06-01T16:38:32.103018867Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=3c7ac10336 endpointID=48896 endpointState=restoring ipv4=10.10.0.142 ipv6="f00d::a0a:0:0:bf00" k8sPodName=":" policyRevision=0 type=0
cilium-x7gfd-logs.log:2018-06-01T16:38:31.234602654Z level=debug msg="Restoring endpoint from previous cilium instance" code=OK containerID=d6a3684407 endpointID=22464 endpointState=restoring ipv4=10.10.1.221 ipv6="f00d::a0a:100:0:57c0" k8sPodName=":" policyRevision=0 type=0
@rlenglet Perhaps even earlier? I believe that the output I posted above is the output from etcd. Something inserted it there, my guess is that it was a previous iteration of the daemon that allocated that IP to the health EP and pushed it in.
EDIT: Oh, it won't successfully restore that IP because it was already allocated for the internal IP before the restore happens.
Nope, we don't see any conflict about reallocation during endpoint restoration:
$ grep -r 'unable to reallocate IPv4 address' * | wc -l
0
$ grep -r 'Failed to re-allocate IP of endpoint' * | wc -l
0
The cilium_health is ignored during restoration because its interface is gone:
$ grep 'could not be found for endpoint being restored' cilium-cq5kn-logs.log
cilium-cq5kn-logs.log:2018-06-01T16:38:32.10207021Z level=info msg="Interface cilium_health could not be found for endpoint being restored, ignoring" endpointID=29898
When this message is logged, the IP addresses of the endpoint are _not_ re-allocated:
$ grep 'Restoring endpoint' cilium-cq5kn-logs.log | grep 29898 | wc -l
0
So there is no issue with endpoint restoration, because the health daemon is never restored.
The problem must be somewhere else.
OK, the bug is in syncLXCMap().
In case the mapping for a reserved identity is changed, we call OnIPIdentityCacheChange with oldIPIDPair set to nil, so the old mapping is never deleted, which results in duplicate entries:
https://github.com/cilium/cilium/blob/master/daemon/daemon.go#L1014
listener.OnIPIdentityCacheChange(ipcache.Upsert, nil, pair)
There are only 2 places where we call OnIPIdentityCacheChange, and the other one is OK.
The culprit PR is https://github.com/cilium/cilium/pull/4172 / https://github.com/cilium/cilium/commit/8b64ca827b78a9787015f29cd075d04906068f0c.
I'll send a PR shortly.
Note that my PR will only solve the duplicate IDs for the host IP.
It won't solve the problem with conflicting redirect ports, which was the original issue.