Traffic to a pod stops when the pod has the same IP address as another pod with state Completed.
When I look at "target.addr":"100.96.43.254:8080" from the logs and run kubectl get po -A -owide | grep 100.96.43.254, I get two pods with the same IP address:
default h-cortex-injixo-6c4f975874-n47fb 3/3 Running 0 40m 100.96.43.254 ip-172-21-2-50.eu-west-1.compute.internal <none> <none>
default h-custom-integrations-interflex-month-balances-1608003600-42242 0/1 Completed 0 31h 100.96.43.254 ip-172-21-2-50.eu-west-1.compute.internal <none> <none>
See: https://github.com/kubernetes/kubernetes/issues/92697
When I delete one of the pods, the problem goes away. This has happened at least three times in the last 24 hours.
fastly-ingress-7cb9d8f8d4-wtxb4:linkerd-proxy {"timestamp":"[ 384.647220s]","level":"INFO","fields":{"message":"Connection closed","error":"Service in fail-fast"},"target":"linkerd2_app_core::serve","span":{"peer.addr":"100.96.0.146:49428","target.addr":"100.96.43.254:8080","name":"accept"},"spans":[{"name":"outbound"},{"peer.addr":"100.96.0.146:49428","target.addr":"100.96.43.254:8080","name":"accept"}],"threadId":"ThreadId(1)"}
fastly-ingress-7cb9d8f8d4-xp7w2:linkerd-proxy {"timestamp":"[ 219.444015s]","level":"WARN","fields":{"message":"Could not fetch profile","error":"status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"},"target":"linkerd2_service_profiles::client","span":{"name":"outbound"},"spans":[{"name":"outbound"}],"threadId":"ThreadId(1)"}
fastly-ingress-7cb9d8f8d4-xp7w2:linkerd-proxy {"timestamp":"[ 219.448108s]","level":"WARN","fields":{"message":"Could not fetch profile","error":"status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"},"target":"linkerd2_service_profiles::client","span":{"name":"outbound"},"spans":[{"name":"outbound"}],"threadId":"ThreadId(1)"}
...
The WARN log line is repeated about 300 times.
linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods

linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running

linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running

Status check results are √
Hey @kforsthoevel, when this error happens, is it possible for you to get logs from the Linkerd destination controller? The proxy error that you shared suggests that it is getting an error response from the destination controller so it would be helpful to see those logs.
kubectl -n linkerd logs deploy/linkerd-destination destination
My guess is that we're probably hitting this: https://github.com/linkerd/linkerd2/blob/main/controller/api/destination/watcher/ip_watcher.go#L162
We may need to filter out completed pods from that list.
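Roughly the kind of filtering I have in mind, sketched below. This is illustrative only, not the actual ip_watcher code; isTerminated and runningPodsForIP are made-up names for the sketch.

package watcher

import (
	corev1 "k8s.io/api/core/v1"
)

// isTerminated is a hypothetical helper: it reports whether a pod has
// finished running (Succeeded, i.e. "Completed", or Failed) and therefore
// no longer owns its IP, so it should be skipped when resolving an IP.
func isTerminated(pod *corev1.Pod) bool {
	phase := pod.Status.Phase
	return phase == corev1.PodSucceeded || phase == corev1.PodFailed
}

// runningPodsForIP filters the candidate pods found for a given IP so that
// only pods that are still running (and so still hold the IP) are returned.
func runningPodsForIP(pods []*corev1.Pod) []*corev1.Pod {
	var out []*corev1.Pod
	for _, p := range pods {
		if isTerminated(p) {
			continue
		}
		out = append(out, p)
	}
	return out
}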
I ran into a similar issue without completed pods. I think this issue also exists when communicating with pods that use hostPort, which causes an IP conflict with system pods or anything else using hostNetwork: true, even if those pods aren't meshed.
@01100010011001010110010101110000 we already have special logic to exclude pods with hostNetwork: true. See: https://github.com/linkerd/linkerd2/pull/4335. If you're seeing an issue related to hostNetwork pods, please file a separate issue.
@adleong Here are some logs from the linkerd-destination during the incident. There is nothing beyond level=info:
...
time="2020-12-16T10:50:52Z" level=info msg="Stopping watch on profile default/h-cortex-injixo" addr=":8086" component=traffic-split-watcher
time="2020-12-16T10:50:50Z" level=info msg="Establishing watch on profile default/h-cortex-injixo.default.svc.cluster.local" addr=":8086" component=profile-watcher
time="2020-12-16T10:50:49Z" level=info msg="Stopping watch on profile default/h-cortex-injixo.default.svc.cluster.local" addr=":8086" component=profile-watcher
time="2020-12-16T10:50:47Z" level=info msg="Establishing watch on endpoint [default/h-cortex-injixo:80]" addr=":8086" component=endpoints-watcher
...
But I did find this pattern again, in the linkerd-proxy container of the Linkerd Prometheus pod:
{
"timestamp": "[ 70248.367398s]",
"level": "INFO",
"fields": {
"message": "Connection closed",
"error": "Service in fail-fast"
},
"target": "linkerd2_app_core::serve",
"span": {
"peer.addr": "100.96.0.126:55194",
"target.addr": "100.96.43.254:4191",
"name": "accept"
},
"spans": [
{
"name": "outbound"
},
{
"peer.addr": "100.96.0.126:55194",
"target.addr": "100.96.43.254:4191",
"name": "accept"
}
],
"threadId": "ThreadId(1)"
}
Followed by hundreds of these messages:
{
"timestamp": "[ 70257.834108s]",
"level": "WARN",
"fields": {
"message": "Could not fetch profile",
"error": "status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"
},
"target": "linkerd2_service_profiles::client",
"span": {
"name": "outbound"
},
"spans": [
{
"name": "outbound"
}
],
"threadId": "ThreadId(1)"
}
I hope this helps. Let me know if you need further info.
Just hit the bug as well.
Using linkerd 2.9.1
Kubernetes 1.20.1
This is part of the log:
python-sample-6977b96d58-znqpk linkerd-proxy] [ 102.797393s] WARN ThreadId(01) outbound: linkerd2_service_profiles::client: Could not fetch profile error=status: FailedPrecondition, message: "IP address conflict: &Pod{ObjectMeta:{apprepo-apps-sync-cowboysysop-wfctv-vvmnx apprepo-apps-sync-cowboysysop-wfctv- apps e66447da-aa24-4869-881e-6e9611a71070 164617 0 2021-01-11 22:50:13 +0000 UTC <nil> <nil> map[apprepositories.kubeapps.com/repo-name:cowboysysop apprepositories.kubeapps.com/repo-namespace:apps controller-uid:65f1f60d-2283-4487-9566-ce62f31b06a3 job-name:apprepo-apps-sync-cowboysysop-wfctv] map[cni.projectcalico.org/podIP: cni.projectcalico.org/podIPs: container.apparmor.security.beta.kubernetes.io/sync:runtime/default kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container sync; cpu, memory limit for container sync kubernetes.io/psp:restricted seccomp.security.alpha.kubernetes.io/pod:runtime/default] [{batch/v1 Job apprepo-apps-sync-cowboysysop-wfctv 65f1f60d-2283-4487-9566-ce62f31b06a3 0xc0014289e3 0xc0
@kfirfer Sorry you ran into this issue! We have #5412 open, which should fix it. This is a good reminder to get that reviewed so that hopefully our next edge release can pick up the fix. I'll make sure to leave a comment here when that gets merged.
FYI, we now delete all completed pods every 30 seconds as a workaround for this issue. It is not ideal.
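If it helps anyone, the cleanup is essentially a cron that runs something along the lines of kubectl delete pods --field-selector=status.phase=Succeeded --all-namespaces (a sketch from memory, not our literal job).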
This should be closed by #5412, which will be included in our edge release later this week.
For anyone who has run into this issue, please reopen it if it still occurs.
Just a heads up for those looking to get this fix but who prefer to remain on stable: stable-2.9.2 was released today and contains the fix. Thank you for the reports on this!