Linkerd2: Traffic stops when two pods have the same IP address

Created on 16 Dec 2020 · 10 Comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

Traffic to a pod stops when the pod has the same IP address as another pod with state Completed.

When I take the target address ("target.addr":"100.96.43.254:8080") from the logs and run kubectl get po -A -owide | grep 100.96.43.254, I get two pods with the same IP address:

default         h-cortex-injixo-6c4f975874-n47fb                                      3/3     Running     0          40m     100.96.43.254    ip-172-21-2-50.eu-west-1.compute.internal    <none>           <none>
default         h-custom-integrations-interflex-month-balances-1608003600-42242       0/1     Completed   0          31h     100.96.43.254    ip-172-21-2-50.eu-west-1.compute.internal    <none>           <none>

See: https://github.com/kubernetes/kubernetes/issues/92697

When I delete one of the pods, the problem goes away. This problem has occurred at least three times in the last 24 hours.
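For what it's worth, duplicates like this can be spotted cluster-wide without knowing the IP in advance. Below is a minimal sketch using client-go; it is a hypothetical helper, not part of Linkerd, and assumes a kubeconfig at the default path:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load ~/.kube/config (clientcmd.RecommendedHomeFile) to talk to the cluster.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Group pods by status.podIP and report any IP assigned to more than one pod.
	byIP := map[string][]string{}
	for _, p := range pods.Items {
		if p.Status.PodIP == "" || p.Spec.HostNetwork {
			continue // host-network pods legitimately share the node's IP
		}
		byIP[p.Status.PodIP] = append(byIP[p.Status.PodIP],
			fmt.Sprintf("%s/%s (%s)", p.Namespace, p.Name, p.Status.Phase))
	}
	for ip, owners := range byIP {
		if len(owners) > 1 {
			fmt.Printf("%s is assigned to %v\n", ip, owners)
		}
	}
}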

How can it be reproduced?

Logs, error output, etc

fastly-ingress-7cb9d8f8d4-wtxb4:linkerd-proxy {"timestamp":"[   384.647220s]","level":"INFO","fields":{"message":"Connection closed","error":"Service in fail-fast"},"target":"linkerd2_app_core::serve","span":{"peer.addr":"100.96.0.146:49428","target.addr":"100.96.43.254:8080","name":"accept"},"spans":[{"name":"outbound"},{"peer.addr":"100.96.0.146:49428","target.addr":"100.96.43.254:8080","name":"accept"}],"threadId":"ThreadId(1)"}
fastly-ingress-7cb9d8f8d4-xp7w2:linkerd-proxy {"timestamp":"[   219.444015s]","level":"WARN","fields":{"message":"Could not fetch profile","error":"status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"},"target":"linkerd2_service_profiles::client","span":{"name":"outbound"},"spans":[{"name":"outbound"}],"threadId":"ThreadId(1)"}
fastly-ingress-7cb9d8f8d4-xp7w2:linkerd-proxy {"timestamp":"[   219.448108s]","level":"WARN","fields":{"message":"Could not fetch profile","error":"status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"},"target":"linkerd2_service_profiles::client","span":{"name":"outbound"},"spans":[{"name":"outbound"}],"threadId":"ThreadId(1)"}
... 
The WARN log line is repeated about 300 times.

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods

linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running

linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running

Status check results are √

Environment

  • Kubernetes Version: v1.18.13
  • Cluster Environment: kops on AWS
  • Host OS: Ubuntu 20.04 LTS
  • Linkerd version: Client version: stable-2.9.1, Server version: stable-2.9.1

Possible solution

Additional context

bug


All 10 comments

Hey @kforsthoevel, when this error happens, is it possible for you to get logs from the Linkerd destination controller? The proxy error that you shared suggests that it is getting an error response from the destination controller, so it would be helpful to see those logs.

kubectl -n linkerd logs deploy/linkerd-destination destination

My guess is that we're probably hitting this: https://github.com/linkerd/linkerd2/blob/main/controller/api/destination/watcher/ip_watcher.go#L162

We may need to filter out completed pods from that list.
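Conceptually, that filtering could look like the sketch below: a hypothetical helper over the []*corev1.Pod slice that the IP watcher builds, dropping pods whose phase is Succeeded or Failed. This illustrates the idea only; it is not Linkerd's actual code.

import corev1 "k8s.io/api/core/v1"

// runningPods is a hypothetical filter: terminated pods keep their
// status.podIP in the API object even though the CNI may have already
// reassigned that address, so they must not be treated as owners of the IP.
func runningPods(pods []*corev1.Pod) []*corev1.Pod {
	out := make([]*corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			continue // Completed/Failed pods no longer own their recorded IP
		}
		out = append(out, p)
	}
	return out
}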

I ran into a similar issue without completed pods. I think this issue also exists when communicating with pods that use hostPort, which causes an IP conflict with system pods or anything else using hostNetwork: true, even if those pods aren't meshed.

@01100010011001010110010101110000 we already have special logic to exclude pods with hostNetwork: true. See: https://github.com/linkerd/linkerd2/pull/4335. If you're seeing an issue related to hostNetwork pods, please file a separate issue.
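For reference, that exclusion amounts to checking the pod spec's HostNetwork flag when indexing pods by IP; a rough, illustrative sketch of the guard (the actual code in #4335 is structured differently):

// skipPodIP is an illustrative predicate, not Linkerd's actual code:
// host-network pods report the node's IP as their pod IP, so they must
// never be indexed by that address.
func skipPodIP(pod *corev1.Pod) bool {
	return pod.Spec.HostNetwork
}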

@adleong Here are some logs from linkerd-destination during the incident. They are all just level=info:

...
time="2020-12-16T10:50:52Z" level=info msg="Stopping watch on profile default/h-cortex-injixo" addr=":8086" component=traffic-split-watcher
time="2020-12-16T10:50:50Z" level=info msg="Establishing watch on profile default/h-cortex-injixo.default.svc.cluster.local" addr=":8086" component=profile-watcher
time="2020-12-16T10:50:49Z" level=info msg="Stopping watch on profile default/h-cortex-injixo.default.svc.cluster.local" addr=":8086" component=profile-watcher
time="2020-12-16T10:50:47Z" level=info msg="Establishing watch on endpoint [default/h-cortex-injixo:80]" addr=":8086" component=endpoints-watcher
...

But I found this pattern again in the linkerd-proxy of the Linkerd Prometheus pod:

{
  "timestamp": "[ 70248.367398s]",
  "level": "INFO",
  "fields": {
    "message": "Connection closed",
    "error": "Service in fail-fast"
  },
  "target": "linkerd2_app_core::serve",
  "span": {
    "peer.addr": "100.96.0.126:55194",
    "target.addr": "100.96.43.254:4191",
    "name": "accept"
  },
  "spans": [
    {
      "name": "outbound"
    },
    {
      "peer.addr": "100.96.0.126:55194",
      "target.addr": "100.96.43.254:4191",
      "name": "accept"
    }
  ],
  "threadId": "ThreadId(1)"
}

Followed by hundreds of these messages:

{
  "timestamp": "[ 70257.834108s]",
  "level": "WARN",
  "fields": {
    "message": "Could not fetch profile",
    "error": "status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"
  },
  "target": "linkerd2_service_profiles::client",
  "span": {
    "name": "outbound"
  },
  "spans": [
    {
      "name": "outbound"
    }
  ],
  "threadId": "ThreadId(1)"
}

I hope this helps. Let me know if you need further info.

Just hit the bug as well.

Using linkerd 2.9.1
Kubernetes 1.20.1

This is part of the log:

python-sample-6977b96d58-znqpk linkerd-proxy] [   102.797393s]  WARN ThreadId(01) outbound: linkerd2_service_profiles::client: Could not fetch profile error=status: FailedPrecondition, message: "IP address conflict: &Pod{ObjectMeta:{apprepo-apps-sync-cowboysysop-wfctv-vvmnx apprepo-apps-sync-cowboysysop-wfctv- apps  e66447da-aa24-4869-881e-6e9611a71070 164617 0 2021-01-11 22:50:13 +0000 UTC <nil> <nil> map[apprepositories.kubeapps.com/repo-name:cowboysysop apprepositories.kubeapps.com/repo-namespace:apps controller-uid:65f1f60d-2283-4487-9566-ce62f31b06a3 job-name:apprepo-apps-sync-cowboysysop-wfctv] map[cni.projectcalico.org/podIP: cni.projectcalico.org/podIPs: container.apparmor.security.beta.kubernetes.io/sync:runtime/default kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container sync; cpu, memory limit for container sync kubernetes.io/psp:restricted seccomp.security.alpha.kubernetes.io/pod:runtime/default] [{batch/v1 Job apprepo-apps-sync-cowboysysop-wfctv 65f1f60d-2283-4487-9566-ce62f31b06a3 0xc0014289e3 0xc0

@kfirfer Sorry you ran into this issue! We have #5412 open which should fix this issue. This is a good reminder to get that reviewed, and hopefully our next edge release can pick up the fix. I'll make sure to leave a comment here when that gets merged.

FYI, we now delete all completed pods every 30 seconds as a workaround for this issue. It is not ideal.
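For anyone considering the same workaround, here is a minimal sketch of such a cleanup loop using client-go. It is hypothetical, assumes it runs in-cluster with RBAC to list and delete pods, and a cron job running kubectl delete pods --field-selector=status.phase=Succeeded per namespace would do much the same:

package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes in-cluster credentials with permission to list and delete pods.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	for range time.Tick(30 * time.Second) {
		// The API server filters by phase, so only Completed pods come back.
		pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(
			context.TODO(), metav1.ListOptions{FieldSelector: "status.phase=Succeeded"})
		if err != nil {
			log.Printf("list failed: %v", err)
			continue
		}
		for _, p := range pods.Items {
			if err := clientset.CoreV1().Pods(p.Namespace).Delete(
				context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
				log.Printf("delete %s/%s failed: %v", p.Namespace, p.Name, err)
			}
		}
	}
}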

This should be closed by #5412, and will be included in our edge release later this week.

For anyone who has run into this issue, please reopen it if it still occurs.

Just a heads up for those looking to get this fix in, but prefer to remain on stable, stable-2.9.2 was just released today and contains this fix. Thank you for the reports on this!

