Traffic to a pod stops when the pod has the same IP address as another pod with state Completed.
When I look at "target.addr":"100.96.43.254:8080" from the logs and run kubectl get po -A -owide | grep 100.96.43.254, I get two pods with the same IP address:
default h-cortex-injixo-6c4f975874-n47fb 3/3 Running 0 40m 100.96.43.254 ip-172-21-2-50.eu-west-1.compute.internal <none> <none>
default h-custom-integrations-interflex-month-balances-1608003600-42242 0/1 Completed 0 31h 100.96.43.254 ip-172-21-2-50.eu-west-1.compute.internal <none> <none>
See: https://github.com/kubernetes/kubernetes/issues/92697
When I delete one of the pods, the problem goes away. This has happened at least three times in the last 24 hours.
fastly-ingress-7cb9d8f8d4-wtxb4:linkerd-proxy {"timestamp":"[ 384.647220s]","level":"INFO","fields":{"message":"Connection closed","error":"Service in fail-fast"},"target":"linkerd2_app_core::serve","span":{"peer.addr":"100.96.0.146:49428","target.addr":"100.96.43.254:8080","name":"accept"},"spans":[{"name":"outbound"},{"peer.addr":"100.96.0.146:49428","target.addr":"100.96.43.254:8080","name":"accept"}],"threadId":"ThreadId(1)"}
fastly-ingress-7cb9d8f8d4-xp7w2:linkerd-proxy {"timestamp":"[ 219.444015s]","level":"WARN","fields":{"message":"Could not fetch profile","error":"status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"},"target":"linkerd2_service_profiles::client","span":{"name":"outbound"},"spans":[{"name":"outbound"}],"threadId":"ThreadId(1)"}
fastly-ingress-7cb9d8f8d4-xp7w2:linkerd-proxy {"timestamp":"[ 219.448108s]","level":"WARN","fields":{"message":"Could not fetch profile","error":"status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"},"target":"linkerd2_service_profiles::client","span":{"name":"outbound"},"spans":[{"name":"outbound"}],"threadId":"ThreadId(1)"}
...
The WARN log line is repeated about 300 times.
linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods

linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running

linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running

Status check results are √
Hey @kforsthoevel, when this error happens, is it possible for you to get logs from the Linkerd destination controller? The proxy error that you shared suggests that it is getting an error response from the destination controller so it would be helpful to see those logs.
kubectl -n linkerd logs deploy/linkerd-destination destination
My guess is that we're probably hitting this: https://github.com/linkerd/linkerd2/blob/main/controller/api/destination/watcher/ip_watcher.go#L162
We may need to filter out completed pods from that list.
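Roughly the kind of filtering I have in mind, sketched below. This is illustrative only, not the actual ip_watcher code; isTerminated and runningPodsForIP are made-up names for the sketch.

package watcher

import (
	corev1 "k8s.io/api/core/v1"
)

// isTerminated is a hypothetical helper: it reports whether a pod has
// finished running (Succeeded, i.e. "Completed", or Failed) and therefore
// no longer owns its IP, so it should be skipped when resolving an IP.
func isTerminated(pod *corev1.Pod) bool {
	phase := pod.Status.Phase
	return phase == corev1.PodSucceeded || phase == corev1.PodFailed
}

// runningPodsForIP filters the candidate pods found for a given IP so that
// only pods that are still running (and so still hold the IP) are returned.
func runningPodsForIP(pods []*corev1.Pod) []*corev1.Pod {
	var out []*corev1.Pod
	for _, p := range pods {
		if isTerminated(p) {
			continue
		}
		out = append(out, p)
	}
	return out
}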
I ran into a similar issue without completed pods. I think this issue also exists when communicating with pods that use hostPort, which causes an IP conflict with system pods or anything else using hostNetwork: true, even if those pods aren't meshed.
@01100010011001010110010101110000 we already have special logic to exclude pods with hostNetwork: true. See: https://github.com/linkerd/linkerd2/pull/4335. If you're seeing an issue related to hostNetwork pods, please file a separate issue.
@adleong Here are some logs from the linkerd-destination during the incident. There is nothing beyond level=info:
...
time="2020-12-16T10:50:52Z" level=info msg="Stopping watch on profile default/h-cortex-injixo" addr=":8086" component=traffic-split-watcher
time="2020-12-16T10:50:50Z" level=info msg="Establishing watch on profile default/h-cortex-injixo.default.svc.cluster.local" addr=":8086" component=profile-watcher
time="2020-12-16T10:50:49Z" level=info msg="Stopping watch on profile default/h-cortex-injixo.default.svc.cluster.local" addr=":8086" component=profile-watcher
time="2020-12-16T10:50:47Z" level=info msg="Establishing watch on endpoint [default/h-cortex-injixo:80]" addr=":8086" component=endpoints-watcher
...
But I did find this pattern again, in the linkerd-proxy container of the Linkerd Prometheus pod:
{
"timestamp": "[ 70248.367398s]",
"level": "INFO",
"fields": {
"message": "Connection closed",
"error": "Service in fail-fast"
},
"target": "linkerd2_app_core::serve",
"span": {
"peer.addr": "100.96.0.126:55194",
"target.addr": "100.96.43.254:4191",
"name": "accept"
},
"spans": [
{
"name": "outbound"
},
{
"peer.addr": "100.96.0.126:55194",
"target.addr": "100.96.43.254:4191",
"name": "accept"
}
],
"threadId": "ThreadId(1)"
}
Followed by hundreds of these messages:
{
"timestamp": "[ 70257.834108s]",
"level": "WARN",
"fields": {
"message": "Could not fetch profile",
"error": "status: Unknown, message: \"http2 error: protocol error: unexpected internal error encountered\", details: [], metadata: MetadataMap { headers: {} }"
},
"target": "linkerd2_service_profiles::client",
"span": {
"name": "outbound"
},
"spans": [
{
"name": "outbound"
}
],
"threadId": "ThreadId(1)"
}
I hope this helps. Let me know if you need further info.
Just hit the bug as well.
Using linkerd 2.9.1
Kubernetes 1.20.1
This is part of the log:
python-sample-6977b96d58-znqpk linkerd-proxy] [ 102.797393s] WARN ThreadId(01) outbound: linkerd2_service_profiles::client: Could not fetch profile error=status: FailedPrecondition, message: "IP address conflict: &Pod{ObjectMeta:{apprepo-apps-sync-cowboysysop-wfctv-vvmnx apprepo-apps-sync-cowboysysop-wfctv- apps e66447da-aa24-4869-881e-6e9611a71070 164617 0 2021-01-11 22:50:13 +0000 UTC <nil> <nil> map[apprepositories.kubeapps.com/repo-name:cowboysysop apprepositories.kubeapps.com/repo-namespace:apps controller-uid:65f1f60d-2283-4487-9566-ce62f31b06a3 job-name:apprepo-apps-sync-cowboysysop-wfctv] map[cni.projectcalico.org/podIP: cni.projectcalico.org/podIPs: container.apparmor.security.beta.kubernetes.io/sync:runtime/default kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container sync; cpu, memory limit for container sync kubernetes.io/psp:restricted seccomp.security.alpha.kubernetes.io/pod:runtime/default] [{batch/v1 Job apprepo-apps-sync-cowboysysop-wfctv 65f1f60d-2283-4487-9566-ce62f31b06a3 0xc0014289e3 0xc0
@kfirfer Sorry you ran into this issue! We have #5412 open, which should fix it. This is a good reminder to get that reviewed so that hopefully our next edge release can pick up the fix. I'll make sure to leave a comment here when that gets merged.
FYI, we now delete all completed pods every 30 seconds as a workaround for this issue. It is not ideal.
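If it helps anyone, the cleanup is essentially a cron that runs something along the lines of kubectl delete pods --field-selector=status.phase=Succeeded --all-namespaces (a sketch from memory, not our literal job).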
This should be closed by #5412, which will be included in our edge release later this week.
For anyone who has run into this issue, please reopen it if it still occurs.
Just a heads up for those looking to get this fix but who prefer to remain on stable: stable-2.9.2 was released today and contains the fix. Thank you for the reports on this!