Linkerd2: Linkerd 2.5.0: linkerd2_proxy::app::errors unexpected error: error trying to connect: No route to host (os error 113) (address: 10.10.3.181:8080)

Created on 18 Sep 2019 · 57 Comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

We have an injected pod that has been running for days, connecting to a partially (1/3) injected deployment, and it eventually starts throwing the mentioned error.

How can it be reproduced?

Run a pod for days and let it talk to a deployment that is regularly restarted, which causes new pods, new IP addresses, etc.
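
A minimal way to drive this kind of endpoint churn, as a sketch with hypothetical namespace and deployment names, is a restart loop like this:

# Restart the target deployment once a minute to force new pods and new IPs
while true; do
  kubectl -n <namespace> rollout restart deploy/<deployment>
  sleep 60
done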

Logs, error output, etc

linkerd-proxy ERR! [589538.085822s] linkerd2_proxy::app::errors unexpected error: error trying to connect: No route to host (os error 113) (address: 10.10.3.181:8080)

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.15.2
  • Cluster Environment: custom
  • Host OS: CoreOS 2191.5.0
  • Linkerd version: 2.5.0

Possible solution

Additional context

To me, not knowing all the details, it looks like the proxy is not "refreshing" the endpoints for the service and eventually just runs out of valid IP addresses. For us it would be fine if the proxy simply exited and let the pod get restarted.

Also: linkerd is pretty awesome, thanks for all your effort you put into it!

Labels: bug, staleness, needrepro

All 57 comments

At a super high level, the proxy discovers new endpoints via:

proxy (watch) -> destination (watch) -> api-server

In the past when this has been an issue, it is because there's a poor connection between destination and the api-server. Next time this happens, try restarting the linkerd-controller pod and see if that fixes it. If that doesn't fix the problem, check out the endpoints for the service and make sure those are being updated correctly.
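
For reference, restarting the controller could look something like this (a sketch; kubectl rollout restart requires kubectl 1.15 or later):

kubectl -n linkerd rollout restart deploy/linkerd-controller
kubectl -n linkerd rollout status deploy/linkerd-controller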

@grampelberg thanks for the swift response.

If that doesn't fix the problem, check out the endpoints for the service and make sure those are being updated correctly.

I assume you mean the kubernetes service. The endpoints there are correct, 100%, the rest of the system is using that service successfully.

In the past when this has been an issue, it is because there's a poor connection between destination and the api-server.

We tried to see anything in the logs of these components, without finding anything that looks like an error or "poor" connection.

Do you have any hints for which kind of log messages we would look out for?

Thanks!

There wouldn't be anything in the logs for the cases that I'm thinking of. Basically, the connection dies silently and no updates happen.

kubernetes service

api-server specifically. It sounds like we're on the same page =)

Hello,

I am in the same team as @bjoernhaeuser.

The connection issue just appeared again. As suggested, restarting all Linkerd controllers fixed the problem.

Nevertheless, letting a connection silently die and not logging anything doesn't sound quite resilient to me. Is this somehow an expected behavior, or is this just hard to fix currently?

Also, we do not know how to properly debug the apiserver connection issues. Do you have any hints for us on how to do this? Unfortunately I forgot to take a look at the logs before restarting the controllers.

Thank you!

@svenwltr @bjoernhaeuser I'm working to reproduce this and will let you know my findings. I hope to be able to reproduce it in a time period shorter than days.

Is there any chance you have an estimate of the number of times that the deployments were restarted? Also, how many replicas of the restarted deployment are you running?

I'm running in GKE with a cluster running 382 pods on version 1.12.8-gke.10. Linkerd is running in HA and I'm seeing that at random intervals of a few hours to 10 days I will start getting 503s originating from one of my backend services. There are 3 services that communicate via HTTP calls (Services A, B, C). Service A is a public-facing API and makes calls back to services B and C. When the events happen we will begin seeing the 503s from service B. Additionally, using Prometheus I can see a step increase in memory usage in the linkerd-controller pods leading up to the 503 responses.

These 503 responses bubble up to our public API (Service A) after 3 seconds. This situation quickly degrades into most calls to service A returning 503s also. The 503 events usually occur with a deployment of Service A but will also occur after about 3 days max of continuous operation with no deployments.

This has happened 4 times now and the only thing that seems to correct it is restarting the linkerd-controller deployment. I've tried restarting each affected service separately but I still received 503s until restarting the Linkerd controller.

I also tried uninstalling linkerd from the cluster completely and reinstalling it but the problem remains. Currently I have uninjected LinkerD from the services that have been affected and so far I haven't seen any of the 503s.

Service A listed above has 36 pods, each with 2 containers, 1 being the linkerd-proxy.

Service B listed above has 18 pods, each with 2 containers, 1 being the linkerd-proxy.

There are 10 total namespaces with linkerd injected. When I uninject the proxies from Services A, B, and C, linkerd is stable and will run without problems.

I was able to gather linkerd-controller metrics at the last occurrence and forwarded this data to @cpretzer

@jamesallen-vol I've been looking through the logs that you sent and haven't found anything conclusive yet.

The before and after restart output from the metrics server looks to be promising. I hope to find a metric count in there that will show us what may have consumed the memory.

Do you have any other details around what occurred at the time that the controller failed? So far, my attempts to reproduce this haven't yielded anything. I started another run this morning and the memory usage of the destination container has been stable, despite a continuous rollout restart of the deployments in the emojivoto application that I am using to test.

There are four services, and I have two of them that are restarting every minute.

I don't think that this will make a difference, but can you tell me if you're using an ingress controller that is injected with the linkerd proxy?

@jamesallen-vol @cpretzer This sounds like a different problem. Shall we create a separate issue for this? We can explore further there. In particular, I'm interested in knowing when the 503s happen, did the readiness/liveness probes of Services A, B and C and/or their proxies start failing? What does the step increase in the destination's memory usage look like (i.e. maybe the pod is still healthy despite the memory spike)? Did any of the Linkerd control plane pods (controller, prometheus) run into OOMKilled events?

Also, we do not know how to properly debug the apiserver connection issues.

@svenwltr I notice that you are running k8s on custom environment on CoreOS. Do you have access to the api server logs (via journalctl perhaps)? Also curious if you are running any custom CNI, overlay network etc.?

Can you give the new RC a try?

it looks like that the proxy is not "refreshing" the endpoints for the service

The linkerd endpoints command may be able to help with debugging this.
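
As a rough sketch, the proxy's view and Kubernetes' view of the endpoints could be compared side by side (service and namespace are placeholders; depending on your Linkerd version, the authority may need an explicit port suffix):

kubectl -n <namespace> get endpoints <service>
linkerd endpoints <service>.<namespace>.svc.cluster.local:8080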

Sorry for the late response @cpretzer.

Is there any chance you have an estimate of the number of times that the deployments were restarted? Also, how many replicas of the restarted deployment are you running?

This is quite hard to estimate, but it looks like roughly 15 pods were created in total while running 3 replicas.

I notice that you are running k8s on custom environment on CoreOS. Do you have access to the api server logs (via journalctl perhaps)? Also curious if you are running any custom CNI, overlay network etc.?

We have access to the apiserver logs, but the logs from the last failure are already gone. We are using flannel as overlay network, but we do not see any issue with other services.

Can you give the new RC a try?

We will try to reproduce the issue more reliably first. Otherwise it would take some weeks before figuring out if it solved the problem. Also, we will try to get more insights using linkerd endpoints and the apiserver logs.

We just had the problem again. linkerd endpoints always delivered the correct results (tested against the controller, the api, and the deployments themselves).

When this happens, before restarting, can you get a snapshot of the metrics?

kubectl -n linkerd port-forward deploy/linkerd-controller 9996 &
curl localhost:9996/metrics

That'll give a ton of visibility into what's happening there. linkerd endpoints being correct suggests that it isn't a stale connection to the api-server.
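
If it helps, the snapshot can be written to a timestamped file so that before/after captures can be diffed later (a sketch based on the commands above):

kubectl -n linkerd port-forward deploy/linkerd-controller 9996 &
curl -s localhost:9996/metrics > controller-metrics-$(date +%s).txt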

Also I do not see anything related in the apiserver logs at this time.

Nevertheless the logs are full of these messages:

ERROR $root.paths./apis/tap.linkerd.io.get is missing required property: responses

and

ERROR $root.paths./apis/tap.linkerd.io/v1alpha1.get is missing required property: responses

Not sure whether this is related, but they appear continuously.

@svenwltr that should be benign and unrelated. Those are from the APIService that the api-server uses to forward requests to tap. The k8s folks haven't published what that API expects and so we've been slowly filling out what it should be.

@cpretzer The jump in memory usage in the linkerd-controller pod appears to lead the 503s by a few minutes. How it currently appears is that there will be a jump to above 400MB in combined memory usage across the linkerd-controller pods. When this happens, the 503 errors from the backing service start occurring.
@ihcsim there are no failures for the readiness/liveness probes of Services A, B and C or the linkerd-controllers. Likewise, none of the pods in the linkerd namespace experience any OOMKilled events. The linkerd-controller pods are passing the readiness/liveness checks during these events. I've tried restarting each individual service without restarting the linkerd-controller when it happens, but this didn't clear the problem. Only restarting the linkerd-controller restores normal operation. If there is anything I can check or start monitoring I'm glad to help out @ihcsim @cpretzer. Also, thank you both for all the help with this!

@jamesallen-vol see my comment for what we could use next =)

@grampelberg For sure. I'll grab them again the next time it happens. I did pull those metrics the last time and sent them to Charles.

@jamesallen-vol I and a colleague looked at the logs that you sent over. Unfortunately, there's no obvious cause for the behavior in the log files.

I've been running a test for the last three days in an attempt to reproduce this behavior, and so far the linkerd-controller component has been stable.

I'd like to get some more information from you and @bjoernhaeuser about the kubernetes cluster. Specifically, what are the other pieces of software (controllers, operators, utilities, etc.) that are running in the cluster that might affect operation of pods in the cluster?

For example, are there any reconciliation controllers like flux that would change the state of the system?

@jamesallen-vol @bjoernhaeuser did you specify any options when you ran linkerd install? In my current test, I'm using the base installation, and I'd like to know how your installation is different than mine. Did you specify the --ha option?

@cpretzer I used the --ha option on the install. There are no reconciliation controllers like flux changing the state of the system. The only operator I'm currently using is a prometheus-operator. The other apps are deployed via helm templates and use standard kubernetes objects.

@jamesallen-vol I've been working to reproduce this, but haven't seen the exact behavior. What I have seen is that a node occasionally fails and goes into a NotReady state, and the description of the node indicates that the kubelet is not reporting metrics about the node.

Unfortunately, I can't get access to the kubelet logs on this EKS cluster, so I'm going to tear it down and create a new cluster.

I also have an identical issue to this.

I restarted the linkerd-controller pods (which fixed the issue) before seeing this comment, but I will do this when it happens again.

Not sure what else I can offer that hasn't already been said, but happy to help where I can. I could probably give access to the cluster if that helps.

I'm running a 3 node dev cluster on DigitalOcean with --ha enabled.

  • Linkerd Version: stable-2.5.0
  • k8s version: 1.15.2
  • meshed nginx-ingress
  • Also running flux and flagger

@nickjackson thanks for helping to get to the bottom of this.

How often does this occur?

Does flux manage your linkerd deployments?

@cpretzer no problem.

Hard to say how often it occurs. Every few days something will start 503'ing, even without any deploys. It's a development cluster, so it's only got minimal traffic on it.

I'm not convinced the issue is tied to deployments, but when we do deploy, the issue becomes apparent, probably because the proxies have out-of-date IPs. When I kubectl exec into an affected pod and try to curl any of my internal services, it will just 503.

Flux does manage the linkerd deployment, but I haven't made any changes to the linkerd manifests in my flux repo in about two months. Most of the pods in the linkerd namespace have been running for 50 days (with the exception of the controller which I restarted this morning)

Actually, I've just noticed that I get the linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline error, not the one in the issue title, but the symptoms are the same.

Thought you should know that I upgraded to 2.6 and am seeing this issue still. :-(

This time it happened to an nginx ingress pod, and we noticed because we deployed a new frontend service, and a 3rd of requests started 503ing. Restarting the linkerd controller fixed the issue.

I've captured the metrics from the linkerd-controller (before restart), along with the logs on that nginx pod. I've sent as a Slack DM to @cpretzer and @grampelberg.

Thanks @nickjackson, I'll look through the info you provided and let you know what I find.

Are you using service profiles in your injected deployments?

I've not got around to that yet. Would that help you out if I did?

No, but thank you for offering.

I ask because someone else experiencing similar behavior was using service profiles and removing them appeared to resolve the issue

@nickjackson where does your cluster live?

@nickjackson where does your cluster live?

DigitalOcean's Kubernetes offering. LON1.

When this happens, before restarting, can you get a snapshot of the metrics?

@grampelberg The service just failed again and we were able to get some metrics. The broken one is the blue-order-silo: https://gist.github.com/svenwltr/8e806e8c27758485def1b9ae348becb1

What's surprising to me about this is that the linkerd endpoints command gives the correct results but restarting the pod (and thus the Linkerd proxy) does not fix the problem. Both of those actions create a brand new watch on the destination service so it's very weird if they give different results.

I'm curious about the ip address that appears in the proxy logs, e.g. No route to host (os error 113) (address: 10.10.3.181:8080)

When the 503s occur, is that IP address a member of the service? is it listed in kubectl get endpoints/<service name>? is it listed in linkerd endpoints? if so, it suggests that the Linkerd proxy is getting the correct endpoint set, but somehow still failing to route to it. If not, it suggests that the Linkerd proxy is somehow not getting the correct information from the destination service. For example, if that IP address corresponds to a recently deleted pod, the proxy may have somehow not received the delete event.
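
Concretely, those checks could look something like this, using the IP from the log line above (substitute your own service and namespace):

kubectl -n <namespace> get endpoints <service> -o yaml | grep 10.10.3.181
linkerd endpoints <service>.<namespace>.svc.cluster.local | grep 10.10.3.181
kubectl get pods --all-namespaces -o wide | grep 10.10.3.181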

However, when the Linkerd proxy restarts, it should get a full refresh of the endpoint set from the destination service. After restarting the proxy, is the log message the same? Does it still contain an IP address which does not correspond to any pod in the cluster?

@svenwltr Thanks for the gist. The metrics all look healthy, so I'm continuing to try to reproduce on my side. So far, the linkerd-controller pods in my cluster have remained healthy, after running for several days.

@grampelberg: @nickjackson said that this occurred on 2.6.0 and restarting the linkerd-controller deployment resolved the errors. In the initial reply, you describe the proxy and destination interactions with:

proxy (watch) -> destination (watch) -> api-server

If on 2.6.0, restarting the linkerd-controller addresses the issue, then should we be looking at the public-api container instead of destination?

@adleong deleting the pods on a meshed service _does_ fix the issue for me. New pods get spun up, the pods reconnect fine.

It is worth noting, that I used linkerd upgrade to go from 2.4 through to 2.6. Some pods are still using the 2.5 proxy.

I did manage to kubectl exec into a service with a broken proxy a few days ago, and I wasn't able to curl any of my services by DNS.

@nickjackson interesting! It sounds like the issue you are experiencing is different from @jamesallen-vol's issue. In a previous comment he writes:

I've tried restarting each individual service without restarting the linkerd-controller when it happens but this didn't clear the problem. It's only when I restart the linkerd-controller that restores normal operation.

It might be good to get these two issues split into separate github issues so that we don't confuse ourselves.

I agree with @adleong that we should separate these issues out.

After reading through the thread, it seems that @jamesallen-vol and @svenwltr / @bjoernhaeuser are seeing different symptoms caused by the same trigger, which is that the linkerd-controller pod becomes inaccessible to the proxy containers after some indeterminate period of time. When this occurs, the only known resolution is to restart the linkerd-controller deployment.

For @nickjackson, restarting the proxies does resolve the issue, which is the main difference. I've opened #3599 to separate the two.

@bjoernhaeuser @svenwltr are you using Service Profiles in your deployments?

@cpretzer We indeed are using service profiles. But it does not look like the services that failed are using them.

@jamesallen-vol @bjoernhaeuser @svenwltr

I'm going to set up another run to try to test this. Can you tell me whether you installed linkerd using the install command or the helm chart?

@cpretzer We used the install command. We checked those files into SCM and are using linkerd upgrade --from-manifest to update those files.

Since I am not sure how much the manifests might diverge, I wrote the manifests into a Gist: https://gist.github.com/svenwltr/639a0f40ad5bdf4801642965a13827a7

@bjoernhaeuser @svenwltr @jamesallen-vol

@adleong and I met to discuss this issue along with #3599 to try to connect the dots on this behavior. Between these issues, we're seeing that the routes in the proxy cache appear to be out of sync with the endpoint values in the linkerd-destination control-plane component.

In order to move this forward, we'd like to get information about the connection between the linkerd-destination component and a pod that is exhibiting this behavior. To help with collecting this information, we'll use the linkerd-debug container because it contains tools for diagnosing network issues: https://linkerd.io/2/tasks/using-the-debug-container/

Specifically, we'd like to see the output from netstat to get details about the connection between the linkerd-destination pod and a service pod that is failing or is sending requests to an IP address that no longer exists.

There are a few steps to adding the debug sidecar to your application and linkerd-destination deployments, and once those are complete, we'll have to kubectl exec into the linkerd-debug container in order to add netstat and collect the output. Here's an outline:

  1. Use linkerd inject to add the debug sidecar to your service deployment
  2. Save the yaml for the linkerd-destination deployment to your file system so that we can modify it.
  3. Get the name of the linkerd-destination secret
  4. Update the linkerd-destination deployment yaml to include the debug container and references to the linkerd-destination secret
  5. kubectl exec into each of the debug containers and install netstat

1 Injecting the debug sidecar container into your service deployment

Adding the debug sidecar to your service is a matter of using the linkerd inject command on the deployment resource for the service. For example, the command below will modify the yaml to include an annotation that linkerd will use to inject the debug sidecar and write that yaml to a file named my-svc-debug.yml:

kubectl get deploy <your deployment> -n <your namespace> -oyaml | linkerd inject --enable-debug-sidecar > my-svc-debug.yml

You can inspect the file to see that the config.linkerd.io/enable-debug-sidecar: "true" annotation is added to the spec.template.metadata.annotations section of the deployment.
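
The relevant fragment of the injected manifest should look roughly like this:

spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/enable-debug-sidecar: "true"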

Then apply the file with kubectl apply -f my-svc-debug.yml.

If you want to skip writing the yaml to a file, then you can pipe it directly into kubectl apply -f:

kubectl get deploy <your deployment> -n <your namespace> -oyaml | linkerd inject --enable-debug-sidecar - | kubectl apply -f -

2 Get the yaml for the linkerd-destination deployment

Getting the debug sidecar into the linkerd-destination service is a more manual process because it is part of the control-plane.

To simplify adding the debug sidecar container to the deployment manifest, I suggest writing the linkerd-destination manifest to a file:

kubectl get deploy linkerd-destination -oyaml -n linkerd > linkerd-destination-debug.yaml

Before we update the yaml file, we'll need to get the name of the Secret that linkerd-destination uses to communicate with other parts of the control plane.

3 Get the name of the Secret resource used by the linkerd-destination deployment

Run the command kubectl get secret -n linkerd | grep destination | cut -d " " -f 1 to get the name of the Secret, and you should see output similar to: linkerd-destination-token-7snns. This value will be used in the next step.
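
If you like, the same command can capture the name into a shell variable for use when editing the yaml (the variable name here is just an example):

DEST_SECRET=$(kubectl get secret -n linkerd | grep destination | cut -d " " -f 1)
echo $DEST_SECRET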

4 Update the yaml for the linkerd-destination deployment

Next, use your editor to add the section of yaml below to the spec.template.spec.containers section of the linkerd-destination-debug.yaml file and replace the volumeMounts.name, volumes.name, and volumes.secret.secretName values with the string that you got from the step above.

      - image: gcr.io/linkerd-io/debug:stable-2.6.0
        imagePullPolicy: IfNotPresent
        name: linkerd-debug
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: linkerd-destination-token-7snns
          readOnly: true
...
      volumes:
      - name: linkerd-destination-token-7snns
        secret:
          defaultMode: 420
          secretName: linkerd-destination-token-7snns

Once the yaml is updated, use kubectl apply -f linkerd-destination-debug.yaml to update the linkerd-destination deployment.

You can confirm that both the service and linkerd-destination deployments have been properly updated when you see that there are three containers in the pod. Here's an example from emojivoto:

NAME                        READY   STATUS    RESTARTS   AGE
emoji-dd84f6f94-p6t6c       3/3     Running   0          89m
vote-bot-5648545766-bpz59   1/1     Running   0          104m
voting-7c846d6bb9-cwsvd     1/1     Running   0          104m
web-78bd8d45b8-sztnr        3/3     Running   0          89m

And the modified linkerd-destination deployment:

NAME                                      READY   STATUS    RESTARTS   AGE
...
linkerd-destination-5446545dc5-2swl5      3/3     Running   0          60m
linkerd-destination-5446545dc5-42l75      3/3     Running   0          60m
linkerd-destination-5446545dc5-bgnb9      3/3     Running   0          60m
... 

5 Add netstat to the linkerd-debug sidecar container in each of the deployments

The debug container doesn't have netstat installed, so it's necessary to do that manually.

First, use kubectl exec -it <pod of your service> -c linkerd-debug -- /bin/bash to get a shell to the linkerd-debug container.

Next, run apt update && apt install -y net-tools to install the netstat command.

Finally, verify that netstat is installed with netstat -anp to see all the connections.

After following those steps, you will have netstat installed on the linkerd-debug container which is running as a sidecar to one or more of your services, as well as the linkerd-destination deployment.

Once the behavior occurs again, use kubectl exec -it <pod> -- /bin/bash to get shells to the pod of the service that is failing _as well as_ the linkerd-debug container of a linkerd-destination pod and use netstat to observe the connection between the two.
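
As a sketch, the two views of the connection could be captured like this (the pod names are placeholders, and the assumption that the destination service listens on port 8086 should be verified against your own deployment):

kubectl -n <namespace> exec -it <failing pod> -c linkerd-debug -- netstat -anp | grep ESTABLISHED
kubectl -n linkerd exec -it <linkerd-destination pod> -c linkerd-debug -- netstat -anp | grep 8086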

There's a lot to unpack here, so please reach out to me with any questions.

@bjoernhaeuser @svenwltr @jamesallen-vol @nickjackson

@grampelberg did some digging on this and found that the kube-proxy can get into an inconsistent state and cause symptoms like those you have seen.

The next time that this happens, can you run the script in the gist below and let us know if it detects any kube-proxy pods in a bad state?

https://gist.github.com/grampelberg/60c5bc3db8d64f9404fb30bd25ec40b9
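
In addition to the script, a quick manual look at kube-proxy can't hurt (the label selector below is the common default and may differ in your distribution; the pod name is a placeholder):

kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs <kube-proxy pod> --tail=100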

Hello @cpretzer. Sorry for the late response.

We will run the script on the next failure. Is the data from the debug-sidecar still useful? We would need to work it into our deployment if this is the case.

@svenwltr yeah, the netstat info from the debug sidecar will be helpful in the event that the script shows that all the kube-proxy pods are healthy

Looking forward to your update 😄

@svenwltr any reoccurrences of the incident recently?

I've got a cluster that has been running for a few days and have been updating deployments with new versions, but still haven't triggered the behavior

Hello @cpretzer sorry for the late response.

We just ran into the issue again and I executed the script you gave us. It looks like it did not find anything:

% python3 ./kube-proxy-health.py
---------- fetching kube-proxy pod names
---------- finding current service update counts
---------- adding stub service
---------- finding updated service counts
---------- cleaning stub service
Everything looks okay!

Unfortunately we do not have the netstat data. We tried to enable the debug sidecar last week, which caused some trouble in our cluster, since the linkerd-destination pods were overwhelmed by the traffic it generated. We will try it again when we have some more time.

We are experiencing a similar issue, but we don't have any No route to host in our logs.

Running an nginx-ingress (that is not meshed) that forwards to a Kong 1.2.0 api-gateway (that is meshed).
I have noticed getting 503 errors from some of the api-gateway pods when curling our other APIs from that pod (simple REST APIs running Scala applications).
The other APIs are reached successfully from each other (and from other api-gateway pods).
When I get the 503 errors, the proxy of the api-gateway pod logs the following:

WARN [1030275.972407s] linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline

It might have been a fluke, but when I added Host: example.com to the curl request that was originally getting 503s, the request went through and I got 200s (and the "request aborted" line was not logged).
Haven't been able to test this any further as the issue hasn't happened since.

Some info about environment:

Running on AWS EKS with Kubernetes version 1.14.
The workers have version v1.14.7-eks-1861c5 and are launched with kubelet args:

    --runtime-cgroups=/kube.slice
    --eviction-hard="memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5%"
    --system-reserved="cpu=500m,memory=1Gi"
    --kube-reserved="cpu=1000m,memory=1Gi"
    --kube-reserved-cgroup=/kube.slice
    --system-reserved-cgroup=/system.slice
    --node-labels="kubernetes.io/lifecycle=spot,node-role.kubernetes.io/spot-worker=true,custom/worker-ready=true"

Linkerd 2.5.0 was installed with: linkerd install | kubectl apply -f -, and upgraded to 2.6.0 with linkerd upgrade --ha | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -
But if my memory serves me correctly, our test environment was simply installed with 2.6.0 via linkerd install --ha | kubectl apply -f -, and it happened there as well.

Doesn't feel like it is related to traffic, since our test environment has relatively low traffic.

The output of linkerd endpoints seems to align with kubectl get endpoints for the services.

Ran the script @cpretzer linked while the issue was happening, and it reported "Everything looks okay!"

@svenwltr ah, I was afraid that would happen. The debug container does add an extra sidecar per pod.

Would you mind opening a new issue with the details of how the destination pod was overwhelmed, including the number of pods that you have running the debug container?

@jnatten thanks for the detailed report. This information is really helpful in debugging the issue, especially the fact that adding the Host header resulted in 200s.

Just to clarify, is the host that you used example.com or was it a domain specific to your application?

The warning below means that the request never left the proxy:
WARN [1030275.972407s] linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline

I've got a cluster in a similar state, and I also see that running curl, netstat, and dig from the failing container times out, as if the hostname can't be resolved.

I've been investigating the iptables configurations of the hosts and everything _looks_ okay, and I plan to do more research into DNS issues.

Another step I took to debug the issue was to ssh to the node and docker exec into a container, and I see the same timeout behavior outside the context of kubectl exec. This isn't much of a surprise, but it helps to narrow down the issue.
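
For anyone wanting to run the same checks, something along these lines from the failing pod should show whether resolution or routing is the problem (the names are placeholders, and this assumes curl and dig are available in the container; otherwise use the debug sidecar):

kubectl -n <namespace> exec -it <failing pod> -- dig +short <service>.<namespace>.svc.cluster.local
kubectl -n <namespace> exec -it <failing pod> -- curl -sv --max-time 5 http://<service>.<namespace>:8080/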

@cpretzer I believe I could use any domain and the request would go through.

edit: Just verified that it works with any domain when I get the 503.

Our 2.6.1 installation is having this issue as well. We run on AWS EKS 1.14. The issue manifests itself seemingly at random, not necessarily when we update our deploys. In an example deployment of size 3, we saw that 1 pod was unresponsive (503) while the other two were fine. Restarting the unresponsive pod resolved the issue.

An example tap (which slight modifications to obscure details), with the 503:

req id=5:2 proxy=out src=10.90.228.87:60574 dst=10.90.202.42:1400 tls=true :method=POST :authority=service:1400 :path=/Service/Get
rsp id=5:2 proxy=out src=10.90.228.87:60574 dst=10.90.202.42:1400 tls=true :status=503 latency=2170µs
end id=5:2 proxy=out src=10.90.228.87:60574 dst=10.90.202.42:1400 tls=true duration=28µs response-length=0B
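
For reference, a capture like the one above can be collected with the tap command scoped to the affected workload (the resource names here are placeholders):

linkerd tap deploy/<source deployment> -n <namespace> --to deploy/<target deployment>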

The next time it happens, I'll grab the stats above. Thanks.

thanks @vincentjorgensen anything you can collect will be helpful

Did you observe any errors in the service at 10.90.202.42:1400?

We are also facing the same issue. It is quite annoying. We are running a cluster on GKE. It works for a few days, but after that almost all requests start failing. Any updates on this?

@psychonetic we're still looking for a reproducible test case. Which versions of Kubernetes and Linkerd are you running?

Any additional detail that you can provide about your environment and the types of workloads you are running will be helpful.

In addition, if you have logs that you can share, I can take a look through them.

The next time it happens, it would be helpful to get some output from the debug container as outlined here.

Newer versions of Linkerd (e.g., edge-20.3.4) have been updated to handle service discovery differently. If you're still experiencing these issues, I recommend annotating your workload with config.linkerd.io/proxy-version: edge-20.3.4. If you test this, please report back! These changes will be released soon in stable-2.7.1.
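
One way to apply that annotation, as a sketch (patching the pod template triggers a rollout, so new pods pick up the requested proxy version; the deployment and namespace are placeholders):

kubectl -n <namespace> patch deploy <deployment> --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-version":"edge-20.3.4"}}}}}'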

Noting that 2.7.1 has been released with that fix.

Thanks @jaitaiwan! I'm going to close this out now. If folks run into the same issue with 2.7.1, please open up a new issue with what you're seeing.
