Linkerd2: Unclear error message in linkerd2_proxy logs: "error trying to connect: request timed out"

Created on 23 Oct 2019 · 9 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

We have several dozen services deployed to Kubernetes and randomly see "request timed out" in the logs of random pods, even though the service is working and all health checks pass.

How can it be reproduced?

No idea. A while after a pod starts, the error message begins appearing in its logs (one random pod of many). When I enable "debug" mode for the Linkerd proxy, the pod restarts and the message disappears. I could not get the error from a pod with debug mode enabled.
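For context, a hedged sketch of why the restart happens: at that time the proxy log level could be set via a pod-template annotation, and changing the template rolls the pods. This assumes the config.linkerd.io/proxy-log-level annotation is supported by the installed edge, with $ns and $deploy as placeholders:

# setting the annotation triggers a rolling restart of the injected pods
kubectl patch deploy $deploy -n $ns -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"debug"}}}}}'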

Logs, error output, etc

linkerd-proxy ERR! [  4300.896499s] linkerd2_proxy::app::errors unexpected error: error trying to connect: request timed out

linkerd check output

All green, except a warning that the installed version is edge-19.9.3 while the latest edge version is edge-19.10.3.

Environment

  • Kubernetes Version:
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-825e5d", GitCommit:"825e5de08cb05714f9b224cd6c47d9514df1d1a7", GitTreeState:"clean", BuildDate:"2019-08-18T03:58:32Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cluster Environment: EKS
  • Host OS: Amazon Linux
  • Linkerd version:
Client version: edge-19.9.3
Server version: edge-19.9.3

Possible solution

Please make the error more informative.

Additional context

Labels: bug, needrepro

Most helpful comment

But it would be nice to have a little more informative message in the logs, or an ability to enable debug mode without restarting the proxy.

FWIW, debug logging can be enabled without restarting the proxy by running a command like this:

:; kubectl exec -n $ns $pod -- curl -X PUT --data 'info,linkerd2=debug' localhost:4191/proxy-log-level

where $ns and $pod are the namespace and pod of the proxy you want to change the log level for.
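As a follow-up sketch, assuming the same admin endpoint also answers GET requests and that the proxy's default filter is warn,linkerd2_proxy=info (verify both against your proxy version):

# read back the currently active log level
kubectl exec -n $ns $pod -- curl -s localhost:4191/proxy-log-level

# restore the default level once debugging is done
kubectl exec -n $ns $pod -- curl -X PUT --data 'warn,linkerd2_proxy=info' localhost:4191/proxy-log-level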

All 9 comments

This is totally pattern matching, but since I see you're on EKS: have you been having any DNS issues? We've seen folks run into somewhat similar issues because DNS is timing out (the error messages are in the process of getting better!).

Unfortunately I cannot research the problem because I don't see what exactly it is trying to connect to. The application is working fine when I see the problems in Linkerd. I feel that DNS is a very plausible suspect. DNS monitoring shows some weirdness in the charts, but it does not directly correlate with the issues I have, because the DNS charts don't really react after the pod is restarted. But it would be nice to have a little more informative message in the logs, or an ability to enable debug mode without restarting the proxy.
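One hedged way to test the DNS suspicion from inside the cluster (the dnsutils image and the $service/$ns names are placeholders):

# resolve a service name from a throwaway pod using the cluster's DNS
kubectl run -it --rm dns-test --restart=Never --image=tutum/dnsutils -- dig +short $service.$ns.svc.cluster.local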

The logging is going to get way better in the next-ish edge, so that'll be helpful.

Just for the record: I tried restarting all Linkerd pods, all coredns pods, and the aws-node pod on the host where I got the error logs. Nothing helped. I then tried to inspect the proxy process to understand what it was trying to reach, but the pod was restarted by Kubernetes in the middle, so it is still not clear.
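A hedged way to see what the proxy is talking to, without attaching to the process, is to read its metrics over the admin port on 4191; this assumes the proxy exposes a tcp_open_connections gauge, as recent proxies do:

# list open TCP connection gauges, labeled by peer and direction
kubectl exec -n $ns $pod -- curl -s localhost:4191/metrics | grep tcp_open_connections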

@ihcsim what's the slow DNS setting on AWS?

@grampelberg The VPC DHCP option sets, where some users replace the Amazon EC2 default domain name and name server with their own. This is only relevant if @KIVagant set up his own VPC.

@KIVagant One recent case I have seen is that the timeout happens when pods try to talk to other pods on different nodes, in a different subnet. Did you configure your own security groups and/or network ACLs?
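As a sketch of how those could be inspected from the CLI (with $vpc_id as a placeholder for the cluster's VPC):

# dump the security groups and network ACLs attached to the VPC
aws ec2 describe-security-groups --filters Name=vpc-id,Values=$vpc_id
aws ec2 describe-network-acls --filters Name=vpc-id,Values=$vpc_id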

But it would be nice to have a little more informative message in the logs, or an ability to enable debug mode without restarting the proxy.

FWIW, debug logging can be enabled without restarting the proxy by running a command like this:

:; kubectl exec -n $ns $pod -- curl -X PUT --data 'info,linkerd2=debug' localhost:4191/proxy-log-level

where $ns and $pod are the namespace and pod of the proxy you want to change the log level for.

@hawkw, thank you!
With the command you provided I found more details about the problem, and it is related to APM monitoring. The tracing port is not available (not a Linkerd issue, I believe).

DBUG [ 41486.851052s] linkerd2_proxy::transport::connect connecting to 10.56.37.70:8126
...
ERR! [ 41487.490315s] linkerd2_proxy::app::errors unexpected error: error trying to connect: request timed out

The first line contains the host and port but is printed only at the DBUG level. It would be very helpful if the same info were displayed in the second line as well.
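Until the error itself carries the peer address, a hedged workaround is to raise logging only for the connect target (the target name is taken from the debug line above; the per-target filter syntax is an assumption to verify):

# enable debug output for the transport::connect target only, to keep noise down
kubectl exec -n $ns $pod -- curl -X PUT --data 'info,linkerd2_proxy::transport::connect=debug' localhost:4191/proxy-log-level

# then pair each "connecting to <ip>:<port>" line with the timeout that follows it
kubectl logs -n $ns $pod -c linkerd-proxy | grep -E 'connecting to|request timed out'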

Feel free to close the issue if the logging is going to be generally reworked and this small case is no longer relevant.

Looks similar to https://github.com/DataDog/datadog-agent/issues/2932

@ihcsim, yes, the EKS cluster is set up in its own VPC, but everything works well. And from the logs I got with debug mode it has now become obvious that it is not a DNS problem. After I restarted the Datadog pod that was running on the host, everything started working again. The weird thing is that when the Datadog container was restarted, but not the pod, the problem was still there. But now the problem looks much more understandable.
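For completeness, a hedged way to verify the tracing port from the affected pod once the agent is back (nc may not exist in every image; the address is the one from the debug log above):

# check whether the Datadog trace agent port is reachable again
kubectl exec -n $ns $pod -- nc -vz -w 2 10.56.37.70 8126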

Thank you all! Without Linkerd it would be much harder even to notice the issue.

@KIVagant Glad to hear you figured this out. Thanks for the feedback on the log output. Closing this issue.
