Linkerd2: [v0.3.1] Seeing intermittent "internal server error" in Dashboard

Created on 16 Mar 2018  ·  29 Comments  ·  Source: linkerd/linkerd2

This is on my test cluster, with almost no traffic. Just leaving a deployment page open in my browser, I see this red banner appear/disappear randomly. All my services are behaving properly though.


I have checked the logs of each container in the controller pod, nothing unusual there.

I'll be happy to provide more info if you tell me where to look!

priority/P1

All 29 comments

The controller now has the conduit-proxy injected as a sidecar for itself too, so it's possible the dashboard is seeing an error from that proxy, since it returns a 500 for internal errors. Whenever the proxy turns an internal error into a 500, it now logs a better description of what caused it (of course, if the other side returned a 500, the proxy just passes that along, as it's not an internal proxy error).

So if I wanted to get to the root of this error, where would I look?

@bourquep Try

kubectl logs -n conduit web-$XXXX conduit-proxy

where web-$XXXX is the name of the web pod in the conduit namespace.

Those are the only lines in this log. All conduit pods have a restart count of 0 since installing 0.3.1.

INFO conduit_proxy using controller at HostAndPort { host: DnsName("proxy-api.conduit.svc.cluster.local"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("proxy-api.conduit.svc.cluster.local") to 10.35.241.37
ERR! conduit_proxy::control "controller-client", controller error: Error attempting to establish underlying session layer: operation timed out after Duration { secs: 3, nanos: 0 }
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("proxy-api.conduit.svc.cluster.local") to 10.35.241.37

Okay, let's look at the logs from the conduit-proxy container in the public-api pod in the conduit namespace.

There is no public-api pod, but there is a public-api container in the controller pod.

Here are the logs of all containers in the controller pod:

$ k logs -n conduit controller-598b867959-5qw44 public-api                                                                                                               
time="2018-03-16T18:01:16Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:16Z" level=info msg="serving scrapable metrics on :9995"
time="2018-03-16T18:01:16Z" level=info msg="starting HTTP server on :8085"

$ k logs -n conduit controller-598b867959-5qw44 proxy-api                                                                                                                                             
time="2018-03-16T18:01:17Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:17Z" level=info msg="serving scrapable metrics on :9996"
time="2018-03-16T18:01:17Z" level=info msg="starting gRPC server on :8086"

$ k logs -n conduit controller-598b867959-5qw44 conduit-proxy                                                                                                                                         
INFO conduit_proxy using controller at HostAndPort { host: DnsName("localhost"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("localhost") to 127.0.0.1

$ k logs -n conduit controller-598b867959-5qw44 telemetry                                                                                                                                             
time="2018-03-16T18:01:18Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:19Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:20Z" level=info msg="[pods watcher] initialized"
time="2018-03-16T18:01:20Z" level=info msg="[replicasets watcher] initialized"
time="2018-03-16T18:01:20Z" level=info msg="serving scrapable metrics on:9997"
time="2018-03-16T18:01:20Z" level=info msg="starting gRPC server on 127.0.0.1:8087"

$ k logs -n conduit controller-598b867959-5qw44 destination                                                                                                                                           
time="2018-03-16T18:01:17Z" level=info msg="running conduit version v0.3.1"
E0316 18:01:17.246299       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
E0316 18:01:18.248587       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
E0316 18:01:19.249303       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
time="2018-03-16T18:01:20Z" level=info msg="[endpoints watcher] initialized"
time="2018-03-16T18:01:21Z" level=info msg="[endpoints watcher] initialized"
serving scrapable metrics on :9999
time="2018-03-16T18:01:21Z" level=info msg="starting gRPC server on 127.0.0.1:8089"
time="2018-03-16T18:01:53Z" level=info msg="Establishing watch on endpoint conduit/prometheus:9090"
time="2018-03-16T18:01:53Z" level=info msg="Establishing watch on endpoint conduit/api:8085"
time="2018-03-16T18:05:44Z" level=info msg="Establishing watch on endpoint studyo/generator:5000"
[....]

> There is no public-api pod, but there is a public-api container in the controller pod.

Sorry, that's what I meant, my bad.

Looking at the logs you posted, it's possible the 500 error results from the failure to watch pods in the telemetry service, though I'm not too familiar with this part of Conduit, so I'd have to take a look at the code to confirm. I suspect @siggy or @kl might have more insight...

Keep in mind that this does not seem to affect the proxies and proxied services, just some intermittent glitch in the telemetry/dashboard. Absolutely not a big deal from my (user's) point of view, but probably worth investigating nonetheless to make sure it's not a symptom of something worse lurking somewhere. :)

I'll be happy to provide whatever logs/info you might need.

Yeah, I suspect that what's going on here is that the telemetry service is intermittently erroring because it's getting a Connection Refused response from the k8s API when it tries to get a list of pods to associate metrics with, and briefly fails requests for new metrics states from the frontend. Then the request goes through again, and the banner goes away.

Thanks for sharing these logs. The telemetry service's logs look normal to me, and they don't indicate that the service has failed to serve any requests. This message is expected:

time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"

It's only logged at startup, when the service is attempting to read from the kubernetes API for the first time. It's expected that the first few requests will fail due to the changes from #365, so the process just logs the failure and retries.
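The log-and-retry startup behavior described above can be sketched as a simple loop. This is an illustrative sketch only, not the actual Conduit code: `initWatch` is a hypothetical stand-in for establishing the Kubernetes watch, which fails until the API server becomes reachable.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// initWatch is a hypothetical stand-in for establishing a Kubernetes watch.
// It simulates the API server refusing connections for the first few attempts.
func initWatch(attempt int) error {
	if attempt < 3 {
		return errors.New("dial tcp: connect: connection refused")
	}
	return nil
}

// retryUntilReady mirrors the startup behavior described above: log each
// failure and retry until the watch is established or attempts run out.
func retryUntilReady(maxAttempts int, delay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = initWatch(attempt); err == nil {
			fmt.Println("[pods watcher] initialized")
			return nil
		}
		fmt.Printf("[pods watcher] error establishing watch: %v\n", err)
		time.Sleep(delay)
	}
	return err
}

func main() {
	if err := retryUntilReady(10, 10*time.Millisecond); err != nil {
		panic(err)
	}
}
```

This is why a burst of "connection refused" errors at startup, followed by an "initialized" line, is expected and harmless.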

If you see this happen again, could you try finding the error response in your web inspector's network tab? I suspect that the response body would give us some indication of where the failure is happening.

The telemetry system has been replaced entirely in v0.4.0. I'd be surprised if this behavior persists, but it would be good to confirm it does not once the release is out.

I've been seeing this issue as well recently. It appears that the XHR request to /api/stat is returning a 500, which triggers the banner. The 500 response body is:

{"error":"Post http://api.conduit.svc.cluster.local:8085/api/v1/StatSummary: http: server closed idle connection"}

So that indicates that the request from the web pod to the controller pod is failing. Sure enough I see the same error printed in the web container's log:

$ kubectl -n conduit logs web-657b744c64-mzkvz web
...
http://api.conduit.svc.cluster.local:8085/api/v1/StatSummary: http: server closed idle connection"

But I don't see any other relevant info in the web pod's outbound proxy logs, or the controller pod's inbound proxy logs, or the public-api logs.

It's also worth noting that the web pod's inbound proxy stats reflect decreased success rate:

$ conduit stat -n conduit deploy/web -t 1h 
NAME   MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   SECURED
web       1/1    99.54%   0.3rps           5ms          10ms          10ms        0%

But the outbound stats from the web to the controller pod do not:

$ conduit stat -n conduit deploy --from deploy/web -t 1h 
NAME         MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   SECURED
controller      1/1   100.00%   0.2rps           0ms           0ms           0ms        0%

This still seems to exist in the new Linkerd UI as well. It didn't seem to affect anything, just strange :).

Based on the error message, it seems like we could set the public API client's IdleConnTimeout to 0 (unlimited), and that might fix this issue. The default IdleConnTimeout is 90 seconds.

    // IdleConnTimeout is the maximum amount of time an idle
    // (keep-alive) connection will remain idle before closing
    // itself.
    // Zero means no limit.
    IdleConnTimeout time.Duration

That would actually make a lot of sense. @klingerf, I can open the dashboard and set a timer if you'd like, to see if that would solve it…

@christopherhein In the process of removing the client's idle connection timeout, I realized that the error message indicates that the server is closing the connection, not the client. So I think what we're actually seeing is an instance of https://github.com/golang/go/issues/19943#issuecomment-355607646. All of our public API requests are POSTs, which means that the go client won't retry on server connection close, making us susceptible to this behavior. Will dig around a bit more to see if there's a different fix we can implement on the client side.

Ah, that is very interesting. It seems to recover itself the next time around, which in my mind means it could be something we catch and retry, but that is without fully understanding the context. @klingerf

@christopherhein Ok cool, I added a client retry in kl/retry-server-close, and I've published linkerd images from that branch. Mind giving it a try in your environment?

linkerd install --linkerd-version git-4470c8fb | kubectl apply -f -

I've never had a lot of luck reproducing this failure in my env. If it seems to work for you, I'll put this change up for review.

Sounds great, let me bring the env back up. Just to be clear when I use the --linkerd-version that will grab that version from Github, I don't need to update my local binary?

That's right -- the --linkerd-version flag will install that version in the linkerd namespace in your kubernetes cluster. It won't affect your CLI version, but the change that I made only impacts the web server, so it should be no prob to install a new controller using the older CLI.

Awesome, so I got that running, but it still seems to error. I'm getting it on both /tps-reports and api/pods. Is there any chance this is trying to use metrics from the API server's /metrics endpoint?

And has anyone experienced this on anything other than EKS…

@christopherhein Thanks for testing it out! Too bad this doesn't fix it, but will keep investigating. Fwiw I think there may be two separate issues that trigger the red error banner. Sometimes I see that error immediately on initial page load, and other times I see it pop up sporadically after the page has been running for a while. I was expecting this branch to fix the latter error, and I'm still not sure what causes the former.

Hmmm, interesting. For me it comes up at some interval; I don't really see it on first load. I'll check it out, set a timer, and see if it correlates to the 90-second timeout, even though I think that's what you removed.

Second test: I just opened up the dashboard, and for the TPS Reports endpoint it might have been fixed. I'm still getting it on /api/pods. I also set a timer, and it's every ~25 seconds.

Hey @christopherhein, really sorry for the delay here. Thanks for testing out those builds last week!

I've done some testing in a GKE cluster, and it appears to me that the "server closed idle connection" error was fixed as part of the v18.7.2 release that we cut earlier this week. Specifically, linkerd/linkerd2-proxy#26 was included in the release, and that fixed an issue with HTTP connection reuse, which I suspect was triggering the "server closed idle connection" error.

Can you give the v18.7.2 release a shot to see if it fixes the issue for you as well? You can install the updated CLI with:

curl https://run.conduit.io/install | sh

Note that we display the red error banner in the web UI whenever a 5xx error is returned from the web server, and the fix in v18.7.2 only addresses one source of 5xx errors. I've opened #1366 to track adding more information to the error banner, which will help us distinguish between different types of errors going forward. In the meantime, if you see a 5xx error and are able to paste the response body into this issue, that would be super helpful. Thanks!

This is fantastic, thank you @klingerf. I've been running it silently in the background for the last 5 minutes and nothing has shown up in the console yet… 👏 great work!

@christopherhein Great news, glad to hear it!

I'm going to go ahead and close this. Thanks again @christopherhein for your help tracking it down. We can open separate issues if the banner reappears, hopefully with more specificity once #1366 is fixed.
