Linkerd2: load balancer discovery error: discovery task failed

Created on 16 Jan 2020  ·  30 Comments  ·  Source: linkerd/linkerd2

Bug Report

What is the issue?

At some point linkerd-proxy can't reach some service (maybe it really did disappear), and then it can't connect to it until the proxy is restarted.

How can it be reproduced?

Not sure.

Logs, error output, etc

linkerd-proxy logs:

WARN [188674.700387s] outbound:accept{peer.addr=10.0.6.112:59298}:source{target.addr=10.0.17.202:80}:logical{addr=service-name:80}:making:profile:balance{addr=service-name.default.svc.cluster.local:80}: linkerd2_proxy_discover::buffer dropping resolution due to watchdog timeout timeout=60s
ERR! [262801.414146s] outbound:accept{peer.addr=10.0.6.112:59282}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262857.738884s] outbound:accept{peer.addr=10.0.6.112:33972}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262959.610891s] outbound:accept{peer.addr=10.0.6.112:39410}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")

linkerd check output

Status check results are √

Environment

  • Kubernetes Version: 1.15.5
  • Cluster Environment: AKS
  • Host OS: Ubuntu 16.04.6 LTS
  • Linkerd version: edge-20.1.1

Possible solution

Additional context

The service that the proxy can't reach is a service without a selector, and it's not meshed.
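
(For reference, a minimal sketch of such a service; the names and backend IP are hypothetical. A Service with no selector gets its endpoints from a manually managed Endpoints object of the same name:)

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: external-db              # hypothetical name
  namespace: default
spec:                            # no selector, so Kubernetes manages no Endpoints for it
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db              # must match the Service name
  namespace: default
subsets:
- addresses:
  - ip: 10.1.2.3                 # hypothetical backend IP, maintained by hand or by a controller
  ports:
  - port: 80
EOF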

Labels: area/proxy, bug

Most helpful comment

We're also having this issue. It happens intermittently for some pods. When you try to access the other service directly using its ClusterIP, you get a response; however, if you use the service's name, you get a 502 Bad Gateway error.

The Linkerd proxy sidecar shows the following logs:

[ 59763.318176661s]  WARN outbound:accept{peer.addr=100.110.0.6:57546}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 60010.715741422s]  WARN outbound:accept{peer.addr=100.110.0.6:60360}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

All 30 comments

Would you mind testing with stable-2.6.1 and seeing if you get the same behavior?

@grampelberg I'm using the cert-manager integration, and I'm not sure whether #3470 is in stable-2.6.1, so this is important for me.

We moved the proxy back last week, so give last week's edge a try and see if that fixes it.

In our test env we also saw this in stable-2.7.0:

[ 81965.930575036s]  WARN outbound:accept{peer.addr=10.0.0.157:50678}:source{target.addr=172.20.192.163:8100}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 81977.951660481s]  WARN outbound:accept{peer.addr=10.0.0.157:58822}:source{target.addr=172.20.192.163:8100}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

We are not using cert-manager integration.

Rolled back to 2.6.1.

We're seeing the same issue in stable-2.7.0:

[ 57036.647525313s]  WARN outbound:accept{peer.addr=10.0.20.153:56106}:source{target.addr=10.3.245.54:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 57099.306255914s]  WARN outbound:accept{peer.addr=10.0.20.153:56106}:source{target.addr=10.3.245.54:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 57101.168040710s]  WARN outbound:accept{peer.addr=10.0.20.153:56106}:source{target.addr=10.3.245.54:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

We also rolled back to 2.6.1 for now.

We had the same issue in stable-2.7.0. Rolled back to 2.6.1 and the problem didn't happen again.

We rolled back this morning, not seen any issues since.

We experienced these same symptoms when the default memory limit that shipped with Helm was too low for the control plane components on our clusters of >1k pods (smaller clusters were fine and didn't demonstrate this problem). Increasing the limit from 250Mi to 500Mi on both the destination and controller deployments seems to have helped.
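
(A hedged sketch of that change with kubectl; the deployment and container names assume a default Linkerd control-plane install, so adjust them to match your cluster:)

# Raise the memory limit on the destination container from 250Mi to 500Mi.
kubectl -n linkerd set resources deploy/linkerd-destination -c destination --limits=memory=500Mi
# Do the same for the public-api container in the controller deployment.
kubectl -n linkerd set resources deploy/linkerd-controller -c public-api --limits=memory=500Mi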

Any update? I got the same issue after running a load test.
Thanks.

After I gracefully shut down the pods, this issue was gone.
I'm not sure if it will help you guys.
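
(A minimal sketch of that recovery step, assuming a hypothetical meshed deployment named my-app; kubectl rollout restart requires kubectl 1.15+:)

# Recreate the pods, and with them fresh linkerd-proxy sidecars.
kubectl rollout restart deploy/my-app
# Wait for the new pods to become ready.
kubectl rollout status deploy/my-app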

Can confirm this happened quite a few times with 2.7.0 on GCP during load tests, for both gRPC and HTTP.
The tested pod accepts HTTP requests as input and makes gRPC requests to a backend service.
The tester pod uses the Kubernetes service name as the endpoint. After a few runs with pods and nodes scaling out and in, traffic is no longer routed, although no changes have been made to the service declarations or anything else.
The tester pod gets 502s (despite kubectl get endpoints showing the correct pods for the service) but is able to reach the backends.
The tested pods stop getting HTTP requests and throw a lot of gRPC errors like grpclib.exceptions.GRPCError: (<Status.INTERNAL: 13>, 'buffered service failed: load balancer discovery error: discovery task failed') when trying to reach the backend services.
Changing the ports on the service declarations fixes the issue; however, it's pretty "sticky" (reverting to the old port doesn't get requests routed again).
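
(For anyone trying to reproduce this, the proxy's discovery behavior can be inspected more closely by raising its log level with the config.linkerd.io/proxy-log-level annotation. A hedged sketch, assuming a meshed deployment named tester in the default namespace:)

# Raise the proxy log level on a meshed deployment; the patch triggers a rollout.
# "warn,linkerd2_proxy=debug" is an assumed filter for the 2.7-era proxy crates.
kubectl -n default patch deploy/tester -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd2_proxy=debug"}}}}}'
# Follow the new sidecar's logs while reproducing the failure.
kubectl -n default logs -f deploy/tester -c linkerd-proxy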

@vasi-grigo this is interesting info and I have some questions:

  • Approximately how many runs did you do?
  • When the pods and nodes scaled in and out, did you scale them down to zero at any time?

  • Hard to say exactly, but some 2-3 runs. I think the key thing is to have a lot of scale in/out events happen.
  • Nope; both autoscaling and HPAs are configured for a minimum of 2+ (nodes and pods respectively).

We have also experienced this issue on 2.7.0. We've downgraded to 2.6.1 and it appears to be OK now.

@warwick-mitchell1 @vasi-grigo @StupidScience the latest edge has some improvements to the proxy, and it would be really helpful if you could give feedback on that version.

@cpretzer We installed the latest edge version (edge-20.3.1), replacing 2.7.0, about two days ago, and we have not seen the service discovery issue since. Our mesh has about 50 services and 2k requests/s, so we got hit hard by that bug :).

That's great to hear. Please update this issue if you see any funny behavior.

The changes to the proxy will be included in the next stable release.

awesome!
@cpretzer is there an ETA for the next stable one?

Also having this issue with stable-2.7.0. Is there an easy-to-use downgrade option in Linkerd?

@sevaho If you're in a position to use the edge release, I'd suggest going forward.

That being said, you can use the linkerd upgrade command to roll back to a previous version.

I just tested this by using the 2.6.1 CLI on a cluster that was running 2.7.0. You can run the command: linkerd upgrade | kubectl apply -f -
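
(Putting that together, a sketch of the full rollback; the install script honors LINKERD2_VERSION, and the deployment name in the last step is hypothetical:)

# Fetch the stable-2.6.1 CLI.
curl -sL https://run.linkerd.io/install | LINKERD2_VERSION=stable-2.6.1 sh
export PATH=$HOME/.linkerd2/bin:$PATH
# Re-render the control plane at 2.6.1 and apply it.
linkerd upgrade | kubectl apply -f -
# Restart meshed workloads so new pods get the downgraded proxy injected.
kubectl rollout restart deploy/my-app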

We're also having this issue. It happens intermittently for some pods. When you try to access the other service directly using its ClusterIP, you get a response; however, if you use the service's name, you get a 502 Bad Gateway error.

The Linkerd proxy sidecar shows the following logs:

[ 59763.318176661s]  WARN outbound:accept{peer.addr=100.110.0.6:57546}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 60010.715741422s]  WARN outbound:accept{peer.addr=100.110.0.6:60360}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
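
(A hedged illustration of that comparison, run against a hypothetical service my-svc and a client pod labeled app=my-client with a container named app, assuming curl is available in the client image:)

CLUSTER_IP=$(kubectl get svc my-svc -o jsonpath='{.spec.clusterIP}')
POD=$(kubectl get pod -l app=my-client -o jsonpath='{.items[0].metadata.name}')
# Hitting the ClusterIP directly responds normally...
kubectl exec "$POD" -c app -- curl -s -o /dev/null -w '%{http_code}\n' "http://$CLUSTER_IP/"
# ...but the service name returns 502 once the proxy's discovery task has failed.
kubectl exec "$POD" -c app -- curl -s -o /dev/null -w '%{http_code}\n' http://my-svc.default.svc.cluster.local/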

@logbon72 Thanks for the report. Which version of Linkerd are you using?

If you're in a position to try the latest edge, that would be really helpful!

We were using stable-2.7.0, and have now downgraded to stable-2.6.1.

Unfortunately we can't test with an edge version at the moment, as we are pressed for time on our rollout, but I can try this sometime next week.

Thanks @logbon72. We should have another edge out this week, so testing next week would be awesome. ⭐️

This appears to have bitten us as well, thankfully during the bake-in period on our preproduction cluster. We've rolled back to 2.6.1 and the problem appears to have vanished.
proxylog.txt

We're seeing the same thing on 2.7.0. Happens very intermittently and not with all services.

Will this be fixed in the 2.7.1 release?

@jvandemark we'd love it if you tested the latest edge and gave us some feedback. We've definitely fixed it for the cases that we can reproduce locally.

This should be fixed now with stable-2.7.1. Please reopen if you're continuing to see this issue!
