Linkerd2: load balancer discovery error: discovery task failed

Created on 16 Jan 2020  ·  30 Comments  ·  Source: linkerd/linkerd2

Bug Report

What is the issue?

At some point linkerd-proxy can't reach some service (maybe it really did disappear), and then it can't connect to it until the proxy is restarted.

How can it be reproduced?

Not sure.

Logs, error output, etc

linkerd-proxy logs:

WARN [188674.700387s] outbound:accept{peer.addr=10.0.6.112:59298}:source{target.addr=10.0.17.202:80}:logical{addr=service-name:80}:making:profile:balance{addr=service-name.default.svc.cluster.local:80}: linkerd2_proxy_discover::buffer dropping resolution due to watchdog timeout timeout=60s
ERR! [262801.414146s] outbound:accept{peer.addr=10.0.6.112:59282}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262857.738884s] outbound:accept{peer.addr=10.0.6.112:33972}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")
ERR! [262959.610891s] outbound:accept{peer.addr=10.0.6.112:39410}:source{target.addr=10.0.17.197:80}: linkerd2_app_core::errors unexpected error: Inner("load balancer discovery error: discovery task failed")

linkerd check output

Status check results are √

Environment

  • Kubernetes Version: 1.15.5
  • Cluster Environment: AKS
  • Host OS: Ubuntu 16.04.6 LTS
  • Linkerd version: edge-20.1.1

Possible solution

Additional context

The service that the proxy can't reach is a service without a selector, and it's not meshed.
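
(For reference, a minimal sketch of such a service; the names and backend IP are hypothetical. A Service with no selector gets its endpoints from a manually managed Endpoints object of the same name:)

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: external-db              # hypothetical name
  namespace: default
spec:                            # no selector, so Kubernetes manages no Endpoints for it
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db              # must match the Service name
  namespace: default
subsets:
- addresses:
  - ip: 10.1.2.3                 # hypothetical backend IP, maintained by hand or by a controller
  ports:
  - port: 80
EOF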

Labels: area/proxy, bug

Most helpful comment

We're also having this issue. It happens intermittently for some pods. When you try to access the other service directly using its ClusterIP, you get a response; however, if you use the service's name, you get a 502 Bad Gateway error.

The Linkerd proxy sidecar shows the following logs:

[ 59763.318176661s]  WARN outbound:accept{peer.addr=100.110.0.6:57546}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 60010.715741422s]  WARN outbound:accept{peer.addr=100.110.0.6:60360}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

All 30 comments

Would you mind testing with stable-2.6.1 and seeing if you get the same behavior?

@grampelberg I'm using the cert-manager integration, and I'm not sure whether #3470 is in stable-2.6.1, so this is important for me.

We moved the proxy back last week, so give last week's edge a try and see if that fixes it.

In our test env we also saw this in stable-2.7.0:

[ 81965.930575036s]  WARN outbound:accept{peer.addr=10.0.0.157:50678}:source{target.addr=172.20.192.163:8100}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 81977.951660481s]  WARN outbound:accept{peer.addr=10.0.0.157:58822}:source{target.addr=172.20.192.163:8100}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

We are not using cert-manager integration.

Rolled back to 2.6.1.

We're seeing the same issue in stable-2.7.0:

[ 57036.647525313s]  WARN outbound:accept{peer.addr=10.0.20.153:56106}:source{target.addr=10.3.245.54:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 57099.306255914s]  WARN outbound:accept{peer.addr=10.0.20.153:56106}:source{target.addr=10.3.245.54:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 57101.168040710s]  WARN outbound:accept{peer.addr=10.0.20.153:56106}:source{target.addr=10.3.245.54:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed

We also rolled back to 2.6.1 for now.

We had the same issue in stable-2.7.0. Rolled back to 2.6.1 and the problem didn't happen again.

We rolled back this morning, not seen any issues since.

We experienced these same symptoms when the default memory limit that shipped with Helm was too low for the control plane components on our clusters of >1k pods (smaller clusters were fine and didn't demonstrate this problem). Increasing the limit from 250Mi to 500Mi on both the destination and controller deployments seems to have helped.
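
(A hedged sketch of that change with kubectl; the deployment and container names assume a default Linkerd control-plane install, so adjust them to match your cluster:)

# Raise the memory limit on the destination container from 250Mi to 500Mi.
kubectl -n linkerd set resources deploy/linkerd-destination -c destination --limits=memory=500Mi
# Do the same for the public-api container in the controller deployment.
kubectl -n linkerd set resources deploy/linkerd-controller -c public-api --limits=memory=500Mi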

Any update? I got the same issue after running a load test.
Thanks.

After I gracefully shut down the pods, this issue was gone.
I'm not sure if it will help you guys.
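
(A minimal sketch of that recovery step, assuming a hypothetical meshed deployment named my-app; kubectl rollout restart requires kubectl 1.15+:)

# Recreate the pods, and with them fresh linkerd-proxy sidecars.
kubectl rollout restart deploy/my-app
# Wait for the new pods to become ready.
kubectl rollout status deploy/my-app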

Can confirm this happened quite a few times with 2.7.0 on GCP during load tests, for both gRPC and HTTP.
The tested pod accepts HTTP requests as input and makes gRPC requests to a backend service.
The tester pod uses the Kubernetes service name as the endpoint. After a few runs with pods and nodes scaling out and in, traffic is no longer routed, although no changes have been made to the service declarations or anything else.
The tester pod gets 502s (despite kubectl get endpoints showing the correct pods for the service) but is able to reach the backends.
The tested pods stop getting HTTP requests and throw a lot of gRPC errors like grpclib.exceptions.GRPCError: (<Status.INTERNAL: 13>, 'buffered service failed: load balancer discovery error: discovery task failed') when trying to reach the backend services.
Changing the ports on the service declarations fixes the issue; however, it's pretty "sticky" (reverting to the old port doesn't get requests routed again).
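
(For anyone trying to reproduce this, the proxy's discovery behavior can be inspected more closely by raising its log level with the config.linkerd.io/proxy-log-level annotation. A hedged sketch, assuming a meshed deployment named tester in the default namespace:)

# Raise the proxy log level on a meshed deployment; the patch triggers a rollout.
# "warn,linkerd2_proxy=debug" is an assumed filter for the 2.7-era proxy crates.
kubectl -n default patch deploy/tester -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd2_proxy=debug"}}}}}'
# Follow the new sidecar's logs while reproducing the failure.
kubectl -n default logs -f deploy/tester -c linkerd-proxy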

@vasi-grigo this is interesting info and I have some questions:

  • Approximately how many runs did you do?
  • When the pods and nodes scaled in and out, did you scale them down to zero at any time?

  • Hard to say exactly, but some 2-3 runs. I think the key thing is to have a lot of scale in/out events happen.
  • Nope; both autoscaling and HPAs are configured for a minimum of 2+ (nodes and pods respectively).

We have also experienced this issue on 2.7.0. We've downgraded to 2.6.1 and it appears to be OK now.

@warwick-mitchell1 @vasi-grigo @StupidScience the latest edge has some improvements to the proxy, and it would be really helpful if you could give feedback on that version.

@cpretzer We installed the latest edge version (edge-20.3.1), replacing 2.7.0, about two days ago, and we have not seen the service discovery issue since. Our mesh has about 50 services and 2k requests/s, so we got hit hard by that bug :).

That's great to hear. Please update this issue if you see any funny behavior.

The changes to the proxy will be included in the next stable release.

awesome!
@cpretzer is there an ETA for the next stable one?

Also having this issue with stable-2.7.0. Is there an easy-to-use downgrade option in Linkerd?

@sevaho If you're in a position to use the edge release, I'd suggest going forward.

That being said, you can use the linkerd upgrade command to roll back to a previous version.

I just tested this by using the 2.6.1 CLI on a cluster that was running 2.7.0. You can run the command: linkerd upgrade | kubectl apply -f -
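
(Putting that together, a sketch of the full rollback; the install script honors LINKERD2_VERSION, and the deployment name in the last step is hypothetical:)

# Fetch the stable-2.6.1 CLI.
curl -sL https://run.linkerd.io/install | LINKERD2_VERSION=stable-2.6.1 sh
export PATH=$HOME/.linkerd2/bin:$PATH
# Re-render the control plane at 2.6.1 and apply it.
linkerd upgrade | kubectl apply -f -
# Restart meshed workloads so new pods get the downgraded proxy injected.
kubectl rollout restart deploy/my-app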

We're also having this issue. It happens intermittently for some pods. When you try to access the other service directly using its ClusterIP, you get a response; however, if you use the service's name, you get a 502 Bad Gateway error.

The Linkerd proxy sidecar shows the following logs:

[ 59763.318176661s]  WARN outbound:accept{peer.addr=100.110.0.6:57546}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
[ 60010.715741422s]  WARN outbound:accept{peer.addr=100.110.0.6:60360}:source{target.addr=100.66.25.52:80}: linkerd2_app_core::errors: Failed to proxy request: buffered service failed: load balancer discovery error: discovery task failed
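
(A hedged illustration of that comparison, run against a hypothetical service my-svc and a client pod labeled app=my-client with a container named app, assuming curl is available in the client image:)

CLUSTER_IP=$(kubectl get svc my-svc -o jsonpath='{.spec.clusterIP}')
POD=$(kubectl get pod -l app=my-client -o jsonpath='{.items[0].metadata.name}')
# Hitting the ClusterIP directly responds normally...
kubectl exec "$POD" -c app -- curl -s -o /dev/null -w '%{http_code}\n' "http://$CLUSTER_IP/"
# ...but the service name returns 502 once the proxy's discovery task has failed.
kubectl exec "$POD" -c app -- curl -s -o /dev/null -w '%{http_code}\n' http://my-svc.default.svc.cluster.local/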

@logbon72 Thanks for the report. Which version of Linkerd are you using?

If you're in a position to try the latest edge, that would be really helpful!

We were using stable-2.7.0, and have now downgraded to stable-2.6.1.

Unfortunately we can't test with an edge version at the moment, as we are pressed for time on our rollout, but I can try this sometime next week.

Thanks @logbon72. We should have another edge out this week, so testing next week would be awesome. ⭐️

This appears to have bitten us as well, thankfully during the bake-in period on our preproduction cluster. We've rolled back to 2.6.1 and the problem appears to have vanished.
proxylog.txt

We're seeing the same thing on 2.7.0. Happens very intermittently and not with all services.

Will this be fixed in the 2.7.1 release?

@jvandemark we'd love it if you tested the latest edge and gave us some feedback. We've definitely fixed it for the cases that we can reproduce locally.

This should be fixed now with stable-2.7.1. Please reopen if you're continuing to see this issue!
