All deployments where linkerd was injected errored out at the same time with:
proxy={server=out listen=127.0.0.1:4140 remote=10.10.45.84:41846} linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline
I do not really know. We run 2.4.0 in an environment where there is a lot of container churn and only some deployments are meshed. It took around 12 hours of runtime (constant traffic) for the error to appear.
There is the mentioned log message in all deployments. Otherwise there is nothing.
linkerd check output:

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √
@bjoernhaeuser Thanks for bringing this to our attention. A few questions:
Further information from our _slack_ convo (for others' reference):
Are you seeing these warnings on both the client and server sides?
On both sides.
Any insights into the proxy container's resource utilization when this happens? Curious what it looks like after running for 12 hours in your environment.
Next to 0 cpu usage (< 0.04 cores), stable memory usage between 20mb and 50mb, depending on the deployment.
Re "There is the mentioned log message in all deployments.", you mean you are seeing the request aborted because it reached the configured dispatch deadline message in your application log?
No, not in the application logs. There I only see 503. What I meant is that all deployments which are meshed showed the same error message.
We suspect this might be related to https://github.com/linkerd/linkerd2/issues/2839. To help us confirm, when the problem happens again, can you grab a snapshot of the resource footprint of the destination container with:
linkerd -n linkerd metrics po/<linkerd-controller-pod-name>
If the destination container's memory usage is indeed very high, the current workaround is to restart the Linkerd controller pod. Keep us posted.
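For reference, a simple way to restart it is to delete the controller pod and let Kubernetes recreate it (substitute your own pod name):
kubectl -n linkerd delete po/<linkerd-controller-pod-name>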
In addition to the command that @ihcsim suggested (which will provide proxy metrics from the control plane) it would also be useful to see metrics directly from the destination service itself. You can get those by running:
kubectl -n linkerd port-forward deploy/linkerd-controller 9996 &
curl localhost:9996/metrics
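If the full metrics dump is too noisy, you can narrow it down to the memory-related series first, for example (assuming the standard Go process metrics are exposed, which they should be):
curl -s localhost:9996/metrics | grep -E 'process_resident_memory_bytes|go_memstats_heap'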
I'm also experiencing this. Our services can trigger it in a few minutes with gRPC Unary calls.
Here are the requested metrics.
@kdelorey you wouldn't happen to have a small example that we could run as well?
@kdelorey would you mind opening a new issue? I know that you're experiencing similar problems to what's described in this ticket, but I don't want to assume that the situation is identical. If it turns out that both issues have the same root cause, we can resolve both at once. Please make sure to include the controller metrics as described in my previous comment in addition to the proxy metrics you shared. Thanks!
@bjoernhaeuser any luck getting those destination service metrics?
@grampelberg I'll try to reproduce it with a small example and include it with a new issue based on @adleong's feedback.
@bjoernhaeuser any luck getting those destination service metrics?
We have had linkerd disabled in our system since the last outage. We will re-enable it soon for a minor service and hope to get the metrics. Stay tuned! :)
@bjoernhaeuser we really want to get this working for you. Standing by eagerly
I chatted with Sven on the Linkerd slack, he says they'll be able to add Linkerd back to a less-critical service next week and get us some metrics data.
Hi @bjoernhaeuser we've just released an edge that we hope has a fix for this issue, please give it a try and let us know what you find.
https://github.com/linkerd/linkerd2/releases/tag/edge-19.8.3
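If you're using a CLI-based install, the rough flow (a sketch; adjust for your setup) is to download the edge CLI and then run:
linkerd upgrade | kubectl apply -f -
Meshed workloads pick up the new proxy once their pods are restarted.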
FYI, I was also seeing those errors in the proxy logs, and have had 0 occurrences since upgrading to the latest edge release.
Oh, that's good news! This week we enabled linkerd again, by splitting up deployments so that only 33% to 50% of the traffic is served through deployments with linkerd enabled.
We are still using 2.4.0 and we will update this issue when we have new findings!
@admc can you refer me to the PR which has the fix for this which is included in the latest edge? Thanks!
@bjoernhaeuser I believe this was the fix in the Linkerd2 proxy:
https://github.com/linkerd/linkerd2-proxy/pull/307
...which was rolled up into the Linkerd2 repo in:
https://github.com/linkerd/linkerd2/pull/3235
@bjoernhaeuser rather than upgrade to the latest edge, you can upgrade to the latest stable release, Linkerd 2.5.0.
The fix for the proxy is included in that release.
Started to see that warning again. For now, I only see it for a single deployment (out of 20). That particular service is CPU-bound and requests can take between 1 and 5 seconds to complete.
(on Linkerd stable-2.5.0)
I have restarted the pods in that deployment and the warnings went away.
@bourquep thanks for the update. There are legitimate scenarios where the "request aborted because it reached the configured dispatch deadline" message is written to the log files.
If you see this happen again, please collect the logs so that we can see where the request was being sent when the dispatch deadline was reached. That will help us understand whether there is unexpected behavior or the message is being written correctly.
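Something like this is usually enough to capture them (assuming the default sidecar container name of linkerd-proxy):
kubectl -n <namespace> logs <pod-name> linkerd-proxy --timestamps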
Haven't seen those logs since I restarted the problematic pods yesterday. Keeping an eye on the logs.
@bourquep @bjoernhaeuser Have either of you seen this behavior since upgrading to Linkerd 2.5.0? If not, can you close this issue?
A few days ago (see above), I saw those logs while on 2.5.0. It has not happened again since then (after I restarted the pods in my failing deployment)
It's worth noting that it's possible for this error message to occur even with the fix from #3235.
If it happens again, can you collect details around events occurring in the environment around that time and reopen this issue?
I will!
I'm also having these warnings... any guess as to what it could be?
@yanngit Are you using resource requests for your workloads?
What do you mean?
I'm currently seeing these messages when I try to use my app (about 60 microservices) with the whole control plane down. As stated in the docs, the app should be able to keep working even if the control plane is down (and the mesh is not changing), and that's not what I'm observing.
@yanngit can you share a bit more detail about this?
It sounds like the control plane becomes unavailable and the linkerd-proxy begins to log the error message in the title of this issue.
Are your microservices receiving requests when the control plane is down? To answer your question, I believe that @grampelberg is asking if your deployments are configured to use memory and cpu requests: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-requests-are-scheduled
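One quick way to check what a deployment currently requests (a sketch, substitute your own namespace and deployment name):
kubectl -n <namespace> get deploy <deployment-name> -o jsonpath='{.spec.template.spec.containers[*].resources}'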
@cpretzer Yes, the API gateway is receiving some requests (it's a NodePort service) but then we observe a loss in gRPC communication... But this occurs only when we shut down the node where the linkerd control plane is. If we gently scale the control plane down to 0, everything seems to work.
@grampelberg yes, we are using resource requests for our deployments, something like this:
Requests:
cpu: 10m
memory: 128Mi
@yanngit this is helpful information. So, you're intentionally shutting down the control plane node to observe the behavior of linkerd when the control plane is not available. Does that sound right?
Can you describe more which gRPC communication is being lost? Based on what you've sent so far, I understand that you've got an API Gateway configured as a NodePort.
* Where does this API gateway direct traffic?
* What type of API gateway is it? For example, is it an open source project like NGINX or Ambassador?
* Do you have services which are not using gRPC? If so, do they exhibit a similar loss of connectivity?
* Is the API gateway injected with the linkerd proxy?
* Have you tried using the `linkerd tap` command to get information about the traffic being sent to the services?

Please provide as much detail as you can so that we can understand what might be happening in your environment.
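For the tap question, a command along these lines shows live request/response traffic (a sketch assuming your gateway runs as a deployment; adjust the resource name and namespace):
linkerd tap -n <namespace> deploy/<api-gateway-deployment>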
@yanngit this is helpful information. So, you're intentionally shutting down the control plane node to observe the behavior of linkerd when the control plane is not available. Does that sound right?
That's exactly it. We've had problems on our nodes and noticed the issue at that time, so we're making sure it won't happen again, and we can reproduce the problem every time we fail a node.
Can you describe more which gRPC communication is being lost? Based on what you've sent so far, I understand that you've got an API Gateway configured as a NodePort.
That's it
* Where does this API gateway direct traffic?
Traffic comes from an nginx outside of k8s to this API via NodePort, then there are multiple communications with other microservices inside the k8s cluster using gRPC.
* What type of API gateway is it? For example, is it an open source project like NGINX or Ambassador?
It's a Node.js web server + Express.
* Do you have services which are not using gRPC? If so, do they exhibit a similar loss of connectivity?
We do use Keycloak, which is non-gRPC, and we do not have similar problems with it.
* Is the API gateway injected with the linkerd proxy?
All our pods are
* Have you tried using the `linkerd tap` command to get information about the traffic being sent to the services?
Yes, we have, but it doesn't show anything useful at those times; sometimes there's nothing leaving the linkerd-proxy and still we can't establish the gRPC connection.
Please provide as much detail as you can so that we can understand what might be happening in your environment.
k8s 1.12 (but tried with 1.15 too, same problem), installed on VMware VMs via kubespray. Using Weave, but tried with Calico and had the same issue.
I'm not sure what else I could bring up as interesting elements.
@yanngit are you installing linkerd with the --ha option?
I attempted to reproduce this behavior with the following setup:
linkerd was installed with the --ha option. In this cluster, replicas of the linkerd control plane components were deployed to three instances other than the master node. I manually shut down the EC2 instance with sudo shutdown -h now and the linkerd control plane components on that node were eventually migrated to a new node.
If you can reliably reproduce this, can you send the output from the linkerd endpoints command when the proxies are logging these errors?
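For reference, linkerd endpoints takes the fully qualified authority of the destination service, e.g. (a hypothetical service name and port, adjust to one of your gRPC services):
linkerd endpoints my-grpc-svc.default.svc.cluster.local:8080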
...then there are multiple communications with other microservices inside the k8s cluster using gRPC
If I'm reading that correctly, then the gRPC communication that is lost is between the API gateway and the microservices to which it sends requests.
What errors are you seeing in the log files for your services?
In your initial comment on this thread, you wrote:
I'm also having these warnings... any guess as to what it could be?
Which proxies are logging these warnings? Is it the proxy injected into the API gateway pod? Or the proxies injected into the microservices pods? Are all proxies logging these errors?