All deployments where linkerd was injected errored out at the same time with:
proxy={server=out listen=127.0.0.1:4140 remote=10.10.45.84:41846} linkerd2_proxy::app::errors request aborted because it reached the configured dispatch deadline
I do not really know. We run 2.4.0 in an environment where there is a lot of container churn and only some deployments are meshed. It took around 12 hours of runtime (constant traffic) for the error to appear.
There is the mentioned log message in all deployments. Otherwise there is nothing.
linkerd check output:

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √
@bjoernhaeuser Thanks for bringing this to our attention. A few questions:
Further information from our _slack_ convo (for others' reference):
Are you seeing these warnings on both the client and server sides?
On both sides.
Any insights into the proxy container's resource utilization when this happens? Curious what it looks like after running for 12 hours in your environment.
Next to 0 cpu usage (< 0.04 cores), stable memory usage between 20mb and 50mb, depending on the deployment.
Re "There is the mentioned log message in all deployments.", you mean you are seeing the request aborted because it reached the configured dispatch deadline message in your application log?
No, not in the application logs. There I only see 503. What I meant is that all deployments which are meshed showed the same error message.
We suspect this might be related to https://github.com/linkerd/linkerd2/issues/2839. To help us confirm, when the problem happens again, can you grab a snapshot of the resource footprint of the destination container with:
linkerd -n linkerd metrics po/<linkerd-controller-pod-name>
If the destination container's memory usage is indeed very high, the current workaround is to restart the Linkerd controller pod. Keep us posted.
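For reference, a simple way to restart it is to delete the controller pod and let Kubernetes recreate it (substitute your own pod name):
kubectl -n linkerd delete po/<linkerd-controller-pod-name>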
In addition to the command that @ihcsim suggested (which will provide proxy metrics from the control plane) it would also be useful to see metrics directly from the destination service itself. You can get those by running:
kubectl -n linkerd port-forward deploy/linkerd-controller 9996 &
curl localhost:9996/metrics
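If the full metrics dump is too noisy, you can narrow it down to the memory-related series first, for example (assuming the standard Go process metrics are exposed, which they should be):
curl -s localhost:9996/metrics | grep -E 'process_resident_memory_bytes|go_memstats_heap'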
I'm also experiencing this. Our services can trigger it in a few minutes with gRPC Unary calls.
Here are the requested metrics.
@kdelorey you wouldn't happen to have a small example that we could run as well?
@kdelorey would you mind opening a new issue? I know that you're experiencing similar problems to what's described in this ticket, but I don't want to assume that the situation is identical. If it turns out that both issues have the same root cause, we can resolve both at once. Please make sure to include the controller metrics as described in my previous comment in addition to the proxy metrics you shared. Thanks!
@bjoernhaeuser any luck getting those destination service metrics?
@grampelberg I'll try to reproduce it with a small example and include it with a new issue based on @adleong's feedback.
@bjoernhaeuser any luck getting those destination service metrics?
We have had linkerd disabled in our system since the last outage. We will re-enable it soon for a minor service and hope to get the metrics. Stay tuned! :)
@bjoernhaeuser we really want to get this working for you. Standing by eagerly
I chatted with Sven on the Linkerd slack, he says they'll be able to add Linkerd back to a less-critical service next week and get us some metrics data.
Hi @bjoernhaeuser we've just released an edge that we hope has a fix for this issue, please give it a try and let us know what you find.
https://github.com/linkerd/linkerd2/releases/tag/edge-19.8.3
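If you're using a CLI-based install, the rough flow (a sketch; adjust for your setup) is to download the edge CLI and then run:
linkerd upgrade | kubectl apply -f -
Meshed workloads pick up the new proxy once their pods are restarted.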
FYI, I was also seeing those errors in the proxy logs, and have had 0 occurrences since upgrading to the latest edge release.
Oh, that's good news! This week we enabled linkerd again, by splitting up deployments so that only 33% to 50% of the traffic is served through deployments with linkerd enabled.
We are still using 2.4.0 and we will update this issue when we have new findings!
@admc can you refer me to the PR which has the fix for this which is included in the latest edge? Thanks!
@bjoernhaeuser I believe this was the fix in the Linkerd2 proxy:
https://github.com/linkerd/linkerd2-proxy/pull/307
...which was rolled up into the Linkerd2 repo in:
https://github.com/linkerd/linkerd2/pull/3235
@bjoernhaeuser rather than upgrade to the latest edge, you can upgrade to the latest stable release, Linkerd 2.5.0.
The fix for the proxy is included in that release.
Started to see that warning again. For now, I only see it for a single deployment (out of 20). That particular service is CPU-bound and requests can take between 1 and 5 seconds to complete.
(on Linkerd stable-2.5.0)
I have restarted the pods in that deployment and the warnings went away.
@bourquep thanks for the update. There are legitimate scenarios where the "request aborted because it reached the configured dispatch deadline" message is written to the log files.
If you see this happen again, please collect the logs so that we can see where the request was being sent when the dispatch deadline was reached. That will help us understand whether there is unexpected behavior or the message is being written correctly.
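Something like this is usually enough to capture them (assuming the default sidecar container name of linkerd-proxy):
kubectl -n <namespace> logs <pod-name> linkerd-proxy --timestamps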
Haven't seen those logs since I restarted the problematic pods yesterday. Keeping an eye on the logs.
@bourquep @bjoernhaeuser Have either of you seen this behavior since upgrading to Linkerd 2.5.0? If not, can you close this issue?
A few days ago (see above), I saw those logs while on 2.5.0. It has not happened again since then (after I restarted the pods in my failing deployment)
It's worth noting that it's possible for this error message to occur even with the fix from #3235.
If it happens again, can you collect details around events occurring in the environment around that time and reopen this issue?
I will!
I'm also having these warnings... any guess as to what it could be?
@yanngit Are you using resource requests for your workloads?
What do you mean?
I'm currently seeing these messages when I try to use my app (about 60 microservices) with the whole control plane down. As stated in the docs, the app should be able to keep working even if the control plane is down (and the mesh is not changing), and that's not what I'm observing.
@yanngit can you share a bit more detail about this?
It sounds like the control plane becomes unavailable and the linkerd-proxy begins to log the error message in the title of this issue.
Are your microservices receiving requests when the control plane is down? To answer your question, I believe that @grampelberg is asking if your deployments are configured to use memory and cpu requests: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-requests-are-scheduled
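One quick way to check what a deployment currently requests (a sketch, substitute your own namespace and deployment name):
kubectl -n <namespace> get deploy <deployment-name> -o jsonpath='{.spec.template.spec.containers[*].resources}'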
@cpretzer Yes, the API gateway is receiving some requests (it's a NodePort service) but then we observe a loss in gRPC communication... But this occurs only when we shut down the node where the linkerd control plane is. If we gently scale the control plane down to 0, everything seems to work.
@grampelberg yes, we are using resource requests for our deployments, something like this:
Requests:
cpu: 10m
memory: 128Mi
@yanngit this is helpful information. So, you're intentionally shutting down the control plane node to observe the behavior of linkerd when the control plane is not available. Does that sound right?
Can you describe more which gRPC communication is being lost? Based on what you've sent so far, I understand that you've got an API Gateway configured as a NodePort.
* Where does this API gateway direct traffic?
* What type of API gateway is it? For example, is it an open source project like NGINX or Ambassador?
* Do you have services which are not using gRPC? If so, do they exhibit a similar loss of connectivity?
* Is the API gateway injected with the linkerd proxy?
* Have you tried using the `linkerd tap` command to get information about the traffic being sent to the services?

Please provide as much detail as you can so that we can understand what might be happening in your environment.
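For the tap question, a command along these lines shows live request/response traffic (a sketch assuming your gateway runs as a deployment; adjust the resource name and namespace):
linkerd tap -n <namespace> deploy/<api-gateway-deployment>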
@yanngit this is helpful information. So, you're intentionally shutting down the control plane node to observe the behavior of linkerd when the control plane is not available. Does that sound right?
That's exactly it. We've had problems on our nodes and noticed the issue at that time, so we're making sure it won't happen again, and we can reproduce the problem every time we fail a node.
Can you describe more which gRPC communication is being lost? Based on what you've sent so far, I understand that you've got an API Gateway configured as a NodePort.
That's it
* Where does this API gateway direct traffic?
Traffic comes from an nginx outside of k8s to this API via NodePort, then there are multiple communications with other microservices inside the k8s cluster using gRPC.
* What type of API gateway is it? For example, is it an open source project like NGINX or Ambassador?
It's a Node.js web server + Express.
* Do you have services which are not using gRPC? If so, do they exhibit a similar loss of connectivity?
We do use Keycloak, which is non-gRPC, and we do not have similar problems with it.
* Is the API gateway injected with the linkerd proxy?
All our pods are
* Have you tried using the `linkerd tap` command to get information about the traffic being sent to the services?
Yes, we have, but it doesn't show anything useful at those times; sometimes there's nothing leaving the linkerd-proxy and still we can't establish the gRPC connection.
Please provide as much detail as you can so that we can understand what might be happening in your environment.
k8s 1.12 (but tried with 1.15 too, same problem), installed on VMware VMs via kubespray. Using Weave, but tried with Calico and had the same issue.
I'm not sure what else I could bring up as interesting elements.
@yanngit are you installing linkerd with the --ha option?
I attempted to reproduce this behavior with the following setup:
linkerd was installed with the --ha option. In this cluster, replicas of the linkerd control plane components were deployed to three instances other than the master node. I manually shut down the EC2 instance with sudo shutdown -h now and the linkerd control plane components on that node were eventually migrated to a new node.
If you can reliably reproduce this, can you send the output from the linkerd endpoints command when the proxies are logging these errors?
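For reference, linkerd endpoints takes the fully qualified authority of the destination service, e.g. (a hypothetical service name and port, adjust to one of your gRPC services):
linkerd endpoints my-grpc-svc.default.svc.cluster.local:8080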
...then there are multiple communications with other microservices inside the k8s cluster using gRPC
If I'm reading that correctly, then the gRPC communication that is lost is between the API gateway and the microservices to which it sends requests.
What errors are you seeing in the log files for your services?
In your initial comment on this thread, you wrote:
I'm also having these warnings... any guess as to what it could be?
Which proxies are logging these warnings? Is it the proxy injected into the API gateway pod? Or the proxies injected into the microservices pods? Are all proxies logging these errors?