Linkerd2: linkerd-proxy connection refused/closed/reset errors.

Created on 4 Dec 2020 · 13 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

I am trying to find the root cause of increased connection drops/failures in our Linkerd-injected apps. The failures happen only rarely, but I noticed a lot of connection-related errors in the proxy logs across all our services, for example:

  • Connection closed error=Service in fail-fast
  • Connection reset by peer (os error 104)
  • Connection refused (os error 111)
  • Transport endpoint is not connected (os error 107)

I am wondering whether Linkerd has anything to do with the increased failures in our apps. I'm stuck finding the root cause, so I would really appreciate it if someone could explain more about these connection errors.

How can it be reproduced?

I am not able to reproduce the issue; the increased connection failures happen only rarely, and there are a lot of TCP connection closed/refused errors in the proxy logs, as seen below.

Logs, error output, etc

```
Dec 2, 2020 @ 06:45:47.043  [ 80905.816738s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:39120 target.addr=10.244.57.132:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast
Dec 2, 2020 @ 06:45:47.042  [ 80905.241029s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:34980 target.addr=10.244.54.131:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast
Dec 2, 2020 @ 06:45:51.858  [    10.069981s]  WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.105:37338 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
Dec 2, 2020 @ 06:45:51.858  [    11.203328s]  WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.1:46326 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
Dec 2, 2020 @ 06:45:51.858  [    22.103038s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.33.169:49660 target.addr=10.244.59.53:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast
```

```
[ 4784.452349s] INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44614 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[ 5089.706325s] INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44636 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[ 5124.234815s] INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:46202 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)

[ 5318.467170s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:60420 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[ 5329.525274s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:59506 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[ 5331.837249s] INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:58566 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
```

#### `linkerd check` output

```
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √
```

Environment

  • Kubernetes Version: v1.18.8
  • Cluster Environment: Self hosted, Deployed via kubeadm
  • Host OS: Ubuntu 18.04.2 LTS
  • Linkerd version: stable-2.9.0 (deployed using helm chart)

Possible solution

Additional context

priority/P0

All 13 comments

We also noticed thousands of errors after upgrading to 2.9.0. They appeared when we redeployed pods with the newer proxy sidecar. There are no visible application errors, but a lot of proxy errors like this:

```
[ 75233.174384s]  INFO ThreadId(01) outbound:accept{peer.addr=10.X.X.X:50826 target.addr=10.X.X.X:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
```

5432 is a PostgreSQL port. We don't filter out Postgres ports via skip-outbound-ports, and that was not necessary with linkerd-proxy 2.7.1. Is it necessary now?

Update: there are some errors related to other non-HTTP services, not only Postgres.

Update 2: there are also many errors from HTTP communications. It looks like the proxy's log verbosity was changed, and we probably need to switch it to warnings/errors only.

Is your PostgreSQL service headless? If so, this sounds like https://github.com/linkerd/linkerd2/issues/5209. This issue caused problems when talking to headless services in stable-2.9.0. It has since been fixed and you can try out the latest edge release to see if it resolves the problem. The fix will be included in the stable-2.9.1 patch release.
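For context, a "headless" service is one whose `clusterIP` is set to `None`, so DNS resolves directly to the pod IPs rather than to a single virtual IP. A minimal sketch of such a manifest (the service name, selector, and port here are illustrative):

```yaml
# Hypothetical headless service: no cluster virtual IP is allocated,
# and DNS lookups return the endpoints' pod IPs directly.
apiVersion: v1
kind: Service
metadata:
  name: my-headless-svc
spec:
  clusterIP: None
  selector:
    app: my-app
  ports:
    - port: 5432
```

A service with a concrete ClusterIP assigned, by contrast, is non-headless.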

Thank you @adleong 👍 I went through the logs, and it seems we are seeing these messages for non-headless service targets as well.

```
[  5351.451279s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:48372 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
```

The target.addr (PostgreSQL) is a non-headless service.

```
Name:              my-svc-name
Namespace:         my-svc-namespace
Labels:            application=spilo
                   spilo-role=master
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.97.182.190
Port:              postgresql  5432/TCP
TargetPort:        5432/TCP
Endpoints:         10.244.24.108:5432
Session Affinity:  None
Events:            <none>
```

I will plan for an update once the stable-2.9.1 patch release is available and will keep you posted. Thanks! 🙂

Is your PostgreSQL service headless?

It's RDS; it doesn't live in Kubernetes. Meanwhile, I added RDS and some other ports to the skipped list. A Linkerd upgrade requires cluster downtime, so we cannot do this too often.
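For reference, bypassing the proxy for an out-of-cluster database is done with the `config.linkerd.io/skip-outbound-ports` annotation on the workload's pod template. A minimal sketch, using a hypothetical Deployment name and the standard PostgreSQL port:

```yaml
# Sketch: outbound connections to port 5432 bypass the linkerd-proxy
# entirely (configured at the iptables level by proxy-init).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "5432"
```

One trade-off to note: skipped traffic never touches the proxy, so it also loses Linkerd's mTLS and per-connection metrics.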

I am also seeing a good amount of these logs. I upgraded our staging environment to stable-2.9.1 in hopes of resolution, but to no avail.

Here is a sample of logs in one of our linkerd-proxy containers:

```
[515070.841395s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:43732 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[515185.917709s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45990 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[516061.897349s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:36508 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[516582.706765s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45688 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517068.129177s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:56128 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[517373.616350s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:32954 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517488.739828s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:35686 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Broken pipe (os error 32)
[517787.344259s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.119:60352 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517974.381304s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45394 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[518461.897379s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:33934 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[518703.636162s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.119:50810 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[519482.176746s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:46966 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
```

Our PostgreSQL instance is through RDS. The error that prompted me to start looking into this was a broken pipe error in our own services, and I'm wondering if the broken pipe log from the proxy is related. I have not been able to reproduce the error in our own services, so I haven't seen it in action.

Hopefully, knowing that this still occurs on stable-2.9.1 helps diagnose it.

I ran into these errors when deploying Linkerd for the first time recently.
I'm not seeing it mentioned here, but two other recent open issues relating to 2.9.1 are very similar to this issue, and supposedly it'll be corrected once https://github.com/linkerd/linkerd2-proxy/pull/782 is released.

Without a repro config we can apply, it's difficult to know whether our recent changes will impact this, but there are two recent changes which may have some bearing on this:

  • As @chicken231 noted, we've changed the default protocol detection timeout. This _may_ be a source of some of these errors, but it's difficult to know without more data.
  • Additionally, we've introduced a new config.linkerd.io/opaque-ports annotation, which configures the proxy to transport traffic as an opaque TCP stream, bypassing the normal HTTP detection logic. This is primarily intended for server-speaks-first protocols, but may also make sense in other non-HTTP use cases like Postgres.
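As a sketch of how that annotation might be applied (assuming workload-level annotation support; the Deployment name and port are illustrative):

```yaml
# Sketch: port 5432 is treated as an opaque TCP stream; traffic still
# flows through the proxy (keeping mTLS) but skips HTTP protocol detection.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-postgres
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/opaque-ports: "5432"
```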

Both of these changes will be available in the next edge release -- probably next week.

But, as I said, it's hard for us to know for sure without a configuration that we can test against.

Also, it's not clear to me from the above comments whether the problem is that the proxy is logging these messages, or whether they are manifesting as an application problem. That is, do we need to just stop logging these messages at the INFO level, or are we seeing unexpected connection issues that impact the application? If the latter is true, it would be especially useful to see pcap data (e.g. from ksniff) that illustrates the sequence of packets causing problems.

@olix0r With regard to my comment, the only times this seems to have manifested as an application error is the os error 32.

```
[517488.739828s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:35686 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Broken pipe (os error 32)
```

Whenever those pop up they align with the application errors (also broken pipe).

I am also seeing a lot of this kind of logs, like @prasus, in versions 2.9.1 and 2.9.0:

```
Connection closed error=connection error: Broken pipe (os error 32)
Connection closed error=connection error: Transport endpoint is not connected (os error 107)
```

After downgrading to 2.8.1, the logs are gone.

Thanks, that's helpful.

But, as mentioned above, we'll probably need more details to identify the cause of this error. Ideally, we'd have a minimal application that we can deploy to reproduce the problem. If that's not possible, it would be helpful to know more about:

  • The application protocol being used for these connections (HTTP? something else?);
  • The traffic pattern for such a connection (ideally, PCAP data from ksniff);
  • Detailed proxy logs (e.g. via config.linkerd.io/proxy-log-level: linkerd=trace,warn)

The broken pipe error just tells us that we're writing to a socket on which the reader has already closed. It's hard for us to know why this is occurring without more details about the application.
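For the third point above, the log-level annotation goes on the workload's pod template; the proxy picks up the new filter when the pod is redeployed. A minimal sketch with a hypothetical Deployment name:

```yaml
# Sketch: trace-level logs for linkerd-internal targets, warnings for the rest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-log-level: "linkerd=trace,warn"
```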

Hi @olix0r !

I suppose I've hit the same error using Zalando's postgres-operator (the _spilo_ label above suggests that).
If you have kind installed, or have a test cluster you can use in the default namespace, here are the steps to reproduce the error:

```
echo """# three node (two workers) cluster config
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
""" > kind-example-config.yaml

kind create cluster --config kind-example-config.yaml
linkerd install | kubectl apply -f -
kubectl wait --namespace linkerd --for=condition=ready pod --selector=linkerd.io/control-plane-component=controller --timeout=480s
kubectl get namespace default -o yaml | linkerd inject - | kubectl apply -f -

git clone https://github.com/zalando/postgres-operator.git
cd postgres-operator

kubectl create -f manifests/configmap.yaml
kubectl create -f manifests/operator-service-account-rbac.yaml
kubectl create -f manifests/postgres-operator.yaml
kubectl wait --for=condition=ready pod -l name=postgres-operator --timeout=480s
kubectl create -f manifests/minimal-postgres-manifest.yaml
kubectl wait --for=condition=ready pod -l statefulset.kubernetes.io/pod-name=acid-minimal-cluster-0 --timeout=480s

kubectl logs acid-minimal-cluster-0 -c postgres
```

And the output is plenty of lines such as:

```
2021-01-13 21:33:15,559 ERROR: Request to server https://172.18.0.4:6443 failed: MaxRetryError("HTTPSConnectionPool(host='172.18.0.4', port=6443): Max retries exceeded with url: /api/v1/namespaces/default/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dacid-minimal-cluster (Caused by ProtocolError('Connection aborted.', OSError(0, 'Error')))",)
```

The same happens with the latest _edge_ release too.

It doesn't happen with stable-2.8.1, but I suppose that's because of the mTLS functionality: injecting the namespace in that version doesn't automatically affect newly created pods.

Edit: forgot to add the repository cloning command...

Thanks, @ipaleka. We'll try to take a look at this before the next stable release.

You're welcome! Please don't hesitate to ask if I can help with some other task, like those logs you mentioned, or with anything else.
