Linkerd2: Intermittent 502 Bad Gateway issue when service is meshed

Created on 12 Aug 2020 · 23 Comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

Without the linkerd proxy, there is no trace of 502 Bad Gateway errors at the ingress level or the app level.
With the linkerd proxy enabled on the nginx ingress and/or the app, intermittent 502 Bad Gateway errors appear in the system.

I see 2 types of error from the proxy:

## for this one the requests actually made it through to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection closed before message completed

and

## for this one the requests never made it to the upstream
linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104)

both lead to (on ingress)

"GET /v1/users?ids=xxxxxxx,xxxxx HTTP/1.1" 502 0 "-" "python-requests/2.23.0" 1708 0.097 [xxxxx-app-http] [] 10.4.27.106:80 0 0.100 502 7543fb48ce22d5c6145b97daadde93d9

or (on an app)

## Here I suspect a connection timeout that could be tuned to allow the proxy more time to establish the connection. I managed to support that theory by giving more resources to the datadog agent DaemonSet

Failed to send traces to Datadog Agent at http://10.4.87.60:8126: HTTP error status 502, reason Bad Gateway, message content-length: 0
date: Wed, 12 Aug 2020 17:58:22 GMT

I have ruled out the scale-up and scale-down events of the ingress and app. I also ruled out graceful termination, for which I configured a sleep on the proxy of 30s on the app and 40s on the ingress.

How can it be reproduced?

Logs, error output, etc

Here are some linkerd-debug logs, where 10.3.45.50 is the ingress controller pod and 10.3.23.252 is the upstream server pod:

1829200 2893.254017273    127.0.0.1 → 127.0.0.1    TCP 68 443 → 57590 [ACK] Seq=3277 Ack=1894 Win=175744 Len=0 TSval=378310645 TSecr=378310644
1829201 2893.254172427   10.3.45.50 → 127.0.0.1    HTTP 1701 GET /v1/users/xxxxxxxx?include=active_contract HTTP/1.1 
1829202 2893.254471803   10.3.45.50 → 10.3.23.252  TCP 76 47414 → 80 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2090260178 TSecr=0 WS=128
1829203 2893.254837416  10.3.23.252 → 10.3.45.50   TCP 56 80 → 47414 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
1829204 2893.254974304  10.3.23.252 → 10.3.45.50   HTTP 152 HTTP/1.1 502 Bad Gateway 

I've captured similar traces with tcpdump via the ksniff kubectl plugin.
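As a side note on reading that capture: the RST,ACK answering the SYN in frame 1829203 is the kernel-level signature of a refused or reset connect, which the proxy then surfaces as "Connection reset by peer (os error 104)". A trivial, hypothetical Python illustration (it assumes nothing is listening on loopback port 1):

```python
# Connecting to a closed port elicits the same SYN -> RST,ACK exchange seen
# in the capture; the kernel surfaces it as a connection-refused error.
import socket

s = socket.socket()
s.settimeout(2)
try:
    s.connect(("127.0.0.1", 1))  # assumption: nothing listens on port 1
    outcome = "connected"
except ConnectionRefusedError:
    outcome = "refused"          # peer answered the SYN with RST,ACK
finally:
    s.close()
print(outcome)
```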

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2020-08-13T08:39:27Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists

Environment

  • Kubernetes Version: 1.16, 1.17
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2 (eks optimized ami)
  • Linkerd version: 2.8.1
  • nginx ingress version: 0.34.1
  • upstream app: python 3.7 with hypercorn server

nginx ingress service annotations

    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    service.beta.kubernetes.io/aws-load-balancer-type: elb

nginx ingress deployment annotation

    config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
    linkerd.io/inject: enabled

Possible solution

Tuning all timeouts per app/service with proxy annotations, and also globally via the chart deployment's global configuration:

  • keep-alive
  • connection
  • read timeout (I didn't find this one in linkerd, but nginx has something similar)

Additional context

area/proxy needs/repro

All 23 comments

@mbelang To confirm what we discussed on slack, you added CPU resource requests to your nginx ingress controller's proxy and there were no CPU spikes on the proxy when the connection timeout happened. Then you tried increasing the outbound connection timeout on the proxy of the nginx ingress by changing the LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT env var per https://github.com/linkerd/linkerd2/pull/4759, but to no avail.

Also the TCP dump shows the 502 coming back from the datadog agent, which is an uninjected workload running on the same cluster. No error logs on the upstream datadog agent.

Here I suspect a connection timeout that could be tuned to allow the proxy more time to establish the connection. I managed to support that theory by giving more resources to the datadog agent DaemonSet

Does that mean the proxy sees fewer connection timeouts after you give more resources to the datadog agent?

Let me know if my understanding is correct.

To confirm what we discussed on slack, you added CPU resource requests to your nginx ingress controller's proxy and there were no CPU spikes on the proxy when the connection timeout happened

I'm not sure where I can see the specific metric for the proxy container. I didn't check that, but from what I've seen so far, I get some 502s while the CPU on the ingress pod stays completely flat.

Also the TCP dump shows the 502 coming back from the datadog agent

The TCP dump above is from the ingress controller trying to contact the upstream app, not the datadog agent, but I did get a trace for the datadog agent and it is exactly the same.

by changing the LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT env var per

I did play with that, trying different values from 5s to 15s. No luck. I also suspected the keep-alive, which I tried lowering to 4s and increasing to 90s, without any luck either.
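For reference, the override from that PR is just an env var on the injected proxy container; a sketch of what the patch looked like (the value shown is one of several I tried, using the `ms` form the other proxy env vars in the manifest use):

```yaml
# Sketch: env override on the linkerd-proxy container of the ingress
# Deployment, per linkerd/linkerd2#4759.
- name: LINKERD2_PROXY_OUTBOUND_CONNECT_TIMEOUT
  value: 10000ms
```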

Does that mean the proxy sees fewer connection timeouts after you give more resources to the datadog agent?

Yes

Is there a way to set the read timeout? https://github.com/linkerd/linkerd2-proxy/blob/13b5fd65da6999f1d3d4d166983af8d54034d6e4/linkerd/app/integration/src/tcp.rs#L165 I didn't manage to see where that function is used or what its default value is.

If you see above, I have 2 problems:
1) connection reset by peer (the request never made it to the upstream service)
2) connection closed before message completed (I managed to find that the request actually made it through to the upstream service, but the connection was cut, by the proxy I imagine)

For 1) I'm trying to blame keep-alive or connection timeouts, but no luck.
For 2) I'm trying to blame the read timeout, but I have no proof.

So sounds like the errors are only seen on the nginx ingress controller outbound side? Do you have a minimum set of YAML that we can use for repro? Thanks.

So far yes, but I do have problems with a meshed app trying to reach the datadog agent. I only have the ingress and 1 app meshed in the production environment. So far, for the app, there are no 502s on requests to other apps, which is good.

The biggest problem now is the ingress.

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "14"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment",...}
    meta.helm.sh/release-name: ingress
    meta.helm.sh/release-namespace: ingress
  generation: 7161
  labels:
    app: nginx-ingress
    app.kubernetes.io/component: controller
    app.kubernetes.io/managed-by: Helm
    chart: nginx-ingress-1.39.0
    heritage: Helm
    release: ingress
  name: ingress-nginx-ingress-controller
  namespace: ingress
  resourceVersion: "62830529"
  selfLink: /apis/apps/v1/namespaces/ingress/deployments/ingress-nginx-ingress-controller
  uid: c041898e-78dd-11ea-ad31-0e9b9c5b4912
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-ingress
      release: ingress
  strategy:
    rollingUpdate:
      maxSurge: 33%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        ad.datadoghq.com/nginx-ingress-controller.check_names: '["nginx_ingress_controller"]'
        ad.datadoghq.com/nginx-ingress-controller.init_configs: '[{}]'
        ad.datadoghq.com/nginx-ingress-controller.instances: '[{"prometheus_url":
          "http://%%host%%:10254/metrics"}]'
        config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "40"
        kubectl.kubernetes.io/restartedAt: "2020-08-04T16:10:44-04:00"
        linkerd.io/created-by: linkerd/cli stable-2.8.1
        linkerd.io/identity-mode: default
        linkerd.io/proxy-version: stable-2.8.1
      labels:
        app: nginx-ingress
        app.kubernetes.io/component: controller
        component: controller
        linkerd.io/control-plane-ns: linkerd
        linkerd.io/proxy-deployment: ingress-nginx-ingress-controller
        linkerd.io/workload-ns: ingress
        release: ingress
    spec:
      containers:
      - args:
        - /nginx-ingress-controller
        - --default-backend-service=ingress/ingress-nginx-ingress-default-backend
        - --election-id=ingress-controller-leader
        - --ingress-class=nginx
        - --configmap=ingress/ingress-nginx-ingress-controller
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: nginx-ingress-controller
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          name: https
          protocol: TCP
        - containerPort: 10254
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 150m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: true
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          runAsUser: 101
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - env:
        - name: LINKERD2_PROXY_LOG
          value: warn,linkerd=info
        - name: LINKERD2_PROXY_DESTINATION_SVC_ADDR
          value: linkerd-dst.linkerd.svc.cluster.local:8086
        - name: LINKERD2_PROXY_DESTINATION_GET_NETWORKS
          value: 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
        - name: LINKERD2_PROXY_CONTROL_LISTEN_ADDR
          value: 0.0.0.0:4190
        - name: LINKERD2_PROXY_ADMIN_LISTEN_ADDR
          value: 0.0.0.0:4191
        - name: LINKERD2_PROXY_OUTBOUND_LISTEN_ADDR
          value: 127.0.0.1:4140
        - name: LINKERD2_PROXY_INBOUND_LISTEN_ADDR
          value: 0.0.0.0:4143
        - name: LINKERD2_PROXY_DESTINATION_GET_SUFFIXES
          value: svc.cluster.local.
        - name: LINKERD2_PROXY_DESTINATION_PROFILE_SUFFIXES
          value: svc.cluster.local.
        - name: LINKERD2_PROXY_INBOUND_ACCEPT_KEEPALIVE
          value: 10000ms
        - name: LINKERD2_PROXY_OUTBOUND_CONNECT_KEEPALIVE
          value: 90000ms
        - name: _pod_ns
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LINKERD2_PROXY_DESTINATION_CONTEXT
          value: ns:$(_pod_ns)
        - name: LINKERD2_PROXY_IDENTITY_DIR
          value: /var/run/linkerd/identity/end-entity
        - name: LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS
          value: |
            -----BEGIN CERTIFICATE-----
            REDACTED
            -----END CERTIFICATE-----
        - name: LINKERD2_PROXY_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/kubernetes.io/serviceaccount/token
        - name: LINKERD2_PROXY_IDENTITY_SVC_ADDR
          value: linkerd-identity.linkerd.svc.cluster.local:8080
        - name: _pod_sa
          valueFrom:
            fieldRef:
              fieldPath: spec.serviceAccountName
        - name: _l5d_ns
          value: linkerd
        - name: _l5d_trustdomain
          value: cluster.local
        - name: LINKERD2_PROXY_IDENTITY_LOCAL_NAME
          value: $(_pod_sa).$(_pod_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_IDENTITY_SVC_NAME
          value: linkerd-identity.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_DESTINATION_SVC_NAME
          value: linkerd-destination.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        - name: LINKERD2_PROXY_TAP_SVC_NAME
          value: linkerd-tap.$(_l5d_ns).serviceaccount.identity.$(_l5d_ns).$(_l5d_trustdomain)
        image: gcr.io/linkerd-io/proxy:stable-2.8.1
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - sleep 40
        livenessProbe:
          httpGet:
            path: /live
            port: 4191
          initialDelaySeconds: 10
        name: linkerd-proxy
        ports:
        - containerPort: 4143
          name: linkerd-proxy
        - containerPort: 4191
          name: linkerd-admin
        readinessProbe:
          httpGet:
            path: /ready
            port: 4191
          initialDelaySeconds: 2
        resources:
          limits:
            memory: 250Mi
          requests:
            cpu: 100m
            memory: 20Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsUser: 2102
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /var/run/linkerd/identity/end-entity
          name: linkerd-identity-end-entity
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --incoming-proxy-port
        - "4143"
        - --outgoing-proxy-port
        - "4140"
        - --proxy-uid
        - "2102"
        - --inbound-ports-to-ignore
        - 4190,4191
        image: gcr.io/linkerd-io/proxy-init:v1.3.3
        imagePullPolicy: IfNotPresent
        name: linkerd-init
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 10Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW
            - NET_BIND_SERVICE
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsNonRoot: false
          runAsUser: 0
        terminationMessagePolicy: FallbackToLogsOnError
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-nginx-ingress
      serviceAccountName: ingress-nginx-ingress
      terminationGracePeriodSeconds: 60
      volumes:
      - emptyDir:
          medium: Memory
        name: linkerd-identity-end-entity
status:
  availableReplicas: 3
  conditions:
  - message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - message: ReplicaSet "ingress-nginx-ingress-controller-59fd9b7b85" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 7161
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
---

but I do have problems with a meshed app trying to reach out to datadog agent.

To help further narrow down the repro steps, do these 502s happen only when the datadog agent is the target service?

I just saw this: https://github.com/hyperium/hyper/issues/2136.

I imagine the linkerd proxy uses that lib, right? According to them it is a keep-alive problem, and the fix is to set the client's keep-alive lower than the upstream's.

My upstream keep-alive timeout is 5s, so I set the proxy's to 2s... No luck so far. I'm going to try and put the proxy at 0ms for the outbound keep-alive timeout so that a new connection is used all the time.
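The race described in hyperium/hyper#2136 can be seen outside the mesh too. A minimal Python sketch (a toy server stands in for the upstream, and the race is made deterministic by closing every connection after one response) shows why a reused keep-alive connection turns into "connection reset" / "connection closed before message completed":

```python
# Sketch of the keep-alive reuse race: the client reuses an idle pooled
# connection at the same moment the server closes it. The toy server below
# closes every connection after one response, so the client's second request
# on the same socket deterministically fails.
import http.client
import socket
import threading

def serve_once_per_conn(srv: socket.socket) -> None:
    while True:
        try:
            conn, _ = srv.accept()
        except OSError:
            return                             # server socket closed
        conn.recv(65536)                       # read the request
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()                           # server-side close: the "race"

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen()
port = srv.getsockname()[1]
threading.Thread(target=serve_once_per_conn, args=(srv,), daemon=True).start()

c = http.client.HTTPConnection("127.0.0.1", port)
c.request("GET", "/")
first = c.getresponse().read()                 # first request succeeds
try:
    c.request("GET", "/")                      # reuse the now-dead connection
    c.getresponse()
    second = "ok"
except (http.client.RemoteDisconnected, ConnectionResetError, BrokenPipeError):
    second = "reset"                           # what a proxy surfaces as a 502
srv.close()
print(first, second)
```

A retry on the client side is safe here only for idempotent requests, which is why the nginx retry mitigation above is limited to GETs.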

I mitigate all 502s on GETs with the nginx retry mechanism. I could also do it for non-idempotent requests, but that is a bit dangerous ATM.
I have fewer problems now, but I'd still like to fix/understand what is going wrong with the linkerd proxy.

@ihcsim here is an extract of an ingress resource for a test application

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    annotations:
      acme.cert-manager.io/http01-edit-in-place: "true"
      acme.cert-manager.io/http01-ingress-class: "true"
      cert-manager.io/cluster-issuer: letsencrypt
      certmanager.k8s.io/acme-challenge-type: dns01
      certmanager.k8s.io/acme-dns01-provider: route53
      certmanager.k8s.io/cluster-issuer: letsencrypt
      external-dns.alpha.kubernetes.io/target: REDACTED.
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
      meta.helm.sh/release-name: hello-k8s
      meta.helm.sh/release-namespace: hello-k8s
      nginx.ingress.kubernetes.io/configuration-snippet: |
        proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
        grpc_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:$service_port;
      nginx.ingress.kubernetes.io/proxy-next-upstream: error timeout http_502 non_idempotent
      nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: 30s
      nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
    creationTimestamp: "2020-04-08T13:45:30Z"
    generation: 1
    labels:
      app: hello-k8s
      app.kubernetes.io/instance: hello-k8s
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: app
      branch-slug: master
      cci-build-number: "3877"
      cci-workflow-id: 1a3b6a13-6400-41e6-a9a0-3eafa8d420d8
      component: app
      helm.sh/chart: app-2.6.0
      place: ca
      pr-number: ""
      sha: 89e64f20862883f79fe25347958459076e281f4d
      short-sha: 89e64f2
      stage: prod
      tag: v0.19.2
      version: v0.19.2
    name: app
    namespace: hello-k8s
    resourceVersion: "63104678"
    selfLink: /apis/extensions/v1beta1/namespaces/hello-k8s/ingresses/app
    uid: 3639f2c6-799f-11ea-ad31-0e9b9c5b4912
  spec:
    rules:
    - host: hello-k8s.example.com
      http:
        paths:
        - backend:
            serviceName: app
            servicePort: http
    tls:
    - hosts:
      - hello-k8s.example.com
      secretName: hello-k8s.example.com-tls
  status:
    loadBalancer:
      ingress:
      - ip: x.x.x.x
      - ip: x.x.x.x
      - ip: x.x.x.x
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

I'm going to try and put the proxy at 0ms for the outbound keep-alive timeout so that a new connection is used all the time.

I discovered that 0ms is not supported.

I also tried raising the keep-alive timeout to 90s (higher than the nginx outbound timeout of 60s), without any luck either.

@mbelang and I had a chance to talk through this issue in Slack this morning. I think we have a good enough handle on it to put together a repro setup:

  • nginx ingress, injected with proxy
  • app with python HTTP server, uninjected

Then, we should try putting consistent load on the ingress. Ideally, we'd test this all on EKS with the latest AWS CNI, as it seems plausible that it's a bad interaction at the network layer.

If we can reproduce this with this kind of setup, then I think it should be pretty straightforward to diagnose/fix. If we can't, we can start digging into more details about how this repro setup differs from @mbelang's actual system.
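For the uninjected Python app in that setup, a minimal stand-in could look like this (a sketch; the real app runs hypercorn, so its keep-alive behavior will differ):

```python
# Minimal keep-alive HTTP/1.1 server to stand in for the upstream app
# when putting consistent load on the injected nginx ingress.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enable keep-alive, like the real app

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```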

@mbelang reports that this problem goes away when all pods are meshed; so this points strongly to the HTTP/1.1 client.

This is currently breaking DNS in our cluster for anything that tries to use TCP, including the linkerd-proxy instances themselves. Here's an example `curl` from a non-meshed pod showing the port opens fine (a normal empty-reply hangup after a while):

```
curl -vvv 172.20.0.10:53
* Expire in 0 ms for 6 (transfer 0x5615dec03f50)
* Trying 172.20.0.10...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5615dec03f50)
* Connected to 172.20.0.10 (172.20.0.10) port 53 (#0)
> GET / HTTP/1.1
> Host: 172.20.0.10:53
> User-Agent: curl/7.64.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host 172.20.0.10 left intact
curl: (52) Empty reply from server
```

A meshed pod gets bad gateways:

```
curl -vvv 172.20.0.10:53
* Rebuilt URL to: 172.20.0.10:53/
* Trying 172.20.0.10...
* TCP_NODELAY set
* Connected to 172.20.0.10 (172.20.0.10) port 53 (#0)
> GET / HTTP/1.1
> Host: 172.20.0.10:53
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 502 Bad Gateway
< content-length: 0
< date: Thu, 17 Sep 2020 22:24:10 GMT
<
* Curl_http_done: called premature == 0
* Connection #0 to host 172.20.0.10 left intact
```

Meshing DNS isn't an option for us. The environment is edge-20.9.1 running Kubernetes 1.17 on AWS EKS. Direct port forwards to the DNS pods all work, and talking directly to services works, but interestingly this seems to be preventing the proxy itself from establishing identities:

```
linkerd-proxy [149249.354645611s] WARN ThreadId(01) trust_dns_proto::xfer::dns_exchange: io_stream hit an error, shutting down: io error: Connection reset by peer (os error 104)
linkerd-proxy [149253.519311877s] WARN ThreadId(01) trust_dns_proto::xfer::dns_exchange: io_stream hit an error, shutting down: io error: Connection reset by peer (os error 104)
```
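If DNS cannot be meshed, one possible workaround (a sketch, assuming Linkerd's standard proxy-init annotations; the port list may need adjusting for your setup) is to bypass the proxy for DNS traffic entirely with `config.linkerd.io/skip-outbound-ports` on the client workloads:

```yaml
# Hypothetical pod template snippet: tells linkerd-init not to
# redirect outbound traffic on port 53 through the proxy.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "53"
```

Note that skipped ports lose the proxy's mTLS and metrics for that traffic, so this trades observability for reachability.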

@olix0r I still notice the occasional `WARN inbound:accept{peer.addr=10.4.25.103:55298}:source{target.addr=10.4.25.215:80}: linkerd2_app_core::errors: Failed to proxy request: connection error: Connection reset by peer (os error 104)`. I'm not yet sure whether it is the same problem, but we haven't changed anything in our configs since I meshed all the apps.

@steve-gray We think what you are seeing is closer to https://github.com/linkerd/linkerd2/issues/4831. There is some ongoing investigation per https://github.com/linkerd/linkerd2/issues/4831#issuecomment-678460148. Feel free to subscribe to that issue.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Even if all my services are meshed, I still get intermittent 502s. I can mitigate with retries at the client level, but that isn't a viable long-term solution. I could also set up retries in the mesh itself, but Linkerd doesn't support retries for requests with a body...
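For reference, Linkerd's mesh-level retries are configured per route on a ServiceProfile. A minimal sketch (the service name, namespace, and route are placeholders, not from this thread):

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # ServiceProfile names must match the service's FQDN
  name: users-app.prod.svc.cluster.local
  namespace: prod
spec:
  routes:
  - name: GET /v1/users
    condition:
      method: GET
      pathRegex: /v1/users
    # only safe for idempotent requests; as noted above,
    # retries are not supported for requests with a body
    isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```

The `retryBudget` caps retry load so a flapping upstream doesn't get amplified traffic.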

Hi @mbelang, have you been able to reproduce this with 2.9.1?

You're right that retries only work for GET requests at the moment. There is an open issue (#3985) that we'd love help with, if you're interested.

@cpretzer I have yet to update to 2.9.1. Were there any fixes or improvements to the proxy related to this issue?

I'm not confident that retries would help in this case. It really depends on where we're encountering the issue, but I don't think we have enough data to know yet.

There were substantial changes between 2.8 and 2.9, especially around caching and discovery (for instance, there's no longer any DNS resolution in the data path). It would be good to test this more recent version, if only to ensure that the problem does not persist -- even if we are able to identify the underlying cause, we're unlikely to backport fixes onto 2.8.

If the issue persists, it would be helpful to at least get debug logs from both the client and server proxies, via `config.linkerd.io/proxy-log-level: linkerd=debug,warn`
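Concretely, that annotation goes on the workload's pod template so the injected proxy picks it up on the next rollout (a sketch; only the annotation itself comes from this thread):

```yaml
# Pod template fragment for the meshed Deployment/DaemonSet
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-log-level: linkerd=debug,warn
```

Changing the annotation restarts the pods, so apply it to both the client and server workloads before trying to reproduce the 502s.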

@olix0r I know that retries would not solve the issue but would at least _mitigate_. I will plan an upgrade to 2.9.1 and see how it goes from there.

I did try putting the proxy in debug mode, but I didn't manage to get any more information than what I posted here. Maybe I missed it, but it is a fairly rare event that is very hard to catch, and I don't want to debug this in the production cluster, though I suspect that the elasticity of the production cluster could have an impact on the issue.

I came across this thread because I was also running into an issue with using Linkerd and the Datadog agent together. In my setup, Datadog is installed as a daemonset using this Helm chart, so it is not meshed.

I get similar errors as described above. These logs are coming from the Go Datadog client:

2021/01/14 00:19:20 Datadog Tracer v1.27.0 ERROR: lost 2 traces: Bad Gateway, 11 additional messages skipped (first occurrence: 14 Jan 21 00:18 UTC)

If I disable linkerd, then I no longer have any communications issues with the Datadog agent.

I ended up meshing the Datadog pods as well, and that eliminated the 502s from apps to the Datadog agent. I do still see some 502s from the Datadog agent to the linkerd-proxy metrics endpoint, and I suspect this is why some metrics in Datadog are missing requests. I haven't had the chance to upgrade to 2.9.1 yet, but I will soon.

@olix0r any reason you tagged the issue for 2.10 release?

@mbelang

any reason you tagged the issue for 2.10 release?

I want to make sure that we take a deeper look at issues like this before we cut another stable release.
