Lots of timeouts on an Nginx service
I have an Nginx service that serves static files, plus a few locations using proxy_pass, and it fails with timeouts.
# Linkerd-proxy:
ERR! proxy={server=in listen=0.0.0.0:4143 remote=10.128.0.24:63489} linkerd2_proxy::proxy::http::router service error: an IO error occurred: Connection reset by peer (os error 104)
# nginx:
2018/12/19 18:35:22 [error] 9#9: *172617 upstream timed out (110: Operation timed out) while connecting to upstream, client: 127.0.0.1, server: _, request: "GET /cookie/bundle.js HTTP/1.1", upstream:
linkerd check output
kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-api: control plane namespace exists................................[ok]
linkerd-api: control plane pods are ready..................................[ok]
linkerd-api: can initialize the client.....................................[ok]
linkerd-api: can query the control plane API...............................[ok]
linkerd-api[kubernetes]: control plane can talk to Kubernetes..............[ok]
linkerd-api[prometheus]: control plane can talk to Prometheus..............[ok]
linkerd-api: no invalid service profiles...................................[ok]
linkerd-version: can determine the latest version..........................[ok]
linkerd-version: cli is up-to-date.........................................[ok]
linkerd-version: control plane is up-to-date...............................[ok]
Status check results are [ok]
@vic3lord Thanks for opening this. This sounds like a duplicate of #1537. Can you check out the remediation steps mentioned in that issue to see if they fix your setup?
@klingerf thanks for the quick response. I saw the error logs from #1537, and they are not the same as the ones I have. Plus, I don't use nginx-ingress in front of this service; it's a GLBC ingress, and the service itself is nginx.
EDIT: P.S. the fixes are not applicable since it's not an ingress.
@vic3lord Ah, ok, apologies for misreading it. It would be really helpful if you could provide a Kubernetes config that reproduces the issue that you're seeing when it's injected with the linkerd proxy. For instance, it could be a modified version of one of our test yaml files that includes an nginx frontend that serves static assets and uses proxy_pass. That will make it a lot easier for us to track down what's going on.
of course!
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  replicas: 3
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: cdn
  template:
    metadata:
      labels:
        app: cdn
    spec:
      containers:
      - name: cdn
        image: nginx:alpine
        volumeMounts:
        - name: vhost
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
        ports:
        - name: http
          containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 60
        resources:
          limits:
            cpu: 1
            memory: 512Mi
      volumes:
      - name: vhost
        configMap:
          name: cdn
---
apiVersion: v1
kind: Service
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  type: NodePort
  selector:
    app: cdn
  ports:
  - name: http
    port: 80
    targetPort: 80
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cdn
  namespace: default
data:
  nginx.conf: |+
    user nginx;
    worker_processes 1;
    error_log /var/log/nginx/error.log warn;
    pid /var/run/nginx.pid;
    events {
      worker_connections 1024;
    }
    http {
      include /etc/nginx/mime.types;
      # add extra types support
      types {
        font/ttf ttf;
        font/opentype otf;
        font/woff woff;
        font/woff2 woff2;
      }
      default_type application/octet-stream;
      log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
      access_log off;
      sendfile on;
      tcp_nopush on;
      keepalive_timeout 65;
      map $sent_http_content_type $expires {
        default off;
        text/html 1h;
        text/css max;
        application/javascript 1h;
        ~image/ max;
        ~font/ max;
      }
      server {
        listen 80;
        server_name _;
        gzip on;
        gzip_vary on;
        gzip_proxied any;
        gzip_types "*";
        location = /healthz {
          access_log off;
          return 200 "OK";
        }
        if ($request_method !~ "OPTIONS|GET|HEAD") {
          return 405;
        }
        location / {
          access_log off;
          return 200 "OK";
        }
        location /js/ {
          add_header Cache-Control "public,s-maxage=120,max-age=300";
          proxy_pass http://sdk.default.svc.cluster.local/js/;
        }
        location /js/assets/ {
          expires $expires;
          add_header Cache-Control "public";
          proxy_pass http://sdk.default.svc.cluster.local/js/assets/;
        }
        location /fonts/ {
          expires $expires;
          add_header Cache-Control "public";
          add_header Access-Control-Allow-Origin "*";
          proxy_pass http://fonts.default.svc.cluster.local/;
        }
        location /cookie/ {
          expires $expires;
          proxy_pass http://cookie-iframe.default.svc.cluster.local/;
        }
        location /img/ {
          expires $expires;
          add_header Cache-Control "public";
          proxy_pass http://imageflow.default.svc.cluster.local:3000/img/;
        }
      }
@vic3lord Thanks! That config doesn't apply in my env. The nginx pods exit with:
2018/12/21 18:42:50 [emerg] 1#1: unexpected end of file, expecting "}" in /etc/nginx/nginx.conf:94
nginx: [emerg] unexpected end of file, expecting "}" in /etc/nginx/nginx.conf:94
But I came up with a working nginx config that uses proxy_pass, and I can't replicate the timeout issue that you're seeing. Here's what I did:
Install the linkerd control plane
linkerd install | kubectl apply -f -
Inject and install the "hello" backend
linkerd inject hello.yml | kubectl apply -f -
hello.yml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: service
        image: buoyantio/helloworld:0.1.6
        args:
        - "-addr=:7777"
        - "-text=Hello"
        ports:
        - name: http
          containerPort: 7777
---
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  selector:
    app: hello
  clusterIP: None
  ports:
  - name: http
    port: 7777
Inject and install nginx
linkerd inject nginx.yml | kubectl apply -f -
nginx.yml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  replicas: 3
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: cdn
  template:
    metadata:
      labels:
        app: cdn
    spec:
      containers:
      - name: cdn
        image: nginx:alpine
        volumeMounts:
        - name: vhost
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
        ports:
        - name: http
          containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 60
        resources:
          limits:
            cpu: 1
            memory: 512Mi
      volumes:
      - name: vhost
        configMap:
          name: cdn
---
apiVersion: v1
kind: Service
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  type: NodePort
  selector:
    app: cdn
  ports:
  - name: http
    port: 80
    targetPort: 80
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cdn
  namespace: default
data:
  nginx.conf: |+
    user nginx;
    worker_processes 1;
    error_log /var/log/nginx/error.log warn;
    pid /var/run/nginx.pid;
    events {
      worker_connections 1024;
    }
    http {
      include /etc/nginx/mime.types;
      # add extra types support
      types {
        font/ttf ttf;
        font/opentype otf;
        font/woff woff;
        font/woff2 woff2;
      }
      default_type application/octet-stream;
      log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
      access_log off;
      sendfile on;
      tcp_nopush on;
      keepalive_timeout 65;
      map $sent_http_content_type $expires {
        default off;
        text/html 1h;
        text/css max;
        application/javascript 1h;
        ~image/ max;
        ~font/ max;
      }
      server {
        listen 80;
        server_name _;
        gzip on;
        gzip_vary on;
        gzip_proxied any;
        gzip_types "*";
        location = /healthz {
          access_log off;
          return 200 "OK";
        }
        if ($request_method !~ "OPTIONS|GET|HEAD") {
          return 405;
        }
        location / {
          access_log off;
          return 200 "OK";
        }
        location /hello/ {
          expires $expires;
          add_header Cache-Control "public";
          proxy_pass http://hello.default.svc.cluster.local:7777/;
        }
      }
    }
Port-forward to nginx
kubectl port-forward svc/cdn 8080:80
Curl nginx
$ curl localhost:8080
OK
Curl the hello service by way of nginx
$ curl localhost:8080/hello/
Hello!
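Optionally, check the proxy stats (one way to confirm requests are actually flowing through the injected proxies; linkerd stat reports per-deployment success rate, RPS, and latency)
$ linkerd stat deploy -n default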
And that all works for me. Can you try it in your environment?
Hi @klingerf, sorry, I just missed a brace when copying. This service has been running in production for the past 254 days at 1k rps without linkerd.
The problem with the timeouts is that they're inconsistent and only show up after the proxy has been running for a few hours. I saw someone opened another issue about something similar, a memory leak in the proxy container; I think #2012 is related.
It happens only under high traffic, which is why I couldn't replicate it on staging. I've been running linkerd in staging for the past month, and only after verifying that everything worked did I move to production; that's when I found all sorts of issues...
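For anyone trying to reproduce this under load, a sketch using the hey load generator (an example only; the URL and numbers here are made up, our real traffic is around 1k rps):
hey -z 5m -c 100 -q 10 http://cdn.staging.example.com/cookie/bundle.js   # 100 workers x 10 qps = ~1000 rps (example values)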
Thanks again for your help, LMK if you need anything else on my end.
@klingerf I have seen behavior similar to what @vic3lord described: a few hours after injecting linkerd2 into nginx, memory was around 3+ and CPU at 100%; it crashed multiple times and then finally stopped working.
We removed linkerd2 from nginx as a temporary fix. Any solution for this issue would help us.
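Removing it was just a matter of re-applying the original, un-injected manifests so the pods roll back without the sidecar, roughly:
kubectl apply -f nginx.yml   # the original manifest, never passed through linkerd inject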
@vic3lord @etsrepo Thanks for the additional details. I didn't realize from reading the initial description of this issue that the timeouts only happen after a few hours of high traffic. I agree that this sounds similar to the reports in #2012.
The fix for #2012 was shipped with the edge-19.1.1 release. @vic3lord, @etsrepo, can you try upgrading the linkerd proxies in your nginx setups to see if that fixes this issue?
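A sketch of one way to do the upgrade, assuming you install the edge CLI and then re-inject your manifests (please verify the installer URL and exact steps against the edge-19.1.1 release notes):
# install the edge CLI (installer URL assumed; check the release notes)
curl -sL https://run.linkerd.io/install-edge | sh
export PATH=$PATH:$HOME/.linkerd2/bin
# re-inject so the pods restart with the newer proxy
linkerd inject nginx.yml | kubectl apply -f -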
I injected it into a few services; I'll monitor closely for the next few days and close the issue if everything is fine.
@klingerf I won't be able to test the nginx deployment; we had already removed linkerd from it.
I will update if we can inject again.
Hi @klingerf,
I've had the proxy injected into a few nginx services for a few days, and everything seems to work fine except for one service. That one is not an "internal" service: it serves users from our GCE ingress and gets more traffic than the others.
This is the error I get from nginx after injecting:
1024 worker_connections are not enough
And from the linkerd-proxy of this pod:
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("sdk.default.svc.cluster.local")), port: 80 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
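As a side note on the worker_connections error: one mitigation is raising the cap in the events block of nginx.conf; the value below is an arbitrary example, not a tested recommendation for this workload:
events {
    # default is 1024; meshed traffic can exhaust it under load
    worker_connections 10240;   # example value, tune for your workload
}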
@vic3lord Thanks for checking it out and reporting back! The new issue that you're seeing is described in #2118. Please watch that issue for a fix, and I'll close this one out in the meantime.