Lots of timeouts on an Nginx service
I have an Nginx service that serves static files, plus a few locations using proxy_pass, and it fails with timeouts.
# Linkerd-proxy:
ERR! proxy={server=in listen=0.0.0.0:4143 remote=10.128.0.24:63489} linkerd2_proxy::proxy::http::router service error: an IO error occurred: Connection reset by peer (os error 104)
# nginx:
2018/12/19 18:35:22 [error] 9#9: *172617 upstream timed out (110: Operation timed out) while connecting to upstream, client: 127.0.0.1, server: _, request: "GET /cookie/bundle.js HTTP/1.1", upstream:
linkerd check output
kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-api: control plane namespace exists................................[ok]
linkerd-api: control plane pods are ready..................................[ok]
linkerd-api: can initialize the client.....................................[ok]
linkerd-api: can query the control plane API...............................[ok]
linkerd-api[kubernetes]: control plane can talk to Kubernetes..............[ok]
linkerd-api[prometheus]: control plane can talk to Prometheus..............[ok]
linkerd-api: no invalid service profiles...................................[ok]
linkerd-version: can determine the latest version..........................[ok]
linkerd-version: cli is up-to-date.........................................[ok]
linkerd-version: control plane is up-to-date...............................[ok]
Status check results are [ok]
@vic3lord Thanks for opening this. This sounds like a duplicate of #1537. Can you check out the remediation steps mentioned in that issue to see if they fix your setup?
@klingerf thanks for the quick response. I saw the error logs from #1537, and they are not the same as the ones I have. Plus, I don't use nginx-ingress in front of this service; it's a GLBC ingress, and the service itself is nginx.
EDIT: P.S. the fixes are not applicable since it's not an ingress.
@vic3lord Ah, ok, apologies for misreading it. It would be really helpful if you could provide a Kubernetes config that reproduces the issue that you're seeing when it's injected with the linkerd proxy. For instance, it could be a modified version of one of our test yaml files that includes an nginx frontend that serves static assets and uses proxy_pass. That will make it a lot easier for us to track down what's going on.
of course!
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  replicas: 3
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: cdn
  template:
    metadata:
      labels:
        app: cdn
    spec:
      containers:
      - name: cdn
        image: nginx:alpine
        volumeMounts:
        - name: vhost
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
        ports:
        - name: http
          containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 60
        resources:
          limits:
            cpu: 1
            memory: 512Mi
      volumes:
      - name: vhost
        configMap:
          name: cdn
---
apiVersion: v1
kind: Service
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  type: NodePort
  selector:
    app: cdn
  ports:
  - name: http
    port: 80
    targetPort: 80
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cdn
  namespace: default
data:
  nginx.conf: |+
    user nginx;
    worker_processes 1;
    error_log /var/log/nginx/error.log warn;
    pid /var/run/nginx.pid;
    events {
      worker_connections 1024;
    }
    http {
      include /etc/nginx/mime.types;
      # add extra types support
      types {
        font/ttf ttf;
        font/opentype otf;
        font/woff woff;
        font/woff2 woff2;
      }
      default_type application/octet-stream;
      log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
      access_log off;
      sendfile on;
      tcp_nopush on;
      keepalive_timeout 65;
      map $sent_http_content_type $expires {
        default off;
        text/html 1h;
        text/css max;
        application/javascript 1h;
        ~image/ max;
        ~font/ max;
      }
      server {
        listen 80;
        server_name _;
        gzip on;
        gzip_vary on;
        gzip_proxied any;
        gzip_types "*";
        location = /healthz {
          access_log off;
          return 200 "OK";
        }
        if ($request_method !~ "OPTIONS|GET|HEAD") {
          return 405;
        }
        location / {
          access_log off;
          return 200 "OK";
        }
        location /js/ {
          add_header Cache-Control "public,s-maxage=120,max-age=300";
          proxy_pass http://sdk.default.svc.cluster.local/js/;
        }
        location /js/assets/ {
          expires $expires;
          add_header Cache-Control "public";
          proxy_pass http://sdk.default.svc.cluster.local/js/assets/;
        }
        location /fonts/ {
          expires $expires;
          add_header Cache-Control "public";
          add_header Access-Control-Allow-Origin "*";
          proxy_pass http://fonts.default.svc.cluster.local/;
        }
        location /cookie/ {
          expires $expires;
          proxy_pass http://cookie-iframe.default.svc.cluster.local/;
        }
        location /img/ {
          expires $expires;
          add_header Cache-Control "public";
          proxy_pass http://imageflow.default.svc.cluster.local:3000/img/;
        }
      }
@vic3lord Thanks! That config doesn't apply in my env. The nginx pods exit with:
2018/12/21 18:42:50 [emerg] 1#1: unexpected end of file, expecting "}" in /etc/nginx/nginx.conf:94
nginx: [emerg] unexpected end of file, expecting "}" in /etc/nginx/nginx.conf:94
But I came up with a working nginx config that uses proxy_pass, and I can't replicate the timeout issue that you're seeing. Here's what I did:
Install the linkerd control plane
linkerd install | kubectl apply -f -
Inject and install the "hello" backend
linkerd inject hello.yml | kubectl apply -f -
hello.yml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: service
        image: buoyantio/helloworld:0.1.6
        args:
        - "-addr=:7777"
        - "-text=Hello"
        ports:
        - name: http
          containerPort: 7777
---
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  selector:
    app: hello
  clusterIP: None
  ports:
  - name: http
    port: 7777
Inject and install nginx
linkerd inject nginx.yml | kubectl apply -f -
nginx.yml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  replicas: 3
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: cdn
  template:
    metadata:
      labels:
        app: cdn
    spec:
      containers:
      - name: cdn
        image: nginx:alpine
        volumeMounts:
        - name: vhost
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
        ports:
        - name: http
          containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 60
        resources:
          limits:
            cpu: 1
            memory: 512Mi
      volumes:
      - name: vhost
        configMap:
          name: cdn
---
apiVersion: v1
kind: Service
metadata:
  name: cdn
  namespace: default
  labels:
    app: cdn
spec:
  type: NodePort
  selector:
    app: cdn
  ports:
  - name: http
    port: 80
    targetPort: 80
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cdn
  namespace: default
data:
  nginx.conf: |+
    user nginx;
    worker_processes 1;
    error_log /var/log/nginx/error.log warn;
    pid /var/run/nginx.pid;
    events {
      worker_connections 1024;
    }
    http {
      include /etc/nginx/mime.types;
      # add extra types support
      types {
        font/ttf ttf;
        font/opentype otf;
        font/woff woff;
        font/woff2 woff2;
      }
      default_type application/octet-stream;
      log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
      access_log off;
      sendfile on;
      tcp_nopush on;
      keepalive_timeout 65;
      map $sent_http_content_type $expires {
        default off;
        text/html 1h;
        text/css max;
        application/javascript 1h;
        ~image/ max;
        ~font/ max;
      }
      server {
        listen 80;
        server_name _;
        gzip on;
        gzip_vary on;
        gzip_proxied any;
        gzip_types "*";
        location = /healthz {
          access_log off;
          return 200 "OK";
        }
        if ($request_method !~ "OPTIONS|GET|HEAD") {
          return 405;
        }
        location / {
          access_log off;
          return 200 "OK";
        }
        location /hello/ {
          expires $expires;
          add_header Cache-Control "public";
          proxy_pass http://hello.default.svc.cluster.local:7777/;
        }
      }
    }
Port-forward to nginx
kubectl port-forward svc/cdn 8080:80
Curl nginx
$ curl localhost:8080
OK
Curl the hello service by way of nginx
$ curl localhost:8080/hello/
Hello!
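Optionally, check the proxy stats (one way to confirm requests are actually flowing through the injected proxies; linkerd stat reports per-deployment success rate, RPS, and latency)
$ linkerd stat deploy -n default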
And that all works for me. Can you try it in your environment?
Hi @klingerf, sorry, I just missed a brace when copying. This service has been running in production for the past 254 days at 1k rps without linkerd.
The problem with the timeouts is that they're inconsistent and only show up after the proxy has been running for a few hours. I saw someone opened another issue about something similar, a memory leak in the proxy container; I think #2012 is related.
It happens only under high traffic, which is why I couldn't replicate it on staging. I've been running linkerd in staging for the past month, and only after verifying that everything worked did I move to production; that's when I found all sorts of issues...
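For anyone trying to reproduce this under load, a sketch using the hey load generator (an example only; the URL and numbers here are made up, our real traffic is around 1k rps):
hey -z 5m -c 100 -q 10 http://cdn.staging.example.com/cookie/bundle.js   # 100 workers x 10 qps = ~1000 rps (example values)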
Thanks again for your help, LMK if you need anything else on my end.
@klingerf I have seen behavior similar to what @vic3lord described: a few hours after injecting linkerd2 into nginx, memory was around 3+ and CPU at 100%; it crashed multiple times and then finally stopped working.
We removed linkerd2 from nginx as a temporary fix. Any solution for this issue would help us.
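Removing it was just a matter of re-applying the original, un-injected manifests so the pods roll back without the sidecar, roughly:
kubectl apply -f nginx.yml   # the original manifest, never passed through linkerd inject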
@vic3lord @etsrepo Thanks for the additional details. I didn't realize from reading the initial description of this issue that the timeouts only happen after a few hours of high traffic. I agree that this sounds similar to the reports in #2012.
The fix for #2012 was shipped with the edge-19.1.1 release. @vic3lord, @etsrepo, can you try upgrading the linkerd proxies in your nginx setups to see if that fixes this issue?
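A sketch of one way to do the upgrade, assuming you install the edge CLI and then re-inject your manifests (please verify the installer URL and exact steps against the edge-19.1.1 release notes):
# install the edge CLI (installer URL assumed; check the release notes)
curl -sL https://run.linkerd.io/install-edge | sh
export PATH=$PATH:$HOME/.linkerd2/bin
# re-inject so the pods restart with the newer proxy
linkerd inject nginx.yml | kubectl apply -f -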
I injected it into a few services; I'll monitor closely for the next few days and close the issue if everything is fine.
@klingerf I won't be able to test the nginx deployment; we had already removed linkerd from it.
I will update if we can inject again.
Hi @klingerf,
I've had the proxy injected into a few nginx services for a few days, and everything seems to work fine except for one service. That one is not an "internal" service: it serves users from our GCE ingress and gets more traffic than the others.
This is the error I get from nginx after injecting:
1024 worker_connections are not enough
And from the linkerd-proxy of this pod:
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("sdk.default.svc.cluster.local")), port: 80 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
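As a side note on the worker_connections error: one mitigation is raising the cap in the events block of nginx.conf; the value below is an arbitrary example, not a tested recommendation for this workload:
events {
    # default is 1024; meshed traffic can exhaust it under load
    worker_connections 10240;   # example value, tune for your workload
}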
@vic3lord Thanks for checking it out and reporting back! The new issue that you're seeing is described in #2118. Please watch that issue for a fix, and I'll close this one out in the meantime.