Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
NGINX Ingress controller version:
0.20.0
Kubernetes version (use kubectl version):
1.9.6
Environment:
Production
What happened:
After upgrading the NGINX ingress controller from 0.15.0 to 0.20.0, the ingress controller pods crash regularly after several timeout messages on the liveness probe. The ingress controller pods run on separate VMs from all other pods. We need version 0.20.0 because we want to set use-forwarded-headers: "false" in the nginx ConfigMap to avoid the security vulnerability of a user forging the headers to bypass the nginx whitelist.
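For reference, a minimal sketch of setting this in the controller ConfigMap, assuming the ingress-nginx namespace and a ConfigMap named nginx-configuration (both depend on how the controller was installed):
# Assumed namespace and ConfigMap name; adjust to match your installation.
kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"use-forwarded-headers":"false"}}'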
What you expected to happen:
Stable behavior of nginx ingress controller pods as in version 0.15.0.
How to reproduce it (as minimally and precisely as possible):
Update the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0 to
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 on an existing nginx ingress controller deployment.
Anything else we need to know:
Logs from the events:
2018-11-21 17:24:25 +0100 CET 2018-11-21 17:23:05 +0100 CET 6 nginx-ingress-controller-7d47db4569-9bxtz.1569303cf3aebbba Pod spec.containers{nginx-ingress-controller} Warning Unhealthy kubelet, k8s-dmz-81594228-0 Liveness probe failed: Get http://xx.xx.xx.xx:10254/healthz:: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-11-21 17:24:26 +0100 CET 2018-11-21 17:24:26 +0100 CET 1 nginx-ingress-controller-7d47db4569-9bxtz.1569304fae92655c Pod spec.containers{nginx-ingress-controller} Normal Killing kubelet, k8s-dmz-81594228-0 Killing container with id docker://nginx-ingress-controller:Container failed liveness probe.. Container will be killed and recreated.
We have tried increasing the timeoutSeconds on the liveness probe to 4s and also adding --enable-dynamic-configuration=false to the args of the nginx deployment.
With this configuration the number of timeouts decreased, but after a certain load from the apps on the platform the timeouts become regular again.
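As a sketch, assuming the deployment is named nginx-ingress-controller in the ingress-nginx namespace and the controller is the first container (all assumptions), those two changes could be applied like this:
# Raise the liveness probe timeout (assumed deployment/namespace/container index).
kubectl -n ingress-nginx patch deployment nginx-ingress-controller --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":4}]'
# Append the flag to the container args list.
kubectl -n ingress-nginx patch deployment nginx-ingress-controller --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enable-dynamic-configuration=false"}]'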
Logs from the nginx pods in debug mode with a 3-second timeout:
{"log":"E1121 13:30:34.808413 5 controller.go:232] Error getting ConfigMap \"kube-system/udp-services\": no object matching key \"kube-system/udp-services\" in local store\n","stream":"stderr","time":"2018-11-21T13:30:34.818557076Z"}
{"log":"I1121 13:30:37.500168 5 main.go:158] Received SIGTERM, shutting down\n","stream":"stderr","time":"2018-11-21T13:30:37.501123038Z"}
{"log":"I1121 13:30:37.500229 5 nginx.go:340] Shutting down controller queues\n","stream":"stderr","time":"2018-11-21T13:30:37.501167238Z"}
{"log":"I1121 13:30:37.500276 5 nginx.go:348] Stopping NGINX process\n","stream":"stderr","time":"2018-11-21T13:30:37.501203538Z"}
@EcaterinaGr please post the start of the log. You may be affected by https://github.com/Azure/AKS/issues/435; if so, you should follow that issue.
@aledbf, are you asking for the start of the event logs or of the pod logs?
Actually, before the SIGTERM we see only the normal request logs on the controllers and no other errors. The SIGTERM is sent by the kubelet (as we can see from the events), but we cannot identify the reason.
We are not using AKS; we use ACS engine, and nginx was working well on our platform with the previous version 0.15.0.
Grabbed more logs from the docker container itself before it was killed:
2018/11/21 19:30:15 [alert] 2187#2187: *84403 open socket #47 left in connection 19
2018/11/21 19:30:15 [alert] 2187#2187: aborting
W1121 19:30:22.655014 6 controller.go:773] service utm-demo/todoapi-svc-demo does not have any active endpoints
W1121 19:30:22.655191 6 controller.go:773] service aviation-data-prod/rawstorage-service does not have any active endpoints
W1121 19:30:22.655466 6 controller.go:773] service atfm-dev/innove-gateway-service does not have any active endpoints
W1121 19:30:22.657725 6 controller.go:773] service utm-dev/utm-backend-api-svc-test does not have any active endpoints
W1121 19:30:22.658510 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658665 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658768 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.659129 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
E1121 19:30:22.660605 6 controller.go:221] unexpected error reading configmap kube-system/tcp-services: configmap kube-system/tcp-services was not found
E1121 19:30:22.660670 6 controller.go:221] unexpected error reading configmap kube-system/udp-services: configmap kube-system/udp-services was not found
I1121 19:30:25.914652 6 main.go:127] Received SIGTERM, shutting down
I1121 19:30:25.914678 6 nginx.go:362] shutting down controller queues
I1121 19:30:25.914709 6 status.go:119] updating status of Ingress rules (remove)
I1121 19:30:25.954696 6 nginx.go:370] stopping NGINX process...
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
2018/11/21 19:30:25 [notice] 2401#2401: signal process started
Hi,
I wanted to note that I had the same constant crashing with 0.20 on GCP. I updated to 0.21 just now, and it is better, but the crashes still occur. This time I saw in the logs:
I1123 10:25:15.605676 9 backend_ssl.go:189] Updating local copy of SSL certificate "default/prow-tls" with missing intermediate CA certs
I1123 10:29:47.579872 9 main.go:163] Received SIGTERM, shutting down
I1123 10:29:47.579901 9 nginx.go:343] Shutting down controller queues
I1123 10:29:47.579917 9 status.go:200] updating status of Ingress rules (remove)
Related to this: https://github.com/kubernetes/ingress-nginx/issues/2833
Hello, any updates regarding this issue?
As soon as we add certificates (created the same way as on our dev clusters), nginx starts to crash.
As long as one instance is alive it actually works, but it does not take long before all instances finally crash :/
We do not observe this on Kubernetes 1.11.3, but we do on 1.12.3.
Is there a solution for this bug? Thanks.
We have been experiencing similar issues where we see:
Warning Unhealthy 8m5s (x12 over 20m) kubelet, ip-xxx-eu-central-1.compute.internal Readiness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 8m1s (x11 over 20m) kubelet, ip-xxx.eu-central-1.compute.internal Liveness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
much more frequently for some of our nginx pods than we did in the past, and it is leading to public-facing errors; sometimes it seems to manifest as certificate errors (we do TLS termination with nginx). I don't know exactly when it started, but we began noticing it on 0.20.0, and now on 0.22.0 we are still seeing these issues.
Our probe timeout is set to 1s, and we do not see any errors in the nginx controller itself. As of now we have no idea why this is happening.
We are running on AWS with a kops cluster on k8s 1.11.6.
@Globegitter what do memory and CPU utilization look like during that time? It could be network related too.
If it is feasible, you could also try downgrading to the earlier ingress version that worked well for you and confirm that the problem is caused by ingress-nginx.
@ElvinEfendi do you mean specifically these nginx pods, or node/cluster wide? Either way, we still have to investigate further, and we have been serving quite a bit more traffic than a month ago. I just found that this issue seemed interestingly fitting, but it could very well be a coincidence.
Unfortunately we are using a 0.20.0-specific feature, so downgrading is going to be a bit more difficult, but depending on how this continues we will definitely do some investigation on that next week.
You mean specifically of these nginx pods
I guess both. I'm suggesting that to make sure the timeouts are not happening because the pod is too busy doing other things and therefore cannot respond to health checks in time.
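A quick way to watch this, assuming metrics-server (or Heapster) is available and the controller runs in the ingress-nginx namespace (an assumption):
# Poll CPU/memory of the controller pods every 10 seconds; namespace is an assumption.
while true; do kubectl top pod -n ingress-nginx; sleep 10; done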
@ElvinEfendi yeah, that is a good pointer: over the last two hours or so only one pod has been crashing a fair bit, and it is the pod with much higher CPU and memory usage than the others. That is very interesting; why would that one pod have much higher usage than our other three pods? Furthermore, our cluster-wide traffic is <2000 req/s, so even if all of it hit a single nginx I would still expect it to be able to respond to health probes.
@Globegitter maybe you can help us test this by running some scripts :)
kubectl get pod -o wide -n ingress-nginx
SSH to a node in the cluster
Build a list with the pod IP addresses and run the following script:
POD_IPS="XX.XXX.XXX XX.XXX.XXX" # using the IPs from kubectl
while true; do
  sleep 1
  for x in $POD_IPS; do
    echo "[$(date)] http://$x:10254/healthz http status:" $(curl -I -o /dev/null -s -w "%{http_code}::%{time_namelookup}::%{time_connect}::%{time_starttransfer}::%{time_total}\n" http://$x:10254/healthz)
  done
done
This will print something like
[Fri Jan 18 15:47:33 UTC 2019] http://100.96.9.248:10254/healthz http status: 200::0.000029::0.000112::0.001546::0.001582
[Fri Jan 18 15:47:33 UTC 2019] http://100.96.3.200:10254/healthz http status: 200::0.000023::0.000817::0.002746::0.002782
This can help us to test what @ElvinEfendi said in the previous comment
Edit: maybe I should put this in a k8s Job to help debug this issue.
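A rough sketch of what such a Job could look like; the curlimages/curl image, the placeholder pod IPs, and the default namespace are all assumptions to adjust before applying:
# Runs the same probe loop from inside the cluster for up to 10 minutes.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: healthz-probe-debug
spec:
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: probe
        image: curlimages/curl
        command: ["/bin/sh", "-c"]
        args:
        - |
          POD_IPS="XX.XXX.XXX XX.XXX.XXX"
          while true; do
            sleep 1
            for x in $POD_IPS; do
              echo "[$(date)] http://$x:10254/healthz:" $(curl -o /dev/null -s -w "%{http_code}::%{time_total}" http://$x:10254/healthz)
            done
          done
EOF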
@aledbf Aah yeah, that is a good idea for debugging. We just raised the memory limit and the pods are rolling out again. If the timeout issue persists or pops up again, I will make sure to use the debug script and post the results here.
Edit: And a Job for that would really be useful.
@aledbf it is still happening, here some output:
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.003766::0.011774::0.011802
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000026::0.000000::0.000000::0.000084
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000289::2.548471::2.548508
[Fri Jan 18 16:14:10 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000719::0.010645::0.010668
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000742::0.002636::0.002657
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000024::0.000305::0.001203::0.001222
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000017::0.000239::2.434425::2.434456
[Fri Jan 18 16:14:14 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000700::0.003581::0.003603
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000023::0.000735::0.002499::0.002522
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000260::0.004873::0.004893
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000015::0.000253::2.936678::2.936709
[Fri Jan 18 16:14:18 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000730::0.005554::0.005575
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000835::0.004889::0.004912
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000029::0.000293::0.004889::0.004909
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000015::0.000245::0.000943::0.000963
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.001020::0.003824::0.003846
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000713::0.004364::0.004391
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000036::0.000363::0.001800::0.001818
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000263::0.001097::0.001117
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000684::0.002500::0.002515
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000019::0.000715::0.002270::0.002295
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.002707::0.005841::0.005860
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000243::0.000742::0.000760
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000016::0.004325::0.006078::0.006114
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000029::0.000797::0.007521::0.007544
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000019::0.000263::0.003320::0.003340
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000252::0.001245::0.001263
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000680::0.002750::0.002767
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000723::0.002503::0.002525
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000016::0.000238::1.813099::1.813174
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000017::0.000262::0.001793::0.001815
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.000799::0.002882::0.002901
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000738::0.002451::0.002471
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000259::3.507546::3.507583
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000029::0.000321::0.000890::0.000914
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000026::0.000910::0.004054::0.004078
[Fri Jan 18 16:14:30 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000781::0.158132::0.158176
[Fri Jan 18 16:14:31 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000326::2.694099::2.694130
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000023::0.000311::0.004423::0.004448
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000742::0.003229::0.003247
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000782::0.004251::0.004272
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000241::2.955700::2.955730
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000262::0.001195::0.001246
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.001047::0.004546::0.004565
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000712::0.002325::0.002363
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000268::2.350015::2.350049
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.29.81:10254/healthz http status: 000::0.000023::0.000000::0.000000::0.000069
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000712::0.005419::0.005438
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.001377::0.004157::0.004178
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000240::2.529109::2.529148
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000258::0.000976::0.000996
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000737::0.007455::0.007473
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000763::0.003782::0.003808
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000321::1.605230::1.605260
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000263::0.003015::0.003037
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000039::0.000741::0.006094::0.006114
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000708::0.002475::0.002499
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000017::0.000250::0.000817::0.000833
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000040::0.000250::0.008481::0.008504
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000694::0.004170::0.004189
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000017::0.000761::0.008205::0.008231
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000414::0.001783::0.001802
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000211::0.504140::0.504379
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000717::0.004022::0.004043
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000740::0.002447::0.002472
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000018::0.000275::0.001992::0.002008
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::3.389365::3.389403
[Fri Jan 18 16:14:54 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000022::0.000727::0.004877::0.004900
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000815::0.002551::0.002571
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000264::0.000845::0.000871
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000016::0.000233::2.998702::2.998749
[Fri Jan 18 16:14:58 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000761::0.011797::0.011820
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000724::0.002941::0.002962
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000021::0.000262::0.001866::0.001886
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000025::0.000298::2.891728::2.891762
[Fri Jan 18 16:15:02 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000716::0.002861::0.002882
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.002314::0.006595::0.006644
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000027::0.001592::0.002399::0.002424
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000537::3.195221::3.195255
[Fri Jan 18 16:15:06 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000738::0.002648::0.002672
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000022::0.000738::0.003204::0.003243
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000262::0.002440::0.002463
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::2.069995::2.070088
[Fri Jan 18 16:15:09 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000710::0.002243::0.002261
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000766::0.020720::0.020747
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000019::0.000000::0.000000::0.000069
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000242::2.151019::2.151056
[Fri Jan 18 16:15:12 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000688::0.002563::0.002586
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000727::0.008094::0.008117
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000286::0.001543::0.001564
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000244::1.980282::1.980323
[Fri Jan 18 16:15:15 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000055::0.000778::0.004642::0.004668
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.000897::0.003045::0.003076
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000323::0.006020::0.006047
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000308::2.500127::2.500182
[Fri Jan 18 16:15:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000020::0.000723::0.007644::0.007663
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000737::0.002864::0.002886
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000280::0.003027::0.003052
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000303::1.673258::1.673299
[Fri Jan 18 16:15:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000034::0.002531::0.004833::0.004939
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.000776::0.007756::0.007789
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000026::0.000483::0.005270::0.005308
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000281::0.274060::0.274095
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000025::0.000857::0.020886::0.020911
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000709::0.002947::0.002970
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000018::0.000241::3.363062::3.363098
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.001813::0.009423::0.009447
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000943::0.037287::0.037332
[Fri Jan 18 16:15:28 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000745::0.003082::0.003106
Some of the 500s mixed in are from when the pod was taken out of service by the k8s probes, but it is interesting to see that health checks sometimes take up to 3.5s.
@Globegitter please check the log of the pod with IP 100.96.29.19.
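Something along these lines should locate the pod from its IP and pull its logs (the ingress-nginx namespace is an assumption; add -p if the container already restarted):
# Find the pod that owns the IP, then fetch its logs.
kubectl get pods -n ingress-nginx -o wide | grep 100.96.29.19
kubectl logs -n ingress-nginx <pod-name>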
Just checked: we do not have a log collector active for the nginx logs yet, so we only see part of the logs of the previous pod that errored. I do see a few errors like 2019/01/18 16:34:00 [error] 139#139: *12429 upstream prematurely closed connection while reading response header from upstream, client: xxxxxx, server: xxxxx, request: "GET /about HTTP/2.0", upstream: "http://100.96.29.102:8001/about", host: "xxxxx", referrer: "xxxxxxx"
and at the end:
I0118 16:34:08.456836 7 nginx.go:385] NGINX process has stopped
I0118 16:34:08.456938 7 main.go:175] Handled quit, awaiting Pod deletion
I0118 16:34:18.457336 7 main.go:178] Exiting with 0
Otherwise it is just 900+ lines of normal JSON request logs.
I can take a look next week at exporting the logs, or tailing them at the same time. Would you expect to see something in the error log? Or just the normal response logging?
This issue is hard to reproduce, so I created https://github.com/kubernetes/ingress-nginx/pull/3684 to see if we can narrow down the scope of the issue. Basically, I added additional logging to the health check inside the ingress controller to detect where the error occurs.
Please help us by testing the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev
Note: for security reasons, the probes do not show the exact cause of the failure unless you increase the kubelet log level to a value higher than five (this is why I added the additional logs).
Note 2: this PR will never be merged, for the same reason; it exists just to track down the cause.
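Switching an existing deployment to that image could look roughly like this (the namespace, deployment name, and container name are assumptions):
# Point the controller container at the dev image; adjust names to your deployment.
kubectl -n ingress-nginx set image deployment/nginx-ingress-controller \
  nginx-ingress-controller=quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev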
@aledbf Thanks, I will start investigating now. On Friday we were able to work around the issue by increasing the number of replicas, but it has started happening again. I will test this image.
So, we have some output:
E0121 15:17:44.742314 7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:17:48.427437 7 nginx.go:385] NGINX process has stopped
I0121 15:17:48.428097 7 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:17:54.742273 7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:17:58.428663 7 main.go:178] Exiting with 0
for another pod:
E0121 15:15:06.599845 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:15:09.895122 6 nginx.go:385] NGINX process has stopped
I0121 15:15:09.895267 6 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:15:16.600000 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:15:19.896690 6 main.go:178] Exiting with 0
But again, that is just from the logs I could tail; at least it is giving us something more. I am currently tailing all running pods with a grep for healthcheck, so if this happens again we should see the relevant logs there.
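A simple way to tail every controller pod for these messages could be something like this (the ingress-nginx namespace and the app=nginx-ingress label are assumptions):
# Follow the logs of each controller pod and keep only the health check lines.
for p in $(kubectl -n ingress-nginx get pods -l app=nginx-ingress -o name); do
  kubectl -n ingress-nginx logs -f "$p" | grep --line-buffered healthcheck &
done
wait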
Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080
This is "normal" because the health check starts before nginx. You should see only one of this error in the log.
E0121 15:17:44.742314
....
E0121 15:17:54.742273
This is not ok. This output means the ingress controller startup (nginx binary) takes more than ten seconds for the initial sync, which is not normal.
Yep, all the logs look the same:
E0121 15:37:26.600460 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:36.601935 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:46.600979 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:56.600502 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:38:06.600290 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller public-nginx-ingress-controller-6d4c798cf6-tt7z9 E0121 15:38:16.600088 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:38:35.487322 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:45.536244 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:55.487948 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:39:05.487299 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
Here again, I just looked over the logs of all the running pods and grepped for healthcheck.
Edit: Ah, interesting. I will try to keep an eye specifically on startup; some of these definitely happen to instances that have been running fine for a while, but we have not looked at startup issues yet.
@Globegitter please check and post the generated nginx.conf
kubectl exec -n ingress-nginx <pod> -- cat /etc/nginx/nginx.conf | grep 18080
Is IPv6 enabled (or is only IPv6 enabled)?
What CNI provider are you using?
To those affected by this issue:
Please help us to test a fix for this with https://github.com/kubernetes/ingress-nginx/pull/3684 using the image
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev
The mentioned PR contains a refactoring of the nginx server used for the health check and the Lua configuration, replacing the TCP port with a Unix socket.
Hi, we were affected by this issue. We saw that 150 connections per second would cause our ingress controller to restart. When we checked, it was failing health checks just as described in this issue. We would sometimes see timeouts as high as 10.x seconds.
We tested again with:
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev
This dev image fully resolved our issue; we are no longer seeing restarts at high connection rates.
Do I understand correctly that https://github.com/kubernetes/ingress-nginx/pull/4531 (shipped with 0.26.0) practically reverted the changes made for this issue in https://github.com/kubernetes/ingress-nginx/pull/3684 (shipped with 0.23.0)?
@sslavic yes
This could improve the situation once released: https://github.com/kubernetes/ingress-nginx/pull/5832.