Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
NGINX Ingress controller version:
0.20.0
Kubernetes version (use kubectl version):
1.9.6
Environment:
Production
What happened:
After upgrading the NGINX ingress controller from 0.15.0 to 0.20.0, the ingress controller pods crash regularly after several timeout messages on the liveness probe. The ingress controller pods run on separate VMs from all other pods. We need version 0.20.0 because we want to set use-forwarded-headers: "false" in the nginx ConfigMap to avoid the security vulnerability of a user forging the headers to bypass the nginx whitelist.
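For reference, a minimal sketch of setting this in the controller ConfigMap, assuming the ingress-nginx namespace and a ConfigMap named nginx-configuration (both depend on how the controller was installed):
# Assumed namespace and ConfigMap name; adjust to match your installation.
kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"use-forwarded-headers":"false"}}'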
What you expected to happen:
Stable behavior of nginx ingress controller pods as in version 0.15.0.
How to reproduce it (as minimally and precisely as possible):
Update the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0 to
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 on an existing nginx ingress controller deployment.
Anything else we need to know:
Logs from the events:
2018-11-21 17:24:25 +0100 CET 2018-11-21 17:23:05 +0100 CET 6 nginx-ingress-controller-7d47db4569-9bxtz.1569303cf3aebbba Pod spec.containers{nginx-ingress-controller} Warning Unhealthy kubelet, k8s-dmz-81594228-0 Liveness probe failed: Get http://xx.xx.xx.xx:10254/healthz:: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-11-21 17:24:26 +0100 CET 2018-11-21 17:24:26 +0100 CET 1 nginx-ingress-controller-7d47db4569-9bxtz.1569304fae92655c Pod spec.containers{nginx-ingress-controller} Normal Killing kubelet, k8s-dmz-81594228-0 Killing container with id docker://nginx-ingress-controller:Container failed liveness probe.. Container will be killed and recreated.
We have tried increasing the timeoutSeconds on the liveness probe to 4s and also adding --enable-dynamic-configuration=false to the args of the nginx deployment.
With this configuration the number of timeouts decreased, but after a certain load from the apps on the platform the timeouts become regular again.
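As a sketch, assuming the deployment is named nginx-ingress-controller in the ingress-nginx namespace and the controller is the first container (all assumptions), those two changes could be applied like this:
# Raise the liveness probe timeout (assumed deployment/namespace/container index).
kubectl -n ingress-nginx patch deployment nginx-ingress-controller --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":4}]'
# Append the flag to the container args list.
kubectl -n ingress-nginx patch deployment nginx-ingress-controller --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enable-dynamic-configuration=false"}]'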
Logs from the nginx pods in debug mode with a 3-second timeout:
{"log":"E1121 13:30:34.808413 5 controller.go:232] Error getting ConfigMap \"kube-system/udp-services\": no object matching key \"kube-system/udp-services\" in local store\n","stream":"stderr","time":"2018-11-21T13:30:34.818557076Z"}
{"log":"I1121 13:30:37.500168 5 main.go:158] Received SIGTERM, shutting down\n","stream":"stderr","time":"2018-11-21T13:30:37.501123038Z"}
{"log":"I1121 13:30:37.500229 5 nginx.go:340] Shutting down controller queues\n","stream":"stderr","time":"2018-11-21T13:30:37.501167238Z"}
{"log":"I1121 13:30:37.500276 5 nginx.go:348] Stopping NGINX process\n","stream":"stderr","time":"2018-11-21T13:30:37.501203538Z"}
@EcaterinaGr please post the start of the log. You may be affected by https://github.com/Azure/AKS/issues/435; if so, you should follow that issue.
@aledbf, are you asking for the start of the event logs or of the pod logs?
Actually, before the SIGTERM we see only the normal request logs on the controllers and no other errors. The SIGTERM is sent by the kubelet (as we can see from the events), but we cannot identify the reason.
We are not using AKS; we use ACS engine, and nginx was working well on our platform with the previous version 0.15.0.
Grabbed more logs from the docker container itself before it was killed:
2018/11/21 19:30:15 [alert] 2187#2187: *84403 open socket #47 left in connection 19
2018/11/21 19:30:15 [alert] 2187#2187: aborting
W1121 19:30:22.655014 6 controller.go:773] service utm-demo/todoapi-svc-demo does not have any active endpoints
W1121 19:30:22.655191 6 controller.go:773] service aviation-data-prod/rawstorage-service does not have any active endpoints
W1121 19:30:22.655466 6 controller.go:773] service atfm-dev/innove-gateway-service does not have any active endpoints
W1121 19:30:22.657725 6 controller.go:773] service utm-dev/utm-backend-api-svc-test does not have any active endpoints
W1121 19:30:22.658510 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658665 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658768 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.659129 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
E1121 19:30:22.660605 6 controller.go:221] unexpected error reading configmap kube-system/tcp-services: configmap kube-system/tcp-services was not found
E1121 19:30:22.660670 6 controller.go:221] unexpected error reading configmap kube-system/udp-services: configmap kube-system/udp-services was not found
I1121 19:30:25.914652 6 main.go:127] Received SIGTERM, shutting down
I1121 19:30:25.914678 6 nginx.go:362] shutting down controller queues
I1121 19:30:25.914709 6 status.go:119] updating status of Ingress rules (remove)
I1121 19:30:25.954696 6 nginx.go:370] stopping NGINX process...
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
2018/11/21 19:30:25 [notice] 2401#2401: signal process started
Hi,
I wanted to note that I had the same constant crashing with 0.20 on GCP. I updated to 0.21 just now, and it is better, but the crashes still occur. This time I saw in the logs:
I1123 10:25:15.605676 9 backend_ssl.go:189] Updating local copy of SSL certificate "default/prow-tls" with missing intermediate CA certs
I1123 10:29:47.579872 9 main.go:163] Received SIGTERM, shutting down
I1123 10:29:47.579901 9 nginx.go:343] Shutting down controller queues
I1123 10:29:47.579917 9 status.go:200] updating status of Ingress rules (remove)
Related to this: https://github.com/kubernetes/ingress-nginx/issues/2833
Hello, any updates regarding this issue?
As soon as we add certificates (created the same way as on our dev clusters), nginx starts to crash.
As long as one instance is alive it actually works, but it does not take long before all instances finally crash :/
We do not observe this on Kubernetes 1.11.3, but we do on 1.12.3.
Is there a solution for this bug? Thanks.
We have been experiencing similar issues where we see:
Warning Unhealthy 8m5s (x12 over 20m) kubelet, ip-xxx-eu-central-1.compute.internal Readiness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 8m1s (x11 over 20m) kubelet, ip-xxx.eu-central-1.compute.internal Liveness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
much more frequently for some of our nginx pods than we did in the past, and it is leading to public-facing errors; sometimes it seems to manifest as certificate errors (we do TLS termination with nginx). I don't know exactly when it started, but we began noticing it on 0.20.0, and now on 0.22.0 we are still seeing these issues.
Our probe timeout is set to 1s, and we do not see any errors in the nginx controller itself. As of now we have no idea why this is happening.
We are running on AWS with a kops cluster on k8s 1.11.6.
@Globegitter what do memory and CPU utilization look like during that time? It could be network related too.
If it is feasible, you could also try downgrading to the earlier ingress version that worked well for you and confirm that the problem is caused by ingress-nginx.
@ElvinEfendi do you mean specifically these nginx pods, or node/cluster wide? Either way, we still have to investigate further, and we have been serving quite a bit more traffic than a month ago. I just found that this issue seemed interestingly fitting, but it could very well be a coincidence.
Unfortunately we are using a 0.20.0-specific feature, so downgrading is going to be a bit more difficult, but depending on how this continues we will definitely do some investigation on that next week.
You mean specifically of these nginx pods
I guess both. I'm suggesting that to make sure the timeouts are not happening because the pod is too busy doing other things and therefore cannot respond to health checks in time.
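A quick way to watch this, assuming metrics-server (or Heapster) is available and the controller runs in the ingress-nginx namespace (an assumption):
# Poll CPU/memory of the controller pods every 10 seconds; namespace is an assumption.
while true; do kubectl top pod -n ingress-nginx; sleep 10; done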
@ElvinEfendi yeah, that is a good pointer: over the last two hours or so only one pod has been crashing a fair bit, and it is the pod with much higher CPU and memory usage than the others. That is very interesting; why would that one pod have much higher usage than our other three pods? Furthermore, our cluster-wide traffic is <2000 req/s, so even if all of it hit a single nginx I would still expect it to be able to respond to health probes.
@Globegitter maybe you can help us test this by running some scripts :)
kubectl get pod -o wide -n ingress-nginx
SSH to a node in the cluster
Build a list with the pod IP addresses and run the following script:
POD_IPS="XX.XXX.XXX XX.XXX.XXX" # using the IPs from kubectl
while true; do
  sleep 1
  for x in $POD_IPS; do
    echo "[$(date)] http://$x:10254/healthz http status:" $(curl -I -o /dev/null -s -w "%{http_code}::%{time_namelookup}::%{time_connect}::%{time_starttransfer}::%{time_total}\n" http://$x:10254/healthz)
  done
done
This will print something like
[Fri Jan 18 15:47:33 UTC 2019] http://100.96.9.248:10254/healthz http status: 200::0.000029::0.000112::0.001546::0.001582
[Fri Jan 18 15:47:33 UTC 2019] http://100.96.3.200:10254/healthz http status: 200::0.000023::0.000817::0.002746::0.002782
This can help us to test what @ElvinEfendi said in the previous comment
Edit: maybe I should put this in a k8s Job to help debug this issue.
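A rough sketch of what such a Job could look like; the curlimages/curl image, the placeholder pod IPs, and the default namespace are all assumptions to adjust before applying:
# Runs the same probe loop from inside the cluster for up to 10 minutes.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: healthz-probe-debug
spec:
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: probe
        image: curlimages/curl
        command: ["/bin/sh", "-c"]
        args:
        - |
          POD_IPS="XX.XXX.XXX XX.XXX.XXX"
          while true; do
            sleep 1
            for x in $POD_IPS; do
              echo "[$(date)] http://$x:10254/healthz:" $(curl -o /dev/null -s -w "%{http_code}::%{time_total}" http://$x:10254/healthz)
            done
          done
EOF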
@aledbf Aah yeah, that is a good idea for debugging. We just raised the memory limit and the pods are rolling out again. If the timeout issue persists or pops up again, I will make sure to use the debug script and post the results here.
Edit: And a Job for that would really be useful.
@aledbf it is still happening, here some output:
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.003766::0.011774::0.011802
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000026::0.000000::0.000000::0.000084
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000289::2.548471::2.548508
[Fri Jan 18 16:14:10 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000719::0.010645::0.010668
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000742::0.002636::0.002657
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000024::0.000305::0.001203::0.001222
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000017::0.000239::2.434425::2.434456
[Fri Jan 18 16:14:14 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000700::0.003581::0.003603
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000023::0.000735::0.002499::0.002522
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000260::0.004873::0.004893
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000015::0.000253::2.936678::2.936709
[Fri Jan 18 16:14:18 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000730::0.005554::0.005575
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000835::0.004889::0.004912
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000029::0.000293::0.004889::0.004909
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000015::0.000245::0.000943::0.000963
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.001020::0.003824::0.003846
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000713::0.004364::0.004391
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000036::0.000363::0.001800::0.001818
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000263::0.001097::0.001117
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000684::0.002500::0.002515
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000019::0.000715::0.002270::0.002295
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.002707::0.005841::0.005860
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000243::0.000742::0.000760
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000016::0.004325::0.006078::0.006114
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000029::0.000797::0.007521::0.007544
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000019::0.000263::0.003320::0.003340
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000252::0.001245::0.001263
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000680::0.002750::0.002767
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000723::0.002503::0.002525
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000016::0.000238::1.813099::1.813174
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000017::0.000262::0.001793::0.001815
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.000799::0.002882::0.002901
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000738::0.002451::0.002471
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000259::3.507546::3.507583
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000029::0.000321::0.000890::0.000914
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000026::0.000910::0.004054::0.004078
[Fri Jan 18 16:14:30 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000781::0.158132::0.158176
[Fri Jan 18 16:14:31 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000326::2.694099::2.694130
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000023::0.000311::0.004423::0.004448
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000742::0.003229::0.003247
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000782::0.004251::0.004272
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000241::2.955700::2.955730
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000262::0.001195::0.001246
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.001047::0.004546::0.004565
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000712::0.002325::0.002363
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000268::2.350015::2.350049
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.29.81:10254/healthz http status: 000::0.000023::0.000000::0.000000::0.000069
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000712::0.005419::0.005438
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.001377::0.004157::0.004178
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000240::2.529109::2.529148
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000258::0.000976::0.000996
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000737::0.007455::0.007473
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000763::0.003782::0.003808
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000321::1.605230::1.605260
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000263::0.003015::0.003037
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000039::0.000741::0.006094::0.006114
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000708::0.002475::0.002499
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000017::0.000250::0.000817::0.000833
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000040::0.000250::0.008481::0.008504
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000694::0.004170::0.004189
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000017::0.000761::0.008205::0.008231
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000414::0.001783::0.001802
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000211::0.504140::0.504379
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000717::0.004022::0.004043
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000740::0.002447::0.002472
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000018::0.000275::0.001992::0.002008
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::3.389365::3.389403
[Fri Jan 18 16:14:54 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000022::0.000727::0.004877::0.004900
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000815::0.002551::0.002571
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000264::0.000845::0.000871
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000016::0.000233::2.998702::2.998749
[Fri Jan 18 16:14:58 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000761::0.011797::0.011820
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000724::0.002941::0.002962
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000021::0.000262::0.001866::0.001886
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000025::0.000298::2.891728::2.891762
[Fri Jan 18 16:15:02 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000716::0.002861::0.002882
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.002314::0.006595::0.006644
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000027::0.001592::0.002399::0.002424
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000537::3.195221::3.195255
[Fri Jan 18 16:15:06 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000738::0.002648::0.002672
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000022::0.000738::0.003204::0.003243
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000262::0.002440::0.002463
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::2.069995::2.070088
[Fri Jan 18 16:15:09 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000710::0.002243::0.002261
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000766::0.020720::0.020747
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000019::0.000000::0.000000::0.000069
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000242::2.151019::2.151056
[Fri Jan 18 16:15:12 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000688::0.002563::0.002586
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000727::0.008094::0.008117
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000286::0.001543::0.001564
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000244::1.980282::1.980323
[Fri Jan 18 16:15:15 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000055::0.000778::0.004642::0.004668
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.000897::0.003045::0.003076
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000323::0.006020::0.006047
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000308::2.500127::2.500182
[Fri Jan 18 16:15:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000020::0.000723::0.007644::0.007663
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000737::0.002864::0.002886
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000280::0.003027::0.003052
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000303::1.673258::1.673299
[Fri Jan 18 16:15:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000034::0.002531::0.004833::0.004939
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.000776::0.007756::0.007789
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000026::0.000483::0.005270::0.005308
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000281::0.274060::0.274095
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000025::0.000857::0.020886::0.020911
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000709::0.002947::0.002970
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000018::0.000241::3.363062::3.363098
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.001813::0.009423::0.009447
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000943::0.037287::0.037332
[Fri Jan 18 16:15:28 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000745::0.003082::0.003106
Some of the 500s mixed in are from when the pod was taken out of service by the k8s probes, but it is interesting to see that health checks sometimes take up to 3.5s.
@Globegitter please check the log of the pod with IP 100.96.29.19.
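Something along these lines should locate the pod from its IP and pull its logs (the ingress-nginx namespace is an assumption; add -p if the container already restarted):
# Find the pod that owns the IP, then fetch its logs.
kubectl get pods -n ingress-nginx -o wide | grep 100.96.29.19
kubectl logs -n ingress-nginx <pod-name>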
Just checked: we do not have a log collector active for the nginx logs yet, so we only see part of the logs of the previous pod that errored. I do see a few errors like 2019/01/18 16:34:00 [error] 139#139: *12429 upstream prematurely closed connection while reading response header from upstream, client: xxxxxx, server: xxxxx, request: "GET /about HTTP/2.0", upstream: "http://100.96.29.102:8001/about", host: "xxxxx", referrer: "xxxxxxx"
and at the end:
I0118 16:34:08.456836 7 nginx.go:385] NGINX process has stopped
I0118 16:34:08.456938 7 main.go:175] Handled quit, awaiting Pod deletion
I0118 16:34:18.457336 7 main.go:178] Exiting with 0
Otherwise it is just 900+ lines of normal JSON request logs.
I can take a look next week at exporting the logs, or tailing them at the same time. Would you expect to see something in the error log? Or just the normal response logging?
This issue is hard to reproduce, so I created https://github.com/kubernetes/ingress-nginx/pull/3684 to see if we can narrow down the scope of the issue. Basically, I added additional logging to the health check inside the ingress controller to detect where the error occurs.
Please help us by testing the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev
Note: for security reasons, the probes do not show the exact cause of the failure unless you increase the kubelet log level to a value higher than five (this is why I added the additional logs).
Note 2: this PR will never be merged, for the same reason; it exists just to track down the cause.
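Switching an existing deployment to that image could look roughly like this (the namespace, deployment name, and container name are assumptions):
# Point the controller container at the dev image; adjust names to your deployment.
kubectl -n ingress-nginx set image deployment/nginx-ingress-controller \
  nginx-ingress-controller=quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev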
@aledbf Thanks, I will start investigating now. On Friday we were able to work around the issue by increasing the number of replicas, but it has started happening again. I will test this image.
So, we have some output:
E0121 15:17:44.742314 7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:17:48.427437 7 nginx.go:385] NGINX process has stopped
I0121 15:17:48.428097 7 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:17:54.742273 7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:17:58.428663 7 main.go:178] Exiting with 0
for another pod:
E0121 15:15:06.599845 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:15:09.895122 6 nginx.go:385] NGINX process has stopped
I0121 15:15:09.895267 6 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:15:16.600000 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:15:19.896690 6 main.go:178] Exiting with 0
But again, that is just from the logs I could tail; at least it is giving us something more. I am currently tailing all running pods with a grep for healthcheck, so if this happens again we should see the relevant logs there.
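A simple way to tail every controller pod for these messages could be something like this (the ingress-nginx namespace and the app=nginx-ingress label are assumptions):
# Follow the logs of each controller pod and keep only the health check lines.
for p in $(kubectl -n ingress-nginx get pods -l app=nginx-ingress -o name); do
  kubectl -n ingress-nginx logs -f "$p" | grep --line-buffered healthcheck &
done
wait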
Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080
This is "normal" because the health check starts before nginx. You should see only one of this error in the log.
E0121 15:17:44.742314
....
E0121 15:17:54.742273
This is not ok. This output means the ingress controller startup (nginx binary) takes more than ten seconds for the initial sync, which is not normal.
Yep, all the logs look the same:
E0121 15:37:26.600460 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:36.601935 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:46.600979 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:56.600502 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:38:06.600290 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller public-nginx-ingress-controller-6d4c798cf6-tt7z9 E0121 15:38:16.600088 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:38:35.487322 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:45.536244 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:55.487948 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:39:05.487299 6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
Here again, I just looked over the logs of all the running pods and grepped for healthcheck.
Edit: Ah, interesting. I will try to keep an eye specifically on startup; some of these definitely happen to instances that have been running fine for a while, but we have not looked at startup issues yet.
@Globegitter please check and post the generated nginx.conf
kubectl exec -n ingress-nginx <pod> -- cat /etc/nginx/nginx.conf | grep 18080
Is IPv6 enabled (or is only IPv6 enabled)?
What CNI provider are you using?
To those affected by this issue:
Please help us to test a fix for this with https://github.com/kubernetes/ingress-nginx/pull/3684 using the image
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev
The mentioned PR contains a refactoring of the nginx server used for the health check and the Lua configuration, replacing the TCP port with a Unix socket.
Hi, we were affected by this issue. We saw that 150 connections per second would cause our ingress controller to restart. When we checked, it was failing health checks just as described in this issue. We would sometimes see timeouts as high as 10.x seconds.
We tested again with:
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev
This dev image fully resolved our issue; we are no longer seeing restarts at high connection rates.
Do I understand correctly that https://github.com/kubernetes/ingress-nginx/pull/4531 (shipped with 0.26.0) practically reverted the changes made for this issue in https://github.com/kubernetes/ingress-nginx/pull/3684 (shipped with 0.23.0)?
@sslavic yes
This could improve the situation once released: https://github.com/kubernetes/ingress-nginx/pull/5832.