TL;DR: jump to https://github.com/kubernetes/ingress-nginx/issues/4735#issuecomment-551188265
kubectl --context=etingroup-production -n nginx-ingress logs -f nginx-ingress-controller-f8f7c9b69-r9rbq
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: 0.26.1
Build: git-2de5a893a
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: openresty/1.15.8.2
-------------------------------------------------------------------------------
I1104 15:31:13.442249 6 flags.go:198] Watching for Ingress class: nginx
W1104 15:31:13.442782 6 flags.go:243] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W1104 15:31:13.443033 6 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1104 15:31:13.443404 6 main.go:182] Creating API client for https://10.55.240.1:443
I1104 15:31:13.457511 6 main.go:226] Running in Kubernetes cluster version v1.14+ (v1.14.7-gke.10) - git (clean) commit 8cea5f8ae165065f0d35e5de5dfa2f73617f02d1 - platform linux/amd64
I1104 15:31:13.466100 6 main.go:90] Validated nginx-ingress/nginx-ingress-default-backend as the default backend.
I1104 15:31:27.613232 6 main.go:101] SSL fake certificate created /etc/ingress-controller/ssl/default-fake-certificate.pem
W1104 15:31:35.091635 6 store.go:635] Unexpected error reading configuration configmap: configmaps "nginx-ingress-controller" not found
I1104 15:31:39.614309 6 nginx.go:263] Starting NGINX Ingress controller
I1104 15:31:47.906011 6 backend_ssl.go:66] Adding Secret "argocd/argocd-secret" to the local store
I1104 15:31:47.911764 6 nginx.go:307] Starting NGINX process
I1104 15:31:47.917560 6 leaderelection.go:241] attempting to acquire leader lease nginx-ingress/ingress-controller-leader-nginx...
I1104 15:31:47.923784 6 backend_ssl.go:66] Adding Secret "etingroup/main-etingroup-pl-tls" to the local store
I1104 15:31:47.929354 6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"etingroup", Name:"etingroup-sync", UID:"35893ce3-84ae-11e9-90aa-42010a840106", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"49361143", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress etingroup/etingroup-sync
I1104 15:31:47.933374 6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"argocd", Name:"argocd-server-http-ingress", UID:"1e332eb9-8451-11e9-9fe8-42010a8400b2", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"49361142", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress argocd/argocd-server-http-ingress
I1104 15:31:47.939507 6 status.go:86] new leader elected: nginx-ingress-controller-f8f7c9b69-smj4z
I1104 15:31:47.951045 6 controller.go:134] Configuration changes detected, backend reload required.
I1104 15:31:49.883927 6 main.go:153] Received SIGTERM, shutting down
I1104 15:31:50.595026 6 nginx.go:390] Shutting down controller queues
I1104 15:31:50.595482 6 status.go:117] updating status of Ingress rules (remove)
W1104 15:31:53.083266 6 template.go:120] unexpected error cleaning template: signal: terminated
E1104 15:31:53.088962 6 controller.go:146] Unexpected failure reloading the backend:
invalid NGINX configuration (empty)
W1104 15:31:53.093849 6 queue.go:130] requeuing initial-sync, err invalid NGINX configuration (empty)
I1104 15:31:53.098883 6 status.go:136] removing address from ingress status ([35.195.148.61])
I1104 15:31:53.197577 6 status.go:274] updating Ingress argocd/argocd-server-http-ingress status from [{35.195.148.61 }] to []
I1104 15:31:53.204586 6 status.go:274] updating Ingress etingroup/etingroup-sync status from [{35.195.148.61 }] to []
I1104 15:31:55.776450 6 nginx.go:406] Stopping NGINX process
2019/11/04 15:31:55 [notice] 31#31: signal process started
I1104 15:31:58.974559 6 nginx.go:419] NGINX process has stopped
I1104 15:31:59.072914 6 main.go:161] Handled quit, awaiting Pod deletion
I1104 15:32:09.568890 6 main.go:164] Exiting with 0
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG
Kubernetes version (use kubectl version):
kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:42:50Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.7-gke.10", GitCommit:"8cea5f8ae165065f0d35e5de5dfa2f73617f02d1", GitTreeState:"clean", BuildDate:"2019-10-05T00:08:10Z", GoVersion:"go1.12.9b4", Compiler:"gc", Platform:"linux/amd64"}
Environment:
Google Kubernetes Engine (GKE)
What happened:
The NGINX Ingress controller restarts in a loop. It always fixes itself after 20-30 minutes.
What you expected to happen:
It should not restart / crash.
How to reproduce it (as minimally and precisely as possible):
I don't know. My blind guess is that it could be related to https://argoproj.github.io/argo-cd/, since nginx-ingress crashes constantly while both run in the same cluster. But more likely it is something in my values.yaml configuration; hard to say. I need the controller for the nginx.ingress.kubernetes.io/whitelist-source-range annotation (a placeholder example is shown below).
It always happens when I downgrade / upgrade nodes, so the cluster has to reschedule everything onto new machines, and also from time to time at random.
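For context, this is roughly how I use that annotation; the host, service name and CIDR below are placeholders, not my real values:

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: example-sync                # placeholder name
  annotations:
    kubernetes.io/ingress.class: nginx
    # only allow this source range to reach the ingress (placeholder CIDR)
    nginx.ingress.kubernetes.io/whitelist-source-range: "203.0.113.0/24"
spec:
  rules:
    - host: sync.example.com        # placeholder host
      http:
        paths:
          - path: /
            backend:
              serviceName: example-sync   # placeholder backend service
              servicePort: 80
```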
My Helm values.yaml:

service:
  externalTrafficPolicy: "Local"
  loadBalancerIP: "35.195.XXX.XX"
  omitClusterIP: true
publishService:
  enabled: true
defaultBackend:
  service:
    omitClusterIP: true
I use Google Kubernetes Engine.
More logs from another crash:
kubectl --context=foo-production -n nginx-ingress logs -f nginx-ingress-controller-f8f7c9b69-zkkgx
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: 0.26.1
Build: git-2de5a893a
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: openresty/1.15.8.2
-------------------------------------------------------------------------------
I1107 12:59:49.823919 6 flags.go:198] Watching for Ingress class: nginx
W1107 12:59:49.824589 6 flags.go:243] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W1107 12:59:49.824798 6 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1107 12:59:49.825271 6 main.go:182] Creating API client for https://10.55.240.1:443
I1107 12:59:50.020006 6 main.go:226] Running in Kubernetes cluster version v1.14+ (v1.14.7-gke.14) - git (clean) commit d7951e52ecdf56a1de9e031c553569ec13849a80 - platform linux/amd64
I1107 12:59:51.362836 6 main.go:90] Validated nginx-ingress/nginx-ingress-default-backend as the default backend.
I1107 13:00:05.219513 6 main.go:101] SSL fake certificate created /etc/ingress-controller/ssl/default-fake-certificate.pem
W1107 13:00:08.541308 6 store.go:635] Unexpected error reading configuration configmap: configmaps "nginx-ingress-controller" not found
I1107 13:00:10.153910 6 nginx.go:263] Starting NGINX Ingress controller
I1107 13:00:13.708645 6 backend_ssl.go:66] Adding Secret "argocd/argocd-secret" to the local store
I1107 13:00:13.710747 6 backend_ssl.go:66] Adding Secret "etingroup/main-etingroup-pl-tls" to the local store
I1107 13:00:13.712569 6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"argocd", Name:"argocd-server-http-ingress", UID:"1e332eb9-8451-11e9-9fe8-42010a8400b2", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"50486517", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress argocd/argocd-server-http-ingress
I1107 13:00:13.712760 6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"etingroup", Name:"etingroup-sync", UID:"35893ce3-84ae-11e9-90aa-42010a840106", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"50486516", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress etingroup/etingroup-sync
I1107 13:00:15.310010 6 nginx.go:307] Starting NGINX process
I1107 13:00:17.372980 6 leaderelection.go:241] attempting to acquire leader lease nginx-ingress/ingress-controller-leader-nginx...
I1107 13:00:18.204723 6 controller.go:134] Configuration changes detected, backend reload required.
I1107 13:00:23.934359 6 leaderelection.go:251] successfully acquired lease nginx-ingress/ingress-controller-leader-nginx
I1107 13:00:24.683612 6 status.go:86] new leader elected: nginx-ingress-controller-f8f7c9b69-zkkgx
I1107 13:00:27.969726 6 main.go:153] Received SIGTERM, shutting down
I1107 13:00:29.800005 6 nginx.go:390] Shutting down controller queues
I1107 13:00:29.800184 6 status.go:117] updating status of Ingress rules (remove)
I1107 13:00:29.838929 6 status.go:136] removing address from ingress status ([35.195.148.61])
I1107 13:00:29.858318 6 nginx.go:406] Stopping NGINX process
I1107 13:00:29.846504 6 status.go:274] updating Ingress etingroup/etingroup-sync status from [] to [{35.195.148.61 }]
I1107 13:00:29.846643 6 status.go:274] updating Ingress argocd/argocd-server-http-ingress status from [] to [{35.195.148.61 }]
2019/11/07 13:00:32 [notice] 30#30: signal process started
I1107 13:00:44.037409 6 nginx.go:419] NGINX process has stopped
I1107 13:00:44.642710 6 main.go:161] Handled quit, awaiting Pod deletion
E1107 13:00:50.088180 6 controller.go:146] Unexpected failure reloading the backend:
exit status 1
2019/11/07 13:00:45 [notice] 37#37: signal process started
2019/11/07 13:00:45 [error] 37#37: invalid PID number "" in "/tmp/nginx.pid"
nginx: [error] invalid PID number "" in "/tmp/nginx.pid"
W1107 13:00:50.088217 6 queue.go:130] requeuing initial-sync, err exit status 1
2019/11/07 13:00:45 [notice] 37#37: signal process started
2019/11/07 13:00:45 [error] 37#37: invalid PID number "" in "/tmp/nginx.pid"
nginx: [error] invalid PID number "" in "/tmp/nginx.pid"
I1107 13:00:54.750748 6 main.go:164] Exiting with 0
This might be useful:

> I1104 15:31:49.883927 6 main.go:153] Received SIGTERM, shutting down
This is not a crash. The kubelet is instructing the pod to terminate.
Please use kubectl describe <ingress pod> to see the reason
From the screenshot, it seems the kubelet cannot reach the ingress controller pod
> This is not a crash.

When it is happening, ALL ingresses stop working, so users can't connect to any HTTP(S) service in k8s. We can call it a bug then :)
> From the screenshot, it seems the kubelet cannot reach the ingress controller pod

Any tips on how to debug it further? What can I do about it?
> We can call it a bug then :)

To confirm this is not an issue with the ingress controller itself, please remove the probes in the deployment. The ingress controller should not restart anymore.
> Any tips on how to debug it further? What can I do about it?

SSH to the node where the pod is running and try to execute curl http://<pod IP>:10254/healthz
Do you mean gcloud compute ssh to the node, or kubectl exec -it pod -- /bin/bash into the pod? I assume you mean the gcloud version. So I have to do it against the pod IP, i.e. http://<pod IP>:10254/healthz, right?
> remove the probes in the deployment

I don't see such an option in https://github.com/helm/charts/blob/master/stable/nginx-ingress/values.yaml#L179
How can I do it?
> Do you mean gcloud compute ssh to the node?

SSH to the Kubernetes node where the ingress controller pod is running.
> How can I do it?

By editing the deployment directly (kubectl edit deployment), without using helm.
For people who have similar issues:

# read the pod IP
kubectl --context=xxx-production -n nginx-ingress describe pod nginx-ingress-controller-xxxxx
# log in to the node
gcloud beta compute --project "project-name" ssh --zone "europe-west1-b" "gke-production-pool-1-xxxxx"
curl -v http://IP:10254/healthz

Edit the deployment to remove the livenessProbe (see the sketch after these steps for what the probe blocks look like):
kubectl --context=xxx-production -n nginx-ingress edit deployment nginx-ingress-controller

In my case I also had to turn off sync in GitOps, because otherwise it detects the difference and restores the probes.
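For reference, this is roughly what the probe blocks look like inside the deployment (the values match the kubectl describe output further down); delete them, or at least raise timeoutSeconds, while debugging:

```yaml
# under spec.template.spec.containers[0] of the nginx-ingress-controller deployment
livenessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 10
  timeoutSeconds: 1        # the 1s timeout is what the controller misses under CPU starvation
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
```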
Ha, I found out when it is happening!
curl -v http://10.52.3.15:10254/healthz
* Trying 10.52.3.15...
* TCP_NODELAY set
* Connected to 10.52.3.15 (10.52.3.15) port 10254 (#0)
> GET /healthz HTTP/1.1
> Host: 10.52.3.15:10254
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 07 Nov 2019 16:01:23 GMT
< Content-Length: 84
<
[+]ping ok
[-]nginx-ingress-controller failed: reason withheld
healthz check failed
* Connection #0 to host 10.52.3.15 left intact
* Closing connection 0
After a while there is no connection to the nginx-ingress controller at all:
curl -v http://10.52.3.15:10254/healthz
* Trying 10.52.3.15...
* TCP_NODELAY set
* connect to 10.52.3.15 port 10254 failed: Connection refused
* Failed to connect to 10.52.3.15 port 10254: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 10.52.3.15 port 10254: Connection refused
kubectl --context=etingroup-production -n nginx-ingress describe pod nginx-ingress-controller-f8f7c9b69-zkkgx | grep IP:
IP: 10.52.3.15
kubectl --context=etingroup-production -n nginx-ingress describe pod nginx-ingress-controller-f8f7c9b69-zkkgx
Name: nginx-ingress-controller-f8f7c9b69-zkkgx
Namespace: nginx-ingress
Priority: 0
Node: gke-production-pool-1-ce587bf0-g3mh/10.132.0.15
Start Time: Thu, 07 Nov 2019 09:08:09 +0100
Labels: app=nginx-ingress
component=controller
pod-template-hash=f8f7c9b69
release=nginx-ingress
Annotations: <none>
Status: Running
IP: 10.52.3.15
IPs: <none>
Controlled By: ReplicaSet/nginx-ingress-controller-f8f7c9b69
Containers:
nginx-ingress-controller:
Container ID: docker://6429cc9fcfb0d3dbd2832d306fd5deff3d33a980070445dadecc235a1a833725
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.26.1
Image ID: docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:d0b22f715fcea5598ef7f869d308b55289a3daaa12922fa52a1abf17703c88e7
Ports: 80/TCP, 443/TCP
Host Ports: 0/TCP, 0/TCP
Args:
/nginx-ingress-controller
--default-backend-service=nginx-ingress/nginx-ingress-default-backend
--publish-service=nginx-ingress/nginx-ingress-controller
--election-id=ingress-controller-leader
--ingress-class=nginx
--configmap=nginx-ingress/nginx-ingress-controller
State: Running
Started: Thu, 07 Nov 2019 17:17:26 +0100
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Thu, 07 Nov 2019 17:16:41 +0100
Finished: Thu, 07 Nov 2019 17:17:14 +0100
Ready: False
Restart Count: 32
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: nginx-ingress-controller-f8f7c9b69-zkkgx (v1:metadata.name)
POD_NAMESPACE: nginx-ingress (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-token-n59h8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nginx-ingress-token-n59h8:
Type: Secret (a volume populated by a Secret)
SecretName: nginx-ingress-token-n59h8
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 17m (x11 over 4h8m) kubelet, gke-production-pool-1-ce587bf0-g3mh Liveness probe failed: Get http://10.52.3.15:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 16m (x12 over 4h8m) kubelet, gke-production-pool-1-ce587bf0-g3mh Readiness probe failed: Get http://10.52.3.15:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 16m (x14 over 4h7m) kubelet, gke-production-pool-1-ce587bf0-g3mh Readiness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 16m kubelet, gke-production-pool-1-ce587bf0-g3mh Readiness probe failed: Get http://10.52.3.15:10254/healthz: read tcp 10.52.3.1:51828->10.52.3.15:10254: read: connection reset by peer
Warning Unhealthy 15m (x58 over 4h6m) kubelet, gke-production-pool-1-ce587bf0-g3mh Readiness probe failed: Get http://10.52.3.15:10254/healthz: dial tcp 10.52.3.15:10254: connect: connection refused
Warning Unhealthy 15m (x65 over 4h6m) kubelet, gke-production-pool-1-ce587bf0-g3mh Liveness probe failed: Get http://10.52.3.15:10254/healthz: dial tcp 10.52.3.15:10254: connect: connection refused
Normal Killing 15m (x25 over 4h8m) kubelet, gke-production-pool-1-ce587bf0-g3mh Container nginx-ingress-controller failed liveness probe, will be restarted
Normal Pulled 15m (x25 over 4h7m) kubelet, gke-production-pool-1-ce587bf0-g3mh Container image "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.26.1" already present on machine
Normal Created 15m (x26 over 8h) kubelet, gke-production-pool-1-ce587bf0-g3mh Created container nginx-ingress-controller
Normal Started 15m (x26 over 8h) kubelet, gke-production-pool-1-ce587bf0-g3mh Started container nginx-ingress-controller
Warning BackOff 2m27s (x209 over 4h2m) kubelet, gke-production-pool-1-ce587bf0-g3mh Back-off restarting failed container
kubectl --context=etingroup-production -n nginx-ingress describe node
Name: gke-production-pool-1-ce587bf0-g3mh
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/fluentd-ds-ready=true
beta.kubernetes.io/instance-type=n1-standard-2
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=pool-1
cloud.google.com/gke-os-distribution=cos
failure-domain.beta.kubernetes.io/region=europe-west1
failure-domain.beta.kubernetes.io/zone=europe-west1-b
kubernetes.io/arch=amd64
kubernetes.io/hostname=gke-production-pool-1-ce587bf0-g3mh
kubernetes.io/os=linux
Annotations: container.googleapis.com/instance_id: 1804273139999206952
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 04 Nov 2019 16:07:31 +0100
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
FrequentDockerRestart False Thu, 07 Nov 2019 17:19:37 +0100 Thu, 07 Nov 2019 09:08:05 +0100 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Thu, 07 Nov 2019 17:19:37 +0100 Thu, 07 Nov 2019 09:08:05 +0100 NoFrequentContainerdRestart containerd is functioning properly
KernelDeadlock False Thu, 07 Nov 2019 17:19:37 +0100 Thu, 07 Nov 2019 09:08:05 +0100 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Thu, 07 Nov 2019 17:19:37 +0100 Thu, 07 Nov 2019 09:08:05 +0100 FilesystemIsNotReadOnly Filesystem is not read-only
CorruptDockerOverlay2 False Thu, 07 Nov 2019 17:19:37 +0100 Thu, 07 Nov 2019 09:08:05 +0100 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Thu, 07 Nov 2019 17:19:37 +0100 Thu, 07 Nov 2019 09:08:05 +0100 NoFrequentUnregisterNetDevice node is functioning properly
FrequentKubeletRestart False Thu, 07 Nov 2019 17:19:37 +0100 Thu, 07 Nov 2019 09:08:05 +0100 NoFrequentKubeletRestart kubelet is functioning properly
NetworkUnavailable False Mon, 04 Nov 2019 16:07:47 +0100 Mon, 04 Nov 2019 16:07:47 +0100 RouteCreated RouteController created a route
MemoryPressure False Thu, 07 Nov 2019 17:19:43 +0100 Thu, 07 Nov 2019 09:08:07 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 07 Nov 2019 17:19:43 +0100 Thu, 07 Nov 2019 09:08:07 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 07 Nov 2019 17:19:43 +0100 Thu, 07 Nov 2019 09:08:07 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 07 Nov 2019 17:19:43 +0100 Thu, 07 Nov 2019 09:08:07 +0100 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.132.0.15
..........
Capacity:
attachable-volumes-gce-pd: 127
cpu: 2
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 7657476Ki
pods: 110
Allocatable:
attachable-volumes-gce-pd: 127
cpu: 1930m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 5777412Ki
pods: 110
System Info:
..........
Kernel Version: 4.14.145+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.7
Kubelet Version: v1.14.7-gke.14
Kube-Proxy Version: v1.14.7-gke.14
PodCIDR: 10.52.3.0/24
..........
Non-terminated Pods: (22 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
argocd argocd-application-controller-674fd875bf-djcf2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
argocd argocd-dex-server-9d47975b-k54n8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
argocd argocd-redis-fc585c648-2l8j9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
argocd argocd-repo-server-564cf688d9-rt56k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
argocd argocd-server-6dcc6d6dcc-6p54f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
cert-manager cert-manager-756d9f56d6-x99t4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
cert-manager cert-manager-cainjector-74bb68d67c-vhsct 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
cert-manager cert-manager-webhook-645b8bdb7-r5kl7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
etingroup etingroup-sync-5c79979f67-5nxzm 100m (5%) 2 (103%) 2Gi (36%) 2Gi (36%) 8h
kube-system event-exporter-v0.2.5-7df89f4b8f-hmzxv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
kube-system fluentd-gcp-scaler-54ccb89d5-hkj5v 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
kube-system fluentd-gcp-v3.1.1-4h8nc 100m (5%) 1 (51%) 200Mi (3%) 500Mi (8%) 2d20h
kube-system heapster-64bb668f65-ptq5b 50m (2%) 50m (2%) 92560Ki (1%) 92560Ki (1%) 8h
kube-system kube-dns-5877696fb4-vvtsm 260m (13%) 0 (0%) 110Mi (1%) 170Mi (3%) 8h
kube-system kube-dns-autoscaler-85f8bdb54-wz9pq 20m (1%) 0 (0%) 10Mi (0%) 0 (0%) 8h
kube-system kube-proxy-gke-production-pool-1-ce587bf0-g3mh 100m (5%) 0 (0%) 0 (0%) 0 (0%) 8h
kube-system l7-default-backend-fd59995cd-9fxbv 10m (0%) 10m (0%) 20Mi (0%) 20Mi (0%) 8h
kube-system metrics-server-v0.3.1-57c75779f-ndkkj 48m (2%) 143m (7%) 105Mi (1%) 355Mi (6%) 8h
kube-system prometheus-to-sd-84kw6 1m (0%) 3m (0%) 20Mi (0%) 20Mi (0%) 34h
kube-system stackdriver-metadata-agent-cluster-level-744579f94d-56th5 40m (2%) 0 (0%) 50Mi (0%) 0 (0%) 8h
nginx-ingress nginx-ingress-controller-f8f7c9b69-zkkgx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
nginx-ingress nginx-ingress-default-backend-55f99b74bc-smmjk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 729m (37%) 3206m (166%)
memory 2717072Ki (47%) 3280272Ki (56%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-gce-pd 0 0
Events: <none>
I have one URL which starts a process that runs for about 20-30 minutes. The browser request always times out before it finishes, but that is not a problem for the user; the link is only there to manually trigger a synchronisation.
So I think the nginx-ingress controller stops working right after somebody opens this link. That is what I discovered by experimenting manually while it was happening.
kubectl --context=etingroup-production top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-production-pool-1-ce587bf0-g3mh 2001m 103% 2298Mi 40%
So to sum up: somebody opens the URL and the pod starts processing data, which takes 20-30 minutes; shortly afterwards the nginx-ingress controller starts failing. How can I debug it further? How can I fix it?
Hmm, alternatively: could it be that the nginx-ingress-controller has no CPU request with the default settings, so while the data processing takes all the free CPU (because of its high limits), the controller becomes too slow to answer the healthz probe?
I can confirm that after changing the CPU limits for the data-processing pod, nginx-ingress works stably.
I think there should be a bold warning about this in the nginx-ingress Helm installation docs and everywhere else relevant:
If other processes take 100% of the node's CPU, the nginx-ingress-controller will go down, which means nobody will be able to connect to any of your HTTP(S) ingresses, and therefore to any of your systems.
The default installation does not set a CPU request. This means that if other processes take all the free CPU (because of their limits), nginx-ingress replies too slowly and the /healthz check on port 10254 fails.
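What I ended up doing is roughly the following (a sketch, assuming the stable/nginx-ingress chart's controller.resources value; the numbers are just what worked for me, not a general recommendation):

```yaml
controller:
  resources:
    requests:
      cpu: 100m       # example value: enough headroom to answer /healthz even when the node is busy
      memory: 128Mi   # example value
```

Capping the CPU limit of the data-processing pod, as mentioned above, attacks the same problem from the other side.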
While it is not a bug, it is a serious issue. It took me about half a year to figure this out, and nobody really knew why it was happening. This is not the first issue I have reported about it.
@kwladyka what are you expecting exactly? From your previous comment, this seems to be an issue with how limits are set in your cluster; there is nothing related to the ingress controller itself.
> The default installation does not set a CPU request.
This is on purpose. You should set limits only when you get enough information about the traffic you need to handle, not before.
> It means that if other processes take all the free CPU because of their limits, then nginx-ingress replies too slowly and the /healthz check on port 10254 fails.
The same thing happens to any other pod running without limits or using the BestEffort QoS class.
Also, maybe your nodes are too small for the things you are running?
> what are you expecting exactly?

A CPU request, so that the nginx-ingress-controller can always answer the /healthz health check on port 10254 and does not take down every service in the cluster by restarting constantly. It should probably be a really low value, but it should exist. The controller restarts constantly when its CPU request is 0 and the machine is at 100% CPU during the /healthz check. And while the nginx-ingress-controller is failing because of this, other pods fail too because they can't connect to foo.bar.pl, so all services start to fail and restart in a loop, and it very quickly turns into chaos.
> The same thing happens to any other pod running without limits or using the BestEffort QoS class.

It is very non-obvious, because all other processes work as they should; only the nginx-ingress-controller fails, at least in my case. So it does not happen to the other processes in k8s.
> You should set limits

The issue is not about limits on the nginx-ingress-controller, but about its requests, or, seen from the other side, about the limits of other pods that consume the free CPU the controller needs.
When 100% of the node's CPU is consumed, the nginx-ingress-controller fails almost immediately, because it has no CPU request and therefore takes too long to answer the /healthz probe (the timeout is 1 second, if I remember correctly).
So to me it sounds like everybody should set a CPU request for the nginx-ingress-controller, or never let pods on a node use 100% of the CPU, which sounds impossible to control.
Note: it may be possible to use a LimitRange to define minimum requirements and avoid the 100% CPU problem: https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/cpu-constraint-namespace/#create-a-limitrange-and-a-pod
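A minimal sketch of such a LimitRange, assuming it is created in the nginx-ingress namespace (the name and values are made up; see the linked page for the exact semantics):

```yaml
# Hypothetical LimitRange: containers in this namespace that declare no CPU
# request/limit get these defaults applied, so the controller is no longer a
# BestEffort pod with a 0 CPU request that loses out when the node is saturated.
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults            # made-up name
  namespace: nginx-ingress
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m               # applied to containers that declare no CPU request
      default:
        cpu: 500m               # applied to containers that declare no CPU limit
```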