Describe the bug
When Prometheus crashes for any reason it is unable to come back up again.
Version of Helm and Kubernetes:
Which chart:
[stable/prometheus-operator]
What happened:
Any crash of the Prometheus pod apparently corrupts the Prometheus WAL. Once this happens, the pod is unable to recover in time: the liveness probe kills it before it can work through the corrupt WAL.
The following options were used to install the chart:
Name: pulse-monitor
Namespace: monitoring
Persistent volume configuration:
prometheusSpec:
  storageSpec:
    volumeClaimTemplate:
      metadata:
        name: pvc
      spec:
        storageClassName: managed-premium
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
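For completeness, the install itself was nothing special; a minimal sketch of the command (assuming Helm 3 and a values.yaml holding the block above; the exact flags may have differed):

$ helm install pulse-monitor stable/prometheus-operator \
    --namespace monitoring \
    -f values.yaml    # values.yaml contains the prometheusSpec.storageSpec block shown above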
Once the pod has crashed and restarts, it tries to come up but its health probes fail with HTTP 503 (not available).
See the broken state below: the Prometheus pod is stuck at 2/3 containers ready.
$ kubectl get po -n monitoring
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-pulse-monitor-prometheus-o-alertmanager-0    2/2     Running   0          4d19h
prometheus-pulse-monitor-prometheus-o-prometheus-0        2/3     Running   0          3m33s
pulse-monitor-grafana-f974449-vmbms                       2/2     Running   0          4d19h
pulse-monitor-kube-state-metrics-84ff56f86b-4rtln         1/1     Running   0          4d19h
pulse-monitor-prometheus-node-exporter-k2b6j              1/1     Running   0          3d23h
pulse-monitor-prometheus-node-exporter-lw4bg              1/1     Running   0          4d19h
pulse-monitor-prometheus-node-exporter-m9b69              1/1     Running   0          4d19h
pulse-monitor-prometheus-node-exporter-p48fx              1/1     Running   0          4d19h
pulse-monitor-prometheus-o-operator-57c9cbbdbc-7cgff      2/2     Running   0          4d19h
Describing the broken pod shows something like this:
Name:           prometheus-pulse-monitor-prometheus-o-prometheus-0
Namespace:      monitoring
Priority:       0
Node:           aks-pulsedev01-14986555-vmss000000/172.15.20.4
Start Time:     Tue, 10 Mar 2020 09:57:09 +1100
Labels:         app=prometheus
                controller-revision-hash=prometheus-pulse-monitor-prometheus-o-prometheus-8d546bfb4
                prometheus=pulse-monitor-prometheus-o-prometheus
                statefulset.kubernetes.io/pod-name=prometheus-pulse-monitor-prometheus-o-prometheus-0
Annotations:    <none>
Status:         Running
IP:             172.15.20.40
IPs:            <none>
Controlled By:  StatefulSet/prometheus-pulse-monitor-prometheus-o-prometheus
Containers:
  prometheus:
    Container ID:  docker://8699fb42be37a1cd5815ff354cf85ae30087842a11d41b27b04e21dbd2b6fc32
    Image:         quay.io/prometheus/prometheus:v2.15.2
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:914525123cf76a15a6aaeac069fcb445ce8fb125113d1bc5b15854bc1e8b6353
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=10d
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.external-url=http://pulse-monitor-prometheus-o-prometheus.monitoring:9090
      --web.route-prefix=/
    State:          Running
      Started:      Tue, 10 Mar 2020 09:59:46 +1100
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:      http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/rules/prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0 from prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0 (rw)
      /prometheus from prometheus-pulse-monitor-prometheus-o-prometheus-db (rw,path="prometheus-db")
      /var/run/secrets/kubernetes.io/serviceaccount from pulse-monitor-prometheus-o-prometheus-token-mwxld (ro)
  prometheus-config-reloader:
    Container ID:  docker://c3dae7bda4ef4dfc6bb186865e0ac558f5d9d23fb4ac30ee603d09b499021620
    Image:         quay.io/coreos/prometheus-config-reloader:v0.36.0
    Image ID:      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:74cb2dcf9d8c61f90fb28b82a0358962fbda956a798c762e0ddf1214bb7a9955
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --log-format=logfmt
      --reload-url=http://127.0.0.1:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    State:          Running
      Started:      Tue, 10 Mar 2020 09:59:54 +1100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:
      POD_NAME:  prometheus-pulse-monitor-prometheus-o-prometheus-0 (v1:metadata.name)
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from pulse-monitor-prometheus-o-prometheus-token-mwxld (ro)
  rules-configmap-reloader:
    Container ID:  docker://d492c9db712280f7ef4fddf1cf84bdbcf950ee25bd258da4451321d3a4594307
    Image:         quay.io/coreos/configmap-reload:v0.0.1
    Image ID:      docker-pullable://quay.io/coreos/configmap-reload@sha256:e2fd60ff0ae4500a75b80ebaa30e0e7deba9ad107833e8ca53f0047c42c5a057
    Port:          <none>
    Host Port:     <none>
    Args:
      --webhook-url=http://127.0.0.1:9090/-/reload
      --volume-dir=/etc/prometheus/rules/prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0
    State:          Running
      Started:      Tue, 10 Mar 2020 09:59:59 +1100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/prometheus/rules/prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0 from prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from pulse-monitor-prometheus-o-prometheus-token-mwxld (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  prometheus-pulse-monitor-prometheus-o-prometheus-db:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-pulse-monitor-prometheus-o-prometheus-db-prometheus-pulse-monitor-prometheus-o-prometheus-0
    ReadOnly:   false
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-pulse-monitor-prometheus-o-prometheus
    Optional:    false
  tls-assets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-pulse-monitor-prometheus-o-prometheus-tls-assets
    Optional:    false
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0
    Optional:  false
  pulse-monitor-prometheus-o-prometheus-token-mwxld:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pulse-monitor-prometheus-o-prometheus-token-mwxld
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                      From                                          Message
  ----     ------                  ----                     ----                                          -------
  Normal   Scheduled               6m15s                    default-scheduler                             Successfully assigned monitoring/prometheus-pulse-monitor-prometheus-o-prometheus-0 to aks-pulsedev01-14986555-vmss000000
  Warning  FailedAttachVolume      6m15s                    attachdetach-controller                       Multi-Attach error for volume "pvc-fff51b37-c203-48d1-8d58-71dfe4ce3880" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  5m3s                     attachdetach-controller                       AttachVolume.Attach succeeded for volume "pvc-fff51b37-c203-48d1-8d58-71dfe4ce3880"
  Warning  FailedMount             4m12s                    kubelet, aks-pulsedev01-14986555-vmss000000   Unable to mount volumes for pod "prometheus-pulse-monitor-prometheus-o-prometheus-0_monitoring(a3265ac5-0337-4b12-8725-9b55b838dcaa)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"prometheus-pulse-monitor-prometheus-o-prometheus-0". list of unmounted volumes=[prometheus-pulse-monitor-prometheus-o-prometheus-db]. list of unattached volumes=[prometheus-pulse-monitor-prometheus-o-prometheus-db config tls-assets config-out prometheus-pulse-monitor-prometheus-o-prometheus-rulefiles-0 pulse-monitor-prometheus-o-prometheus-token-mwxld]
  Normal   Pulling                 3m56s                    kubelet, aks-pulsedev01-14986555-vmss000000   Pulling image "quay.io/prometheus/prometheus:v2.15.2"
  Normal   Pulled                  3m45s                    kubelet, aks-pulsedev01-14986555-vmss000000   Successfully pulled image "quay.io/prometheus/prometheus:v2.15.2"
  Normal   Created                 3m38s                    kubelet, aks-pulsedev01-14986555-vmss000000   Created container prometheus
  Normal   Started                 3m38s                    kubelet, aks-pulsedev01-14986555-vmss000000   Started container prometheus
  Normal   Pulling                 3m38s                    kubelet, aks-pulsedev01-14986555-vmss000000   Pulling image "quay.io/coreos/prometheus-config-reloader:v0.36.0"
  Normal   Pulled                  3m32s                    kubelet, aks-pulsedev01-14986555-vmss000000   Successfully pulled image "quay.io/coreos/prometheus-config-reloader:v0.36.0"
  Normal   Created                 3m30s                    kubelet, aks-pulsedev01-14986555-vmss000000   Created container prometheus-config-reloader
  Normal   Started                 3m30s                    kubelet, aks-pulsedev01-14986555-vmss000000   Started container prometheus-config-reloader
  Normal   Pulling                 3m30s                    kubelet, aks-pulsedev01-14986555-vmss000000   Pulling image "quay.io/coreos/configmap-reload:v0.0.1"
  Normal   Pulled                  3m26s                    kubelet, aks-pulsedev01-14986555-vmss000000   Successfully pulled image "quay.io/coreos/configmap-reload:v0.0.1"
  Normal   Created                 3m25s                    kubelet, aks-pulsedev01-14986555-vmss000000   Created container rules-configmap-reloader
  Normal   Started                 3m25s                    kubelet, aks-pulsedev01-14986555-vmss000000   Started container rules-configmap-reloader
  Warning  Unhealthy               2m27s (x12 over 3m22s)   kubelet, aks-pulsedev01-14986555-vmss000000   Readiness probe failed: HTTP probe failed with statuscode: 503
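The Multi-Attach warning above is the usual symptom of the Azure managed disk still being attached to the node the pod previously ran on; with a ReadWriteOnce disk the new node cannot attach it until Azure finishes detaching (which is why SuccessfulAttachVolume only appears 72 seconds later). A rough way to check which VM still holds the disk (the disk resource ID below is a placeholder you would copy from the PV's source section):

$ kubectl describe pv pvc-fff51b37-c203-48d1-8d58-71dfe4ce3880
    # the Source section shows the Azure disk URI backing this volume
$ az disk show --ids <disk-resource-id-from-above> --query managedBy -o tsv
    # managedBy is the VM / VMSS instance the disk is currently attached to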
When looking at the logs of the prometheus container in the pod, I see the following:
cornelius@namphi-ubuntu:/var/crash$ kubectl logs -f -n monitoring prometheus-pulse-monitor-prometheus-o-prometheus-0 prometheus
level=info ts=2020-03-10T01:06:52.353Z caller=main.go:330 msg="Starting Prometheus" version="(version=2.15.2, branch=HEAD, revision=d9613e5c466c6e9de548c4dae1b9aabf9aaf7c57)"
level=info ts=2020-03-10T01:06:52.353Z caller=main.go:331 build_context="(go=go1.13.5, user=root@688433cf4ff7, date=20200106-14:50:51)"
level=info ts=2020-03-10T01:06:52.353Z caller=main.go:332 host_details="(Linux 4.15.0-1069-azure #74-Ubuntu SMP Fri Feb 7 17:22:24 UTC 2020 x86_64 prometheus-pulse-monitor-prometheus-o-prometheus-0 (none))"
level=info ts=2020-03-10T01:06:52.353Z caller=main.go:333 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-03-10T01:06:52.353Z caller=main.go:334 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-03-10T01:06:52.456Z caller=main.go:648 msg="Starting TSDB ..."
level=info ts=2020-03-10T01:06:52.456Z caller=web.go:506 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-03-10T01:06:52.495Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1583637494073 maxt=1583647200000 ulid=01E2WWVRZ0PXGCKYH9CQ7PV6WH
level=info ts=2020-03-10T01:06:52.534Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1583647200000 maxt=1583712000000 ulid=01E2YKW6FP8AP7YKVDXF8KTRES
level=info ts=2020-03-10T01:06:52.573Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1583733600000 maxt=1583740800000 ulid=01E2Z8BEXGZQPHV850CARM95CV
level=info ts=2020-03-10T01:06:52.573Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1583712000000 maxt=1583733600000 ulid=01E2Z8HXVKZV5HJYKA82HYJ2R2
level=info ts=2020-03-10T01:06:52.573Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1583740800000 maxt=1583748000000 ulid=01E2ZF765BDTS0H5C4Y3RSGSAR
level=info ts=2020-03-10T01:07:32.098Z caller=head.go:584 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-03-10T01:09:28.424Z caller=head.go:608 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2020-03-10T01:09:42.418Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=62 maxSegment=242
level=info ts=2020-03-10T01:09:56.439Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=63 maxSegment=242
level=info ts=2020-03-10T01:10:10.551Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=64 maxSegment=242
level=info ts=2020-03-10T01:10:25.043Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=65 maxSegment=242
level=info ts=2020-03-10T01:10:32.093Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=66 maxSegment=242
level=info ts=2020-03-10T01:10:46.789Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=67 maxSegment=242
level=info ts=2020-03-10T01:11:02.835Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=68 maxSegment=242
level=info ts=2020-03-10T01:11:20.737Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=69 maxSegment=242
level=info ts=2020-03-10T01:11:38.966Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=70 maxSegment=242
level=warn ts=2020-03-10T01:11:46.148Z caller=main.go:494 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2020-03-10T01:11:46.148Z caller=main.go:517 msg="Stopping scrape discovery manager..."
level=info ts=2020-03-10T01:11:46.148Z caller=main.go:531 msg="Stopping notify discovery manager..."
level=info ts=2020-03-10T01:11:46.148Z caller=main.go:553 msg="Stopping scrape manager..."
level=info ts=2020-03-10T01:11:46.148Z caller=main.go:513 msg="Scrape discovery manager stopped"
level=info ts=2020-03-10T01:11:46.148Z caller=main.go:734 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-03-10T01:11:46.148Z caller=main.go:527 msg="Notify discovery manager stopped"
level=info ts=2020-03-10T01:11:46.148Z caller=main.go:547 msg="Scrape manager stopped"
level=info ts=2020-03-10T01:11:46.156Z caller=kubernetes.go:190 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-10T01:11:46.158Z caller=kubernetes.go:190 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-10T01:11:46.159Z caller=kubernetes.go:190 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-10T01:11:46.160Z caller=kubernetes.go:190 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-10T01:11:46.161Z caller=kubernetes.go:190 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
Essentially, it looks like the pod tries to replay the WAL after the crash, but it cannot finish within the time allowed by the liveness probe, so it goes into a never-ending restart loop.
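As a stop-gap only (it throws away the samples that existed only in the WAL, roughly the most recent couple of hours), one workaround is to remove the WAL directory so Prometheus starts from its last persisted block; a sketch, assuming you can exec into the container during one of its restart windows:

$ kubectl exec -n monitoring prometheus-pulse-monitor-prometheus-o-prometheus-0 \
    -c prometheus -- rm -rf /prometheus/wal
    # /prometheus is the --storage.tsdb.path shown in the pod args above
$ kubectl delete pod -n monitoring prometheus-pulse-monitor-prometheus-o-prometheus-0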
What you expected to happen:
The pod to restart successfully. I suspect this is due to some underlying PVC issue on AKS.
How to reproduce it (as minimally and precisely as possible):
I have reliably reproduced this on several AKS clusters. Note that the pod needs to run out of memory or suffer some other hard crash for the issue to appear; killing/deleting the pod does not reproduce it.
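For anyone trying to reproduce it, one (hedged) way to force that kind of hard crash is to give the Prometheus container a deliberately tight memory limit through the chart's prometheusSpec.resources; the 400Mi figure is an arbitrary example, not a recommendation:

prometheusSpec:
  resources:
    limits:
      memory: 400Mi    # low enough that scraping / WAL replay gets the container OOM-killed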
Anything else we need to know:
To be honest, I don't think this is an issue with the Helm install/chart; I am just reaching out to see if anyone has some guidance for me. I will probably switch to a VM-based Prometheus installation shortly, as this is not workable on AKS as it stands.
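One setting that may shorten the WAL replay on restart (and therefore the window in which the liveness probe can kill the pod) is WAL compression. If I read the operator docs correctly, the Prometheus CRD exposes this as walCompression and the chart passes it through under prometheusSpec; treat the exact field name as an assumption and verify it against your operator/chart version:

prometheusSpec:
  walCompression: true    # should map to Prometheus's --storage.tsdb.wal-compression flag (2.11+)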
10 days and not a single response. Time to close this and move on.
Decided to use a VM-based Prometheus instead.
I am facing the same issue too. @Namphibian, did you find any solution?
@yogesh9391 I finally managed it with some PVC values. I will post a solution here a little later.
@Namphibian, would you be able to explain what you changed?