Charts: prometheus-server CrashLoopBackOff

Created on 22 Jul 2019 · 16 comments · Source: helm/charts

Describe the bug
The prometheus-server pod goes into CrashLoopBackOff immediately after installing the stable/prometheus chart.

Version of Helm and Kubernetes:
Helm Version: v2.14.2
Kubernetes Version: v1.15.0

Which chart:
stable/prometheus

What happened:
The pod prometheus-server-66fbdff99b-z4vbj is stuck in the CrashLoopBackOff state.

What you expected to happen:
The prometheus-server pod is supposed to start and stay running.

How to reproduce it (as minimally and precisely as possible):
helm install stable/prometheus --name prometheus --namespace prometheus --set server.global.scrape_interval=5s,server.global.evaluation_interval=5s
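A quick way to surface the actual error behind the CrashLoopBackOff is to check the crashed container's previous log rather than the pod status alone (pod and namespace names below are the ones from this report and will differ per cluster):

kubectl -n prometheus describe pod prometheus-server-66fbdff99b-z4vbj                              # shows CrashLoopBackOff and restart count
kubectl -n prometheus logs prometheus-server-66fbdff99b-z4vbj -c prometheus-server --previous      # shows the error that crashed the container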

Anything else we need to know:

Labels: lifecycle/stale

Most helpful comment

I solved this problem in the following way.

kubectl edit deploy prometheus-server -n prometheus

from

      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534

to

      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0

Honestly, I am not sure this change won't cause other problems, but it works for now.
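If you want the same change to survive a helm upgrade, it can also be applied through chart values instead of kubectl edit. A minimal sketch, assuming your chart version exposes server.securityContext (the key names mirror the defaults shown above); the values file name is just an example:

# values-run-as-root.yaml
server:
  securityContext:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0
    runAsNonRoot: false   # must be set explicitly; the default is true and conflicts with runAsUser: 0

helm upgrade prometheus stable/prometheus --namespace prometheus -f values-run-as-root.yaml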

All 16 comments

I have the same issue; here is the describe output for the pod:
Name: prometheus-server-55479c9d54-6gh9t
Namespace: monitoring
Priority: 0
PriorityClassName:
Node: phx3187268/100.111.143.19
Start Time: Tue, 30 Jul 2019 14:38:10 -0700
Labels: app=prometheus
chart=prometheus-8.15.0
component=server
heritage=Tiller
pod-template-hash=1103575810
release=prometheus
Annotations:
Status: Running
IP: 192.168.0.30
Controlled By: ReplicaSet/prometheus-server-55479c9d54
Containers:
prometheus-server-configmap-reload:
Container ID: docker://405fd0c96cb567d3182a7e6d2baa1d6ff5c7ae062fe79f7f3b8ceebc3032ec46
Image: jimmidyson/configmap-reload:v0.2.2
Image ID: docker-pullable://jimmidyson/configmap-reload@sha256:befec9f23d2a9da86a298d448cc9140f56a457362a7d9eecddba192db1ab489e
Port:
Host Port:
Args:
--volume-dir=/etc/config
--webhook-url=http://127.0.0.1:9090/-/reload
State: Running
Started: Tue, 30 Jul 2019 14:38:25 -0700
Ready: True
Restart Count: 0
Environment:
Mounts:
/etc/config from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-kb7pt (ro)
prometheus-server:
Container ID: docker://d5d45806e69bda9abfad75a6210d03ad7d6e9ecbc292de51af56440fc95cf162
Image: prom/prometheus:v2.11.1
Image ID: docker-pullable://prom/prometheus@sha256:8f34c18cf2ccaf21e361afd18e92da2602d0fa23a8917f759f906219242d8572
Port: 9090/TCP
Host Port: 0/TCP
Args:
--storage.tsdb.retention.time=15d
--config.file=/etc/config/prometheus.yml
--storage.tsdb.path=/data
--web.console.libraries=/etc/prometheus/console_libraries
--web.console.templates=/etc/prometheus/consoles
--web.enable-lifecycle
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 30 Jul 2019 14:38:52 -0700
Finished: Tue, 30 Jul 2019 14:38:52 -0700
Ready: False
Restart Count: 2
Liveness: http-get http://:9090/-/healthy delay=30s timeout=30s period=10s #success=1 #failure=3
Readiness: http-get http://:9090/-/ready delay=30s timeout=30s period=10s #success=1 #failure=3
Environment:
Mounts:
/data from storage-volume (rw)
/etc/config from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-kb7pt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-server
Optional: false
storage-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pvc-prometheus
ReadOnly: false
prometheus-server-token-kb7pt:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-server-token-kb7pt
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 47s default-scheduler Successfully assigned monitoring/prometheus-server-55479c9d54-6gh9t to phx3187268
Normal Pulling 44s kubelet, phx3187268 pulling image "jimmidyson/configmap-reload:v0.2.2"
Normal Pulled 32s kubelet, phx3187268 Successfully pulled image "jimmidyson/configmap-reload:v0.2.2"
Normal Created 32s kubelet, phx3187268 Created container
Normal Started 32s kubelet, phx3187268 Started container
Normal Pulling 32s kubelet, phx3187268 pulling image "prom/prometheus:v2.11.1"
Normal Pulled 26s kubelet, phx3187268 Successfully pulled image "prom/prometheus:v2.11.1"
Warning BackOff 20s (x3 over 23s) kubelet, phx3187268 Back-off restarting failed container
Normal Created 5s (x3 over 25s) kubelet, phx3187268 Created container
Normal Started 5s (x3 over 25s) kubelet, phx3187268 Started container
Normal Pulled 5s (x2 over 24s) kubelet, phx3187268 Container image "prom/prometheus:v2.11.1" already present on machine
Warning DNSConfigForming 4s (x8 over 45s) kubelet, phx3187268 Search Line limits were exceeded, some search paths have been omitted, the applied search line is: monitoring.svc.cluster.local svc.cluster.local cluster.local devweblogicphx.oraclevcn.com subnet3ad3phx.devweblogicphx.oraclevcn.com us.oracle.com

I'm seeing very similar behavior on my dask-scheduler pod when installing stable/dask; see the ticket I opened, #15979.

Here is the log:
[opc@marina-kogan-sandbox prometheus]$ kubectl -n monitoring logs prometheus-server-5bc5568444-5s8bk -c prometheus-server
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:330 build_context="(go=go1.12.7, user=root@d94406f2bb6f, date=20190710-13:51:17)"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:331 host_details="(Linux 4.14.35-1902.2.0.el7uek.x86_64 #2 SMP Fri Jun 14 21:15:44 PDT 2019 x86_64 prometheus-server-5bc5568444-5s8bk (none))"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-07-31T18:24:39.387Z caller=main.go:652 msg="Starting TSDB ..."
level=info ts=2019-07-31T18:24:39.387Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:521 msg="Stopping scrape discovery manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:535 msg="Stopping notify discovery manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:557 msg="Stopping scrape manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:531 msg="Notify discovery manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:517 msg="Scrape discovery manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:551 msg="Scrape manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=manager.go:776 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=manager.go:782 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:722 msg="Notifier manager stopped"
level=error ts=2019-07-31T18:24:39.391Z caller=main.go:731 err="opening storage failed: lock DB directory: open /data/lock: permission denied"
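The error points at the ownership of the mounted data volume rather than at Prometheus itself: the server runs as UID/GID 65534, and fsGroup is not honoured by every volume type (hostPath volumes and many NFS exports keep whatever ownership the directory already has). A rough way to check, using the claim name from the describe output above; the PV name and node path are placeholders:

kubectl -n monitoring get pvc pvc-prometheus       # find the bound PersistentVolume
kubectl get pv <pv-name> -o yaml                   # see how and where it is provisioned
# for hostPath/local volumes, on the node that backs the PV:
ls -ld /path/backing/the/pv                        # should be writable by 65534:65534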

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

Having the same issue.

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m21s default-scheduler Successfully assigned monitoring/prometheus-server-75959db9-5v6dm to docker02
Normal Pulled 7m20s kubelet, docker02 Container image "jimmidyson/configmap-reload:v0.2.2" already present on machine
Normal Created 7m20s kubelet, docker02 Created container prometheus-server-configmap-reload
Normal Started 7m20s kubelet, docker02 Started container prometheus-server-configmap-reload
Normal Pulled 6m30s (x4 over 7m20s) kubelet, docker02 Container image "prom/prometheus:v2.11.1" already present on machine
Normal Created 6m30s (x4 over 7m20s) kubelet, docker02 Created container prometheus-server
Normal Started 6m30s (x4 over 7m20s) kubelet, docker02 Started container prometheus-server
Warning BackOff 2m19s (x27 over 7m19s) kubelet, docker02 Back-off restarting failed container

Also seeing this when simply running helm install stable/prometheus.

Helm Version: v2.14.3
Kubernetes Version: v1.14.6

level=info ts=2019-09-06T21:03:04.361Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:330 build_context="(go=go1.12.7, user=root@d94406f2bb6f, date=20190710-13:51:17)"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:331 host_details="(Linux 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 kissing-warthog-prometheus-server-b94c6d879-n8jj9 (none))"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-09-06T21:03:04.362Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-09-06T21:03:04.362Z caller=main.go:652 msg="Starting TSDB ..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:521 msg="Stopping scrape discovery manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:535 msg="Stopping notify discovery manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:557 msg="Stopping scrape manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:531 msg="Notify discovery manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:517 msg="Scrape discovery manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:551 msg="Scrape manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=manager.go:776 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=manager.go:782 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:722 msg="Notifier manager stopped"
level=error ts=2019-09-06T21:03:04.364Z caller=main.go:731 err="opening storage failed: lock DB directory: open /data/lock: permission denied"

I'm seeing the same problem. Is there a solution or workaround?

Same problem here using helm install stable/prometheus.

caller=main.go:731 err="opening storage failed: lock DB directory: open /data/lock: permission denied"

I tried using "server.skipTSDBLock=true". It bypasses that step, but fails at the next one:

main.go:731 err="opening storage failed: create dir: mkdir /data/wal: permission denied"

Then I tried setting server.persistentVolume.mountPath=/tmp as a test, and it also fails:

main.go:731 err="opening storage failed: create dir: mkdir /tmp/wal: permission denied"
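All three failures come from the same place: whatever backs the persistent volume is not writable by UID 65534, and changing the mount path only moves where that same volume is mounted. An alternative to running the whole pod as root is an initContainer that fixes the ownership of the data directory once. A rough sketch added under spec.template.spec of the prometheus-server Deployment (storage-volume is the volume name the chart already uses; whether your chart version exposes a value for extra init containers varies, so this is shown as a direct edit):

      initContainers:
        - name: init-chown-data
          image: busybox:1.31
          # make the data dir writable by the UID/GID Prometheus runs as (nobody)
          command: ["chown", "-R", "65534:65534", "/data"]
          securityContext:
            runAsUser: 0
          volumeMounts:
            - name: storage-volume
              mountPath: /data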

I was seeing the same error. I was able to resolve the issue by applying the workaround given here.
_Note: Replace prometheus-alertmanager with prometheus-server in the workaround steps._

I solved this problem in the following way.

kubectl edit deploy prometheus-server -n prometheus

from

      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534

to

      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0

Honestly, I am not sure this change won't cause other problems, but it works for now.
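The same workaround as a one-liner, for anyone who prefers kubectl patch over kubectl edit. Note that runAsNonRoot must be set to false explicitly here: a merge patch keeps the existing runAsNonRoot: true, and the kubelet refuses to start a container with runAsNonRoot: true and runAsUser: 0 (replacing the whole block as shown above drops the key, which is why that version works):

kubectl -n prometheus patch deploy prometheus-server --type merge \
  -p '{"spec":{"template":{"spec":{"securityContext":{"runAsUser":0,"runAsGroup":0,"fsGroup":0,"runAsNonRoot":false}}}}}'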

I am having the same issue; changing the securityContext does not fix it.
Has anybody found a workaround for this?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

This issue is being automatically closed due to inactivity.

Nice, it worked after setting:

securityContext:
  fsGroup: 0
  runAsGroup: 0
  runAsUser: 0

This fixed my issue also:

securityContext:
  fsGroup: 0
  runAsGroup: 0
  runAsUser: 0

So is there some other part of the setup/config that is expected to be done ahead of time that's missing?
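In many of the failing setups the missing step is preparing the storage itself: the chart runs Prometheus as UID/GID 65534 and expects the data volume to be writable by that user, but volume types that ignore fsGroup (hostPath, many NFS exports) leave the directory owned by root. A rough pre-install step for a hostPath- or NFS-backed PV; the path is a placeholder for whatever your PersistentVolume actually points at:

# run on the node (or NFS server) that backs the PersistentVolume
sudo mkdir -p /path/backing/the/pv
sudo chown -R 65534:65534 /path/backing/the/pv   # 65534 = nobody, the UID/GID the chart runs as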
