Charts: [stable/prometheus-operator] Health check on prometheus pod does not consider DB recovery

Created on 21 Nov 2019 · 7 comments · Source: helm/charts

Describe the bug
I updated prometheus-operator to the latest version. The prometheus pod was restarted and DB recovery began, but then prometheus received a SIGTERM:

Version of Helm and Kubernetes:
> helm version
Client: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}

> kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T16:51:36Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Which chart:
stable/prometheus-operator

What happened:
level=info ts=2019-11-21T17:27:34.326Z caller=main.go:332 msg="Starting Prometheus" version="(version=2.13.1, branch=HEAD, revision=6f92ce56053866194ae5937012c1bec40f1dd1d9)"
level=info ts=2019-11-21T17:27:34.326Z caller=main.go:333 build_context="(go=go1.13.1, user=root@88e419aa1676, date=20191017-13:15:01)"
level=info ts=2019-11-21T17:27:34.326Z caller=main.go:334 host_details="(Linux 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 prometheus-prometheus-0 (none))"
level=info ts=2019-11-21T17:27:34.327Z caller=main.go:335 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-11-21T17:27:34.327Z caller=main.go:336 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-11-21T17:27:34.331Z caller=main.go:657 msg="Starting TSDB ..."
level=info ts=2019-11-21T17:27:34.331Z caller=web.go:450 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-11-21T17:27:34.357Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1570645351042 maxt=1571140800000 ulid=01DQ85T30GF855KY7T1S9NANAD
level=info ts=2019-11-21T17:27:34.372Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1571140800000 maxt=1571724000000 ulid=01DQSB5PK3AXGSYWJ9Q0VM3VY7
level=info ts=2019-11-21T17:27:34.373Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1571724000000 maxt=1572307200000 ulid=01DRAQDPNDNX2N4RJ5VCYB6V03
level=info ts=2019-11-21T17:27:34.375Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1572307200000 maxt=1572890400000 ulid=01DRW3KHGYR9Y89M8VN1K20BJJ
level=info ts=2019-11-21T17:27:34.383Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1573473600000 maxt=1573480800000 ulid=01DT2DZHESN3F0STHERK8E54S5
level=info ts=2019-11-21T17:27:34.388Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1572890400000 maxt=1573473600000 ulid=01DT2E4NG4S8SNBB2GBXP7N6EB
level=info ts=2019-11-21T17:27:34.398Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1574186400000 maxt=1574193600000 ulid=01DT2QE1AYZR8DGYA6V3558JRM
level=info ts=2019-11-21T17:27:34.408Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1574193600000 maxt=1574200800000 ulid=01DT2Y9RJY28075FQHHQD19TPB
level=warn ts=2019-11-21T17:32:44.971Z caller=main.go:501 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2019-11-21T17:32:44.972Z caller=main.go:526 msg="Stopping scrape discovery manager..."
level=info ts=2019-11-21T17:32:44.972Z caller=main.go:540 msg="Stopping notify discovery manager..."
level=info ts=2019-11-21T17:32:44.972Z caller=main.go:562 msg="Stopping scrape manager..."

What you expected to happen:
A clean pod startup and recovery of the DB.

How to reproduce it (as minimally and precisely as possible):

  1. Install an older prometheus-operator version.
  2. Create a large amount of data (30 GB).
  3. Upgrade to chart version prometheus-operator-8.2.2 (app version 0.34.0); see the sketch after this list.
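
For step 3, a minimal upgrade sketch, assuming a Helm 2 client and a release named prometheus-operator (the release name is an assumption, adjust to your setup):

> helm repo update
> helm upgrade prometheus-operator stable/prometheus-operator --version 8.2.2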

Anything else we need to know:
The prometheus pod needs several restarts to become ready:
prometheus-prometheus-0 3/3 Running 2 14m

lifecycle/stale

Most helpful comment

If anyone is wondering how it can be done:

prometheus:
  prometheusSpec:
    containers:
      - name: prometheus
        readinessProbe:
          initialDelaySeconds: 60
          failureThreshold: 300

All 7 comments

The startup time depends on the size of the WAL that prometheus has written and not flushed to disk. The size of the database itself (30GB) doesn't affect it. The container restart behaviour would cause prometheus to re-start the WAL read from scratch, so if it ever got into a situation where it would take longer than allowable to read it, it would be in a crashloop and never recover. It sounds like the problem you are having is something else.
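
Startup/replay time can be gauged from the size of the WAL directory. A quick check, as a sketch, assuming the monitoring namespace and the default data path /prometheus (both assumptions, adjust to your deployment):

> kubectl exec -n monitoring prometheus-prometheus-0 -c prometheus -- du -sh /prometheus/wal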

It is theoretically possible to provide health-check overrides for the pods generated by prometheus-operator through the prometheusSpec.containers property (see the sketch below). Prometheus-operator does not offer a way to configure these directly. If you still believe there is an issue, please raise it here: coreos/prometheus-operator
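
A sketch of such an override in the chart values, in case the SIGTERM turns out to come from a failing liveness probe during WAL replay; the numbers are illustrative, not recommendations, and the operator should merge these fields into the container it generates with the same name:

prometheus:
  prometheusSpec:
    containers:
      - name: prometheus
        livenessProbe:
          initialDelaySeconds: 60
          failureThreshold: 60
        readinessProbe:
          initialDelaySeconds: 60
          failureThreshold: 300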

Hey @omitrowski,

Did you find a solution for this? I am having exactly the same problem.

@paulopontesm For now I live with the fact that Prometheus restarts 5 times until it gets its DB back online. According to vsliouniaev this is intentional. "It's not a bug, it's a feature" 😉

> According to vsliouniaev this is intentional

It is not intentional for prometheus to restart 5 times before its DB is back online. Its restart process is not stateful, so this is an issue with your specific configuration. Prometheus is receiving a SIGTERM from the Kubernetes cluster - why?

What is the container termination reason that Kubernetes provides?
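
One way to check the termination reason, assuming the pod name from above (the monitoring namespace is an assumption):

> kubectl describe pod prometheus-prometheus-0 -n monitoring
> kubectl get pod prometheus-prometheus-0 -n monitoring -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'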

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

This issue is being automatically closed due to inactivity.

If anyone is wondering how it can be done:

prometheus:
  prometheusSpec:
    containers:
      - name: prometheus
        readinessProbe:
          initialDelaySeconds: 60
          failureThreshold: 300
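
As a rough rule, a probe tolerates about initialDelaySeconds + failureThreshold × periodSeconds seconds of consecutive failures before it takes effect (a restart for a liveness probe, unready status for a readiness probe). With the values above and, for example, a 10-second period, that is on the order of 50 minutes, which should comfortably cover a long WAL replay.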