Describe the bug
The updated Prometheus rules and dashboards from https://github.com/helm/charts/pull/21899 introduced the issue described in https://github.com/coreos/kube-prometheus/issues/503.
Essentially, the code_verb:apiserver_request_total:increase30d recording rule now computes the increase over 30d of data (previously it was only 7d), which uses considerably more CPU and makes rule evaluation a time-consuming operation.
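For reference, this is roughly what the recording rule looks like (taken from the error message quoted below); the 7d variant is only my paraphrase of the pre-8.12.14 behaviour for comparison, not the exact upstream rule:
groups:
  - name: kube-apiserver.rules
    rules:
      # current rule (chart >= 8.12.14): increase over a 30-day window
      - record: code_verb:apiserver_request_total:increase30d
        expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d]))
      # pre-8.12.14 behaviour used a 7-day window, roughly:
      # - record: code_verb:apiserver_request_total:increase7d
      #   expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[7d]))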
Version of Helm and Kubernetes: n/a
Which chart: stable/prometheus-operator >= 8.12.14
What happened: CPU utilization increased and there are a lot of "Prometheus is missing rule evaluations due to slow rule group evaluation." alerts, along with warnings such as:
level=warn ts=2020-04-20T18:17:38.237Z caller=manager.go:534 component="rule manager" group=kube-apiserver.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{job=\"apiserver\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"
What you expected to happen: CPU utilization stays reasonable and no rules skip evaluation
How to reproduce it (as minimally and precisely as possible): Update to version 8.12.14 or later on a cluster that has a good amount of 30d data for apiserver_request_total; the time and resources needed to evaluate the rule increase significantly.
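To gauge in advance whether a cluster is likely to hit the limit, a rough PromQL check like the following can be run (a sketch only; sum(count_over_time(...)) only approximates how many samples the 30d increase() would have to load):
sum(count_over_time(apiserver_request_total{job="apiserver"}[30d]))
If the result is near Prometheus's --query.max-samples limit (50,000,000 by default), the 30d rule is likely to fail with the "too many samples" error quoted above.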
Anything else we need to know: As a workaround, I have forked and patched the rules to use only 7d of data again in the following branch: https://github.com/smoke/charts/tree/prometheus-operator-workaround-code-verb-apiserver-request-total-increase30d
To use this workaround:
# clone the relevant branch into a local directory
git clone --depth 1 --single-branch --branch prometheus-operator-workaround-code-verb-apiserver-request-total-increase30d \
https://github.com/smoke/charts.git ~/smoke-prometheus-operator
# resolve the Helm chart dependencies
helm dependency update ~/smoke-prometheus-operator/stable/prometheus-operator
# install or upgrade from the local chart
helm upgrade --install prometheus ~/smoke-prometheus-operator/stable/prometheus-operator --namespace monitoring --version 8.13.0
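An alternative that avoids maintaining a fork is to disable the chart's bundled kube-apiserver rules and ship a patched 7d variant through values. This is only a sketch: the defaultRules.rules.kubeApiserver and additionalPrometheusRulesMap keys are my assumption about the chart's values layout for these versions, and the expression below is an approximation rather than the exact rule from my branch.
# values.yaml (sketch)
defaultRules:
  rules:
    kubeApiserver: false   # assumption: disables the bundled kube-apiserver rule groups
additionalPrometheusRulesMap:
  kube-apiserver-7d:
    groups:
      - name: kube-apiserver.rules
        rules:
          # keep the original record name so anything consuming it still resolves,
          # but compute the increase over 7d instead of 30d
          - record: code_verb:apiserver_request_total:increase30d
            expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[7d]))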
The discussion started in https://github.com/helm/charts/pull/22003, and since the resolution is not trivial and will be waiting on https://github.com/coreos/kube-prometheus/issues/503, I thought the best course of action would be an issue here to track the problem.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
:(
hey bot! keepalive :)
Can we sync now that https://github.com/coreos/kube-prometheus/pull/590 has been merged?
Personally, I would be glad to take on that task as well, in case nobody gets to it earlier.
I think we have to wait a little bit more. There have been quite a few PRs to prometheus-operator over the last couple of days:
https://github.com/helm/charts/pulls?q=is%3Apr+is%3Aopen+prometheus-operator
Especially this one https://github.com/helm/charts/pull/22974
Instead of updating the scripts and then running them, the author has updated the resources manually, so our sync will overwrite all of his changes.
In general, that PR is of doubtful quality, with a questionable number of commits. (Why not just delete the repo and open a PR from a fresh fork?)
@den-is we have been in a holding pattern waiting for this fix since April, so it would be nice if we could sync as soon as possible.
I can help stage a PR by omitting the following change if that is what you are looking for:
- {{- $namespace := printf "%s" (include "prometheus-operator.namespace" .) }}
+ {{- $namespace := .Release.Namespace }}
@dsexton
Sure, go ahead. I'm nobody here to block any change to that repo/chart, not even a reviewer; I just shared an opinion, and I submit PRs from time to time the same way others do.
My only recommendation would be:
We may need to reopen this; we are still seeing "too many samples" errors for three of the rules.
Please reopen this. It is happening way too often on one of our clusters.
Chart version: prometheus-operator-8.16.1
prometheus-prometheus-operator-prometheus-0 prometheus level=warn ts=2020-07-22T08:00:20.162Z caller=manager.go:577 component="rule manager" group=kube-apiserver-availability.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{code=~\"2..\",job=\"apiserver\",verb=\"LIST\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"
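One mitigation for the "too many samples" error (a sketch only, not a fix: the exact values path below is my assumption about how the chart passes settings through to the Prometheus custom resource, and raising the limit trades the error for higher memory usage during rule evaluation) is to raise Prometheus's per-query sample limit:
# values.yaml (sketch; assumes prometheus.prometheusSpec is copied into the Prometheus CR spec)
prometheus:
  prometheusSpec:
    query:
      maxSamples: 100000000   # default is 50000000; higher values allow more samples in memory per query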