Describe the bug
The updated Prometheus rules and dashboards from https://github.com/helm/charts/pull/21899 introduced the issue described in https://github.com/coreos/kube-prometheus/issues/503.
Essentially, the code_verb:apiserver_request_total:increase30d recording rule now computes the increase over 30d of data (previously it was only 7d), which uses considerably more CPU and makes rule evaluation a time-consuming operation.
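For reference, this is roughly what the recording rule looks like (taken from the error message quoted below); the 7d variant is only my paraphrase of the pre-8.12.14 behaviour for comparison, not the exact upstream rule:
groups:
  - name: kube-apiserver.rules
    rules:
      # current rule (chart >= 8.12.14): increase over a 30-day window
      - record: code_verb:apiserver_request_total:increase30d
        expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d]))
      # pre-8.12.14 behaviour used a 7-day window, roughly:
      # - record: code_verb:apiserver_request_total:increase7d
      #   expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[7d]))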
Version of Helm and Kubernetes: n/a
Which chart: stable/prometheus-operator >= 8.12.14
What happened: CPU utilization increased and there are a lot of "Prometheus is missing rule evaluations due to slow rule group evaluation." alerts, along with warnings such as:
level=warn ts=2020-04-20T18:17:38.237Z caller=manager.go:534 component="rule manager" group=kube-apiserver.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{job=\"apiserver\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"
What you expected to happen: CPU utilization stays reasonable and no rules skip evaluation
How to reproduce it (as minimally and precisely as possible): Update to version 8.12.14 or later on a cluster that has a good amount of 30d data for apiserver_request_total; the time and resources needed to evaluate the rule increase significantly.
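To gauge in advance whether a cluster is likely to hit the limit, a rough PromQL check like the following can be run (a sketch only; sum(count_over_time(...)) only approximates how many samples the 30d increase() would have to load):
sum(count_over_time(apiserver_request_total{job="apiserver"}[30d]))
If the result is near Prometheus's --query.max-samples limit (50,000,000 by default), the 30d rule is likely to fail with the "too many samples" error quoted above.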
Anything else we need to know: As a workaround, I have forked and patched the rules to use only 7d of data again in the following branch: https://github.com/smoke/charts/tree/prometheus-operator-workaround-code-verb-apiserver-request-total-increase30d
To use this workaround:
# clone the relevant branch into a local directory
git clone --depth 1 --single-branch --branch prometheus-operator-workaround-code-verb-apiserver-request-total-increase30d \
https://github.com/smoke/charts.git ~/smoke-prometheus-operator
# resolve the Helm chart dependencies
helm dependency update ~/smoke-prometheus-operator/stable/prometheus-operator
# install or upgrade from the local chart
helm upgrade --install prometheus ~/smoke-prometheus-operator/stable/prometheus-operator --namespace monitoring --version 8.13.0
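An alternative that avoids maintaining a fork is to disable the chart's bundled kube-apiserver rules and ship a patched 7d variant through values. This is only a sketch: the defaultRules.rules.kubeApiserver and additionalPrometheusRulesMap keys are my assumption about the chart's values layout for these versions, and the expression below is an approximation rather than the exact rule from my branch.
# values.yaml (sketch)
defaultRules:
  rules:
    kubeApiserver: false   # assumption: disables the bundled kube-apiserver rule groups
additionalPrometheusRulesMap:
  kube-apiserver-7d:
    groups:
      - name: kube-apiserver.rules
        rules:
          # keep the original record name so anything consuming it still resolves,
          # but compute the increase over 7d instead of 30d
          - record: code_verb:apiserver_request_total:increase30d
            expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[7d]))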
The discussion started in https://github.com/helm/charts/pull/22003, and since the resolution is not trivial and will be waiting on https://github.com/coreos/kube-prometheus/issues/503, I thought the best course of action would be an issue here to track the problem.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
:(
hey bot! keepalive :)
Can we sync now that https://github.com/coreos/kube-prometheus/pull/590 has been merged?
Personally, I would be glad to take on that task as well, in case nobody gets to it earlier.
I think we have to wait a little bit more. There have been quite a few PRs to prometheus-operator over the last couple of days:
https://github.com/helm/charts/pulls?q=is%3Apr+is%3Aopen+prometheus-operator
Especially this one https://github.com/helm/charts/pull/22974
Instead of updating the scripts and then running them, the author has updated the resources manually, so our sync will overwrite all of his changes.
In general, that PR is of doubtful quality, with a questionable number of commits. (Why not just delete the repo and open a PR from a fresh fork?)
@den-is we have been in a holding pattern waiting for this fix since April, so it would be nice if we could sync as soon as possible.
I can help stage a PR by omitting the following change if that is what you are looking for:
- {{- $namespace := printf "%s" (include "prometheus-operator.namespace" .) }}
+ {{- $namespace := .Release.Namespace }}
@dsexton
Sure, go ahead. I'm nobody here to block any change to that repo/chart, not even a reviewer; I just shared an opinion, and I submit PRs from time to time the same way others do.
My only recommendation would be:
We may need to reopen this; we are still seeing "too many samples" errors for three of the rules.
Please reopen this. It is happening way too often on one of our clusters.
Chart version: prometheus-operator-8.16.1
prometheus-prometheus-operator-prometheus-0 prometheus level=warn ts=2020-07-22T08:00:20.162Z caller=manager.go:577 component="rule manager" group=kube-apiserver-availability.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{code=~\"2..\",job=\"apiserver\",verb=\"LIST\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"
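One mitigation for the "too many samples" error (a sketch only, not a fix: the exact values path below is my assumption about how the chart passes settings through to the Prometheus custom resource, and raising the limit trades the error for higher memory usage during rule evaluation) is to raise Prometheus's per-query sample limit:
# values.yaml (sketch; assumes prometheus.prometheusSpec is copied into the Prometheus CR spec)
prometheus:
  prometheusSpec:
    query:
      maxSamples: 100000000   # default is 50000000; higher values allow more samples in memory per query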