Prometheus-operator: prometheusrules: Malformed alert rule halts prometheus-operator rules processing

Created on 28 Aug 2018 · 31 comments · Source: prometheus-operator/prometheus-operator

What did you do?

Created a PrometheusRule with a malformed alert rule.

What did you expect to see?

Prometheus to ignore the bad rule, report the name of the failing PrometheusRule, and continue its operation.

What did you see instead? Under which circumstances?

Alert rules processing stopped.

Environment

Prometheus Operator running with:

-manage-crds=false

  • Prometheus Operator version:

    v0.23.2

  • Kubernetes version information:

    Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.9+coreos.1", GitCommit:"cd373fe93e046b0a0bc7e4045af1bf4171cea395", GitTreeState:"clean", BuildDate:"2018-03-13T21:28:21Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

    tectonic-installer

  • Manifests:

````yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-27T22:16:22Z
  deletionGracePeriodSeconds: null
  deletionTimestamp: null
  initializers: null
  labels:
    app: prometheus
    chart: prometheus-3.0.7
    heritage: Tiller
    prometheus: prometheus
    release: prometheus
    role: alert-rules
  name: prometheus-prometheus-rules
  namespace: foo
  resourceVersion: "134728661"
spec: |-
  ALERT high_load IF (1 - (avg by (instance, cpu, environment) (irate(node_cpu{mode="idle"}[1m])))) > 0.5 FOR 5m ANNOTATIONS { summary = "Instance {{ $labels.instance }} under high load", description = "{{ $labels.instance }} in {{$labels.environment}} is under high load.", }
  ALERT error_rate IF (1 - (avg by (instance, cpu, environment) (irate(controller_request_error_count[1m])))) > 200 FOR 5m ANNOTATIONS { summary = "Instance {{ $labels.instance }} high error rate count", description = "{{ $labels.instance }} in {{$labels.environment}} is have high error rate count.", }
  ALERT latency IF (avg by (instance, cpu, environment) (irate(http_request_duration_microseconds[1m]))) > 500 FOR 5m ANNOTATIONS { summary = "Instance {{ $labels.instance }} high error rate count", description = "{{ $labels.instance }} in {{$labels.environment}} is have high error rate count.", }
````

In the case above:

  • Prometheus operator logs:

Failed to list *v1.PrometheusRule: json: cannot unmarshal string into Go struct field PrometheusRule.spec of type v1.PrometheusRuleSpec

Another case:

A valid alert spec causes validation error:

````yaml
groups:
- name: general.rules
  rules:
  - alert: noop
    expr: 1
    for: 1m
    labels:
      severity: noop
    annotations:
      summary: 'noop alert'
      description: 'This alert is to enable validation that alerts can be triggered.'
````

The above rule makes the Prometheus Operator fail:

  • Prometheus Operator Logs:

E0823 21:39:59.146619 1 reflector.go:205] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:345: Failed to list *v1.PrometheusRule: json: cannot unmarshal number into Go struct field Rule.expr of type string
If -manage-crds is instead set to true, the Prometheus Operator also tries to pick up config maps, and since those might contain malformed alert rules as well, the operator crashes with an error:

ts=2018-07-11T14:21:11.558466227Z caller=main.go:175 msg="Unhandled error received. Exiting..." err="creating CRDs failed: waiting for PrometheusRule crd failed: timed out waiting for Custom Resource: failed to list CRD: json: cannot unmarshal number into Go value of type string"


All 31 comments

@brancz as we discussed over slack.

Thanks!

Thanks @jescarri for the detailed bug report.

General: As of today, the Prometheus Operator depends on the prerequisite that no custom resources are malformed. If that is not the case, it pauses its operations until the given custom resource is fixed. This issue will be solved in the future via the CRD validation feature, which prevents malformed custom resources from being created in the first place.

Why is expr: 1 not valid: The Prometheus ecosystem is YAML-focused, whereas the Kubernetes ecosystem is JSON-focused. While expr: 1 can be parsed into a Golang string via a YAML parser, Kubernetes first transforms YAML to JSON, where "expr": 1 is not a valid input for a Golang string. This could be solved via custom unmarshal functions. Given that a Prometheus rule expression is in most cases not just an integer, and that such a resource would in any case be rejected at creation time by Kubernetes validations, I am leaning towards not implementing custom unmarshal functions. A user could still write the expression as expr: "1" instead of expr: 1. @jescarri is that a valid option for you?
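To illustrate the asymmetry, here is a minimal, self-contained sketch (the rule struct is a stand-in for the operator's type, reduced to the one field that matters):

````go
package main

import (
	"encoding/json"
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

// rule is a stand-in for the operator's Rule type.
type rule struct {
	Expr string `json:"expr" yaml:"expr"`
}

func main() {
	var r rule

	// A YAML parser happily coerces the scalar 1 into a Go string:
	err := yaml.Unmarshal([]byte("expr: 1"), &r)
	fmt.Println(err, r.Expr) // <nil> 1

	// The JSON decoder used on the Kubernetes side does not:
	err = json.Unmarshal([]byte(`{"expr": 1}`), &r)
	fmt.Println(err)
	// json: cannot unmarshal number into Go struct field rule.expr of type string
}
````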

Why does the -manage-crds flag matter: When the -manage-crds flag is set to true, the Prometheus Operator attempts to create the Prometheus, Alertmanager, ServiceMonitor and PrometheusRule custom resource definitions. In order to check that they are properly in place, it runs a sample list operation. If there is already a malformed custom resource, the list operation fails, hence the error message:

ts=2018-07-11T14:21:11.558466227Z caller=main.go:175 msg="Unhandled error received. Exiting..." err="creating CRDs failed: waiting for PrometheusRule crd failed: timed out waiting for Custom Resource: failed to list CRD: json: cannot unmarshal number into Go value of type string"
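The check itself is essentially a poll around a sample list call. A rough sketch of the mechanism described above (not the operator's exact code; wait is k8s.io/apimachinery/pkg/util/wait):

````go
package operatorutil

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCRDReady polls until a sample List against the CRD succeeds.
// If any existing custom resource is malformed, the List keeps failing
// with an unmarshalling error, the poll times out, and the operator
// exits with the "timed out waiting for Custom Resource" error above.
func waitForCRDReady(listFunc func() error) error {
	return wait.Poll(5*time.Second, 2*time.Minute, func() (bool, error) {
		if err := listFunc(); err != nil {
			return false, nil // not ready yet (or list failing); keep polling
		}
		return true, nil
	})
}
````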

@jescarri let me know if this is of any help. While I see that this is not a perfect solution for you, I do see this mostly addressed in the future via CRD validations.

@mxinden thanks for the response.

Ouch, the prerequisite of properly formed CRDs makes sense for non-"free text" specs; however, with alert rules it's really hard without CRD validation. Is there any way to make the operator skip those malformed alert rules?

expr: 1 is a big problem, mostly because there's now a new flavor of the alert rules spec: an alert rule with expr: 1 passes promtool, so even if all our users lint their alert rules before installing them in the cluster, the operator will halt its operation.

Can the old way of managing the alert rules coexist with the CRD, maybe via a flag to select which method to use?

I agree that anything that passes promtool should be something the operator can work with. On list actions there is nothing we can do about unmarshalling errors, though.

I would propose that we just change the expr field from being strictly a string to intstr.IntOrString. That would solve this problem and not cause the operator to not work.

> I would propose that we just change the expr field from being strictly a string to intstr.IntOrString. That would solve this problem and not cause the operator to not work.

Great idea @brancz. I will look into that.
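For reference, a sketch of what that change looks like on the Rule type (field names mirror the upstream monitoring/v1 types, but treat the details as illustrative):

````go
package monitoringv1

import "k8s.io/apimachinery/pkg/util/intstr"

// Rule with expr switched from string to intstr.IntOrString.
// IntOrString implements json.Unmarshaler for both JSON numbers and
// strings, so `expr: 1` and `expr: "up == 0"` both decode cleanly.
// Note it accepts integers, not floats; see the expr: 0.75 case
// reported further down in this thread.
type Rule struct {
	Record      string             `json:"record,omitempty"`
	Alert       string             `json:"alert,omitempty"`
	Expr        intstr.IntOrString `json:"expr"`
	For         string             `json:"for,omitempty"`
	Labels      map[string]string  `json:"labels,omitempty"`
	Annotations map[string]string  `json:"annotations,omitempty"`
}
````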


In regards to proceeding with all valid rules in case one rule in a rule group is malformed: I would like to bring up the issue of atomicity here. Say a user wants to create two new rules: one malformed recording rule and one valid alerting rule that depends on the former. Skipping the recording rule and creating the alerting rule would result in a failing alerting rule evaluation.

For that reason I am advocating for atomic rule create/delete/update operations.

What are your thoughts?

I'm not sure atomicity would actually solve anything. As long as invalid rules can be created, Prometheus won't load them, so I think the right solution for this would be that any valid YAML PrometheusRule is a valid Prometheus rule file (so the intstr thing) and we implement an admission webhook to validate the result with promtool.
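The core of such a webhook could reuse the exact parser promtool calls. A minimal sketch, assuming the github.com/prometheus/prometheus/pkg/rulefmt package (the webhook HTTP wiring is omitted):

````go
package rules

import (
	"fmt"

	"github.com/prometheus/prometheus/pkg/rulefmt"
)

// validateRuleContent runs the same syntax and expression checks that
// `promtool check rules` performs. An admission webhook would marshal
// the PrometheusRule spec to YAML, pass it through here, and reject
// the object on error.
func validateRuleContent(content []byte) error {
	if _, errs := rulefmt.Parse(content); len(errs) > 0 {
		return fmt.Errorf("rule content is invalid: %v", errs)
	}
	return nil
}
````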

This is closed with #1845. @jescarri let me know if this is a valid solution. Feel free to reopen. As far as I understood, you are validating your rules via promtool, which should now be in line with the Prometheus Operator PrometheusRule format.

There seem to be two distinct issues here:

  1. a valid rule expr: 1 was being interpreted as invalid, blocking the operator
  2. invalid rules block the operator

The first was fixed by #1845; the second is still an issue. To solve the second, we still need to implement the webhook admission controller to validate rules before they are accepted by the API.

> The first was fixed by #1845; the second is still an issue. To solve the second, we still need to implement the webhook admission controller to validate rules before they are accepted by the API.

precisely

Thanks guys, yep this will help us in the short term.

@dcondomitti implemented something to check for bad rules at processing time, maybe that fix can be added too.

Appreciate the help!

Same thing happens if an annotations element has a value which is a number.
The logs don't give any help in working out which rule is causing the problem. Just:

E1017 14:01:54.775192       1 reflector.go:205] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:345: Failed to list *v1.PrometheusRule: json: cannot unmarshal number into Go value of type string

I am also seeing validation errors where the operator bails out. Is there any way for me to validate all my rules? I have 86 rules, with quite a lot of content as well. How can I find which one is breaking my operator?

Edit:
The message I see in the logs is:
ts=2018-10-23T08:42:59.884044646Z caller=main.go:193 msg="Unhandled error received. Exiting..." err="creating CRDs failed: waiting for PrometheusRule crd failed: timed out waiting for Custom Resource: failed to list CRD: json: cannot unmarshal string into Go struct field PrometheusRule.spec of type v1.PrometheusRuleSpec"

This happens both on v0.23.2 and v0.24.0. It appeared after testing with v0.24.0 but did not disappear after rollback, so it may not be caused by the upgrade; that may be a coincidence, but the correlation in time is there :-)

Sounds like one of your PrometheusRule objects has a spec field that is a string; some filtering with jq should allow you to find the object that is causing trouble.

Thanks for the pointer!
If anyone is interested, this jq saved my day today:

````
kubectl get -n monitoring prometheusrule -o json|jq '.items[]|{name: .metadata.name, spec: (.spec|map(type)? // "string") }'
````

@mxinden / @brancz hey guys, has this been fixed in any new release?

Gonna update our prometheus operator across the board and want to see if we can stop using our fork with the fixes that @dcondomitti added in PR https://github.com/coreos/prometheus-operator/pull/1871

It seems from the comments that it hasn't been addressed.

What's needed / how can I help to get the fix merged?

Thanks!

The origin of this issue should be solved with the IntOrString patch, but we don't have real validation in place yet. There have been some other thoughts, though, to use promtool in the prometheus operator and write the result of running promtool back into the PrometheusRule object status.

@brancz doesn't the patch proposed by @dcondomitti in #1871 help in the meantime?

If that patch is rebased, can it make it into a release?
That patch has served us well over the last couple of months: no problem at all and no interruption of service if someone puts in a bad rule.

Thanks!

I think it's a good start. I would additionally like to see the error being written back to the PrometheusRule object as a condition on failure/success. Otherwise I'm happy with it.
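To make the suggestion concrete, a purely hypothetical shape for such a condition (PrometheusRule has no status subresource at this point, so none of this exists upstream):

````go
package monitoringv1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PrometheusRuleCondition is a hypothetical status condition the
// operator could write back after validating a PrometheusRule.
type PrometheusRuleCondition struct {
	Type           string      `json:"type"`              // e.g. "Valid"
	Status         string      `json:"status"`            // "True" or "False"
	Message        string      `json:"message,omitempty"` // validation error, if any
	LastUpdateTime metav1.Time `json:"lastUpdateTime,omitempty"`
}
````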

@brancz cool I will try to find some time next week to submit the PR.

Thanks!

@brancz can you give me some pointers? I'm lost with the code refactor that happened a few months ago.

The validation needs to be done at the lister; however, that code is now automatically generated. How can I add code to those auto-generated functions?

The code needs to go in pkg/client/versioned/typed/monitoring/v1/prometheusrule.go, in the List method: basically just check whether each object is valid YAML. I'm planning on adding some rule validation as well.

as per https://github.com/coreos/prometheus-operator/blob/99ccc73bcb673127f237d798b979fa1caa720421/pkg/client/monitoring/v1/prometheusrule.go#L128-L168
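For context, the linked code boils down to decoding list items one at a time and skipping the ones that fail, roughly like this sketch (not the exact code from #1871):

````go
package listers

import (
	"encoding/json"
	"log"

	monitoringv1 "github.com/coreos/prometheus-operator/pkg/client/monitoring/v1"
)

// decodeItems unmarshals each raw list item individually, so a single
// malformed PrometheusRule is logged and skipped instead of failing
// the whole List call.
func decodeItems(raw []json.RawMessage) []monitoringv1.PrometheusRule {
	rules := make([]monitoringv1.PrometheusRule, 0, len(raw))
	for _, item := range raw {
		var rule monitoringv1.PrometheusRule
		if err := json.Unmarshal(item, &rule); err != nil {
			log.Printf("skipping malformed PrometheusRule: %v", err)
			continue
		}
		rules = append(rules, rule)
	}
	return rules
}
````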

Unfortunately I don't think that's possible anymore; it'll be overwritten on subsequent runs of the generator. There is a bright light though, as https://github.com/coreos/prometheus-operator/pull/2551 was just opened, which I believe is the correct fix for this, as invalid PrometheusRule objects won't even be possible to create :slightly_smiling_face:.

Yes, that's what I was thinking too. Well, we will have to find those rules in advance before updating the operator.

Thanks for the help! I look forward to testing the admission webhook.

This only solves part of the problem, since Prometheus constant numeric values are (essentially) floats.

I was bitten by this original Prometheus YAML (validates fine with promtool):

````yaml
- record: cluster:required_fraction:ratio
  expr: 0.75
````

Some background: we run in-house clusters on physical hardware. We need to plan for machine failures, so for purposes of capacity planning etc. we have a failure budget, but to make it easier in the rules we express it as an "availability budget". For cores and RAM, we extrapolate the last few days of requests "comfortably far into the future that we can get a purchase order signed, machines delivered, installed, built, and joined to the cluster" before the projection exceeds " * cluster:required_fraction:ratio".

For us, the immediate fix was to simply change the float 0.75 to the string "0.75", but it's definitely the case that making Rule.Expr an intstr.IntOrString doesn't completely solve the problem.

@vatine Did you report a bug for that?

https://github.com/coreos/prometheus-operator/pull/1845 fixed this starting with the v0.24.0 release.

Background: we operate shared kubernetes clusters with many different teams all using them. These teams want to provide their own rules, for their own prometheuses, all managed by a central prometheus-operator.

Due to various edge cases in PrometheusRule CRD loading, we wrote https://github.com/G-Research/prometheus-config-loader/ to ensure that only well-formed PrometheusRule CRDs get pushed to clusters. As a side effect, we have encouraged several teams to implement unit testing for aggregations and alerts, since prometheus-config-loader makes it easy to test them.

How does it prevent a restarted/new Prometheus pod from failing due to a problem in the rules that are already mounted?

It doesn't fix the "it's already happened" case, instead it tries to prevent it from happening in the first place.

prometheus-config-loader does two things to try to ensure that the situation can't occur. First, it uses the same Go structs as prometheus-operator does, and uses the exported API to send PrometheusRule objects to the API server. Second, it will by default run promtool to syntax-check all Prometheus rules files before they are turned into PrometheusRule objects (yes, it's possible to disable that checking with a flag, but by default it is done).

So basically like the linting tool we have here plus promtool checking? https://github.com/coreos/prometheus-operator/blob/master/cmd/lint/main.go

Would you like to add promtool checking to it?

Plus a small amount of template expansion (we are feeding almost-identical rules to ~20 clusters, but some of them are test clusters, so they should not have their page-alerts bubble up to PagerDuty).

Also, prometheus-config-loader takes "raw Prometheus rules" as input (that way, we don't have to unmarshal the Prometheus rules files before we feed them to promtool, although we do template-expand them, so I guess that's strictly speaking not too different). That would, at least, take care of the linting. Not sure how to deal with the unit testing, though.

Both the syntax checking and the unit testing are important for us, so (for now) we'll probably continue to use the tool. But it is open source, so if anyone happens to find it useful, we're in principle happy to look at PRs or issues.

For what it's worth, the way we do it: we write alerting rules in jsonnet (which also allows templating etc.), then generate the "normal" Prometheus rule file out of it to run alert unit tests on, and then, to actually apply, we generate them into a PrometheusRule object.
