Prometheus: Create a section ANNOTATIONS with user-defined payload and generalize RUNBOOK, DESCRIPTION, SUMMARY into fields therein.

Created on 18 Aug 2015  ·  102Comments  ·  Source: prometheus/prometheus

RUNBOOK was added in a hurry in #843 for an internal demo of one of our users, which didn't give it enough time to be fully discussed. The demo has been done, so we can reconsider this.

I think we should revert this change, and remove RUNBOOK:

  • Our general policy is that if it can be done with labels, do it with labels
  • All notification methods in the alertmanager will need extra code to deal with this
  • In future, all alertmanager notification templates will need extra code to deal with this
  • In general, all user code touching the alertmanager will need extra code to deal with this
  • This presumes a certain workflow in that you have something called a "runbook" (and not any other name - playbook is also common) and that you have exactly one of them

Runbooks are not a fundamental aspect of an alert, are not in use by all of our users and thus I don't believe they meet the bar for first-class support within prometheus. This is especially true considering that they don't add anything that isn't already possible with labels.

Most helpful comment

After 85 comments, many considerations and opinions, I now propose the following solution.

I encourage everyone to simply state :+1: or :-1: so this can function as a vote for a final conclusion. Pros and cons have been discussed extensively above.

  • The RUNBOOK field will be removed as requested.
  • The SUMMARY and DESCRIPTION fields will be removed.
  • The WITH clause will be renamed into LABELS.
  • The ANNOTATIONS clause is added to allow free-form key/value data that are not considered in the identity of an alert.
  • Values in LABELS and ANNOTATIONS are template-processed.

Example:

ALERT InstanceDown
  IF up == 0 FOR 10m
  LABELS {
    severity = "critical"
  }
  ANNOTATIONS {
    summary = "instance {{$labels.instance}} down"
    description = "instance {{$labels.instance}} has not been reachable for more than 10 minutes"
  }

The new Alertmanager will have configurable handling of annotation fields with reasonable defaults, e.g. summary and description.
The goal is no more configuration than previously for simple setups.

All 102 comments

Counterpoint from someone who doesn't feel too strongly either way: Having a special field for the runbook entry gives it clear semantic meaning and integrates well with notification mechanisms like PagerDuty, which do have a native runbook field and nicely display it when looking at an alert.

It seems this is mostly a debate between machine-readability (explicit semantics) of data vs. having to add code for each of these semantic special cases.

Then there's an argument for keeping it just because it exists already and removing it would be a breaking change.

I know @beorn7 feels strongest about this, so I think it's best if he discusses this directly with @brian-brazil before the rest of us will try to mediate :)

Having a special field for the runbook entry gives it clear semantic meaning and integrates well with notification mechanisms like PagerDuty, which do have a native runbook field and nicely display it when looking at an alert.

Are you sure about this? I can't find it in the PagerDuty docs (https://developer.pagerduty.com/documentation/integration/events/trigger). Adding new keywords in Prometheus for everything each notification method offers is not scalable or maintainable.

Then there's an argument for keeping it just because it exists already and removing it would be a breaking change.

If that's the case, we should set a higher bar on making changes - particularly ones added in a hurry. We're still not at 1.0, so shouldn't be scared of the occasional breakage.

On Tue, Aug 18, 2015 at 4:14 PM, Brian Brazil [email protected]
wrote:

Having a special field for the runbook entry gives it clear semantic
meaning and integrates well with notification mechanisms like PagerDuty,
which do have a native runbook field and nicely display it when looking at
an alert.

Are you sure about this? I can't find it in the PagerDuty docs (
https://developer.pagerduty.com/documentation/integration/events/trigger).
Adding new keywords in Prometheus for everything each notification method
offers is not scalable or maintainable.

Oh, you're right. I confused this with the client and client_url
fields, which get special treatment in PagerDuty's UI. I agree that we
shouldn't add new keywords for things that are too special for a given
notification method. Again, one could argue that RUNBOOK is general enough.

Then there's an argument for keeping it just because it exists already and
removing it would be a breaking change.

If that's the case, we should set a higher bar on making changes -
particularly ones added in a hurry. We're still not at 1.0, so shouldn't be
scared of the occasional breakage.

Agreed on the former! For the latter, breaking changes will always hurt
even before 1.0, but yeah, they're acceptable.

My strong feelings are about the relative importance of the runbook link compared to other fields. It's for example much more important than the summary. (I'm talking about large-scale organizations where you will be on-call for more than the few systems you work with day-to-day. Small-scare organizations arguably don't need Prometheus in the first place.)

I'm very much in favor of solving all of this with non-grouping labels. The runbook link was not introduced in a hurry but because we had already solved way less important alert components via dedicated fields.

Renamed the issue to reflect the discussion we had back in June.

I think this is a little trickier, as there will be labels coming from the alert that you don't want to group on (e.g. in a HA setup where each half has a different label to distinguish them) which is different from providing information about the alert itself as distinct from a particular instance of it.

Yeah, we have to solve that problem. But as long as we have no generic way to pass labels for things like runbook, summary, description or whatever else will show up in whatever notification mechanism is used, we cannot remove RUNBOOK. It's in active use at SC, and it is of crucial importance to enable receivers of an alert to act on it.

we have no generic way to pass labels for things like runbook, summary, description or whatever else will show up in whatever notification mechanism is used,

We have always had one of those in one labels.

It's in active use at SC, and it is of crucial importance to enable receivers of an alert to act on it.

We shouldn't be tying prometheus to the needs of a single company, and the question isn't whether we should support runbook but whether it deserves first-class support and all the maintenance and development burden it brings.

The current behaviour of runbook does not require a non-grouping label, as it's always a constant so it can be replaced by a grouping label today.

it is of crucial importance to enable receivers of an alert to act on it.

This is not my experience, I find looking at a playbook to be an act of desperation and worse than useless in general. This is why we need to be wary about adding features like this, not all companies and people have the same workflow and we need to keep Prometheus generic.

We have to conclude that our experiences differ.

I emphatically disagree that the RUNBOOK entry is based on the needs of a single company.

The runbook link is not necessarily a constant. A setup where the link depends on the alert name is easily imaginable ;). But we have to keep things even more flexible that that. Runbook link, summary, and description are all very similar as they need to be created from some kind of templating. More fields of that kind are possible depending on the use case and the used notification system. That's why we need a generic solution here, giving each user the flexibility to cater for their specific use case.

This. Having any of them hard coded into the statement was anti-Promethean in the first place.
Something like this is more fitting:

ALERT <name>
  IF <expr> FOR <dur>
  WITH {
    <label> = "<value>",
    ...
  }
  PAYLOAD {
    <label> = "<value>",
    ...
  }

Where the difference between WITH and PAYLOAD are attachment to time series and that PAYLOAD values should be templated.

The runbook link is not necessarily a constant. A setup where the link depends on the alert name is easily imaginable ;).

Indeed and we will support that someday, though that's still constant relative to the alertname.

But we have to keep things even more flexible that that. Runbook link, summary, and description are all very similar as they need to be created from some kind of templating.

It's not just those - all alert labels need to be able to be created from templating.

That's why we need a generic solution here, giving each user the flexibility to cater for their specific use case.

I'd propose templated labels as the solution, and then similarly templated notifications on the alertmanger side. None of this requires handling runbook specially.

Where the difference between WITH and PAYLOAD are attachment to time series and that PAYLOAD values should be templated.

It's a bit tricker, the most common case will be that the non-grouping labels are global labels so won't be part of the alert definition. This indicates towards a global setting of some form.

On Wed, Aug 26, 2015 at 1:13 PM, Brian Brazil [email protected] wrote:

I'd propose templated labels as the solution, and then similarly templated notifications on the alertmanger side. None of this requires handling runbook specially.

And neither description or summary. That's my whole point. I'm just
opposed to removing RUNBOOK (or SUMMARY or DESCRIPTION, although less
so) before we have a solution in place.

I tried to express that in the title of this issue. If somebody can
come up with a better title, be my guest.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Are we really thinking about even removing SUMMARY or DESCRIPTION as language-level constructs? Basically every notification mechanism depends on them, and thus we benefit a lot from statically checking that they are specified.

And neither description or summary. That's my whole point. I'm just
opposed to removing RUNBOOK

I don't see how this helps your use case as non-grouping is orthogonal to runbook as it never affects grouping.

What exactly are you trying to get into your notifications with this?
If you had used a label it'd have appeared in all notifications with no code changes required (at the moment it's only in pagerduty and opsgenie).

I'm not so sure about description and summary as they are going to be the main uses of templates, and unlike almost all other potential templated labels can vary significantly from eval to eval (e.g. a top 10 worst servers list or a string with the current value) - so you don't want to group on those or put them in timeseries.
I'm for keeping description and summary.

Perhaps the "non-grouping" is distracting... I want a very simple and obvious way to allow users to create labels via templates that can then be used to assemble the notification.

I'm opposed to treating "description" and/or "summary" special. Everything that applies to them also applies to "runbook" (and possibly other fields). I don't support the assumption that every notification mechanism depends on "description" and "summary" and those therefore need to be checked statically by Prometheus in contrast to other fields.

"runbook" or any other custom label potentially need exactly the same flexibility and variability as "description" and/or "summary". We should not put any assumption about const'ness of "runbook" into our design.

I want a very simple and obvious way to allow users to create labels via templates that can then be used to assemble the notification.

We're in agreement on that. Right now all we have is labels and hardcoded templates, but ultimately we should have go templates for all notification fields in the alertmanager.

I'm opposed to treating "description" and/or "summary" special. Everything that applies to them also applies to "runbook" (and possibly other fields).

Technically they're the same, in practice they're special. As indicated above they tend to vary from eval cycle to eval cycle, so can't be used safely as either grouping or non-grouping labels. The other use cases that initially come to mind that are similarly special (e.g. parametrised urls based on alert value) can be handled via summary/description so I'm not sure we need to genericise that - let's wait for a use case.

Based on this discussion, I've updated the summary.

I also realised there's another way to implement this without code changes: Put it in the description.

Reviewing this discussion, the only argument I see as to why this feature needs to stay as-is is that SoundCloud is using it. This has not prevented us from removing features in the past.

Obviously you can put into the description what you want. It _could_ be a runbook link. You could also put the description in the runbook field. However, my whole point is that a runbook link is so crucial that it most be treated as a 1st class citizen, for one so that you can prominently place the link in your resulting notification but also to signal to the naive user that they really have to fill in that field for meaningful alerts.

Now I completely realize that you disagree. And others. Possibly all of you. But I have to restate (over and over again) that we are not keeping the feature because Soundcloud is using it. We are keeping it because I deem it so important. (And Soundcloud uses it because I happen to be an a TL role there.)

You could all decide to overrule me. Unsolvable disagreements happen, and at some point we have to go one way or the other. (And this particular feature will surely not doom the project one way or the other.) However, it's moot to discuss over and over again if it matters if Soundcloud uses the feature because it's not about that. It's about the opinion of an important project member.

I restate that my suggestion is to acknowledge that use cases differ and generalize "annotations" (or whatever to call them). Let's implement that first, instead of kicking out an existing feature that is important (at least for me).

I recognise that there's a use case. My contention is that it's not important enough to justify increasing complexity for all users of alerting by forcing them to deal with this as a special case, compared to the benefit it brings.

generalize "annotations" (or whatever to call them).

What are you thinking here?

And I recognize that you don't think it's important enough. We have to agree that we disagree here.

The generalized labels would probably only need the ability to be expanded via the usual templating. If it's the right thing to enable template exansion in general within WITHor have a separate section I can't say.

# Current approach.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  WITH {
    severity="page"
  }
  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
  RUNBOOK "http://wiki.int.example.com/InstanceDown?service={{$labels.job}}&zone={{$labels.zone}}"
# WITH with templating, to be vetted.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  WITH {
    severity="page"
    summary="Instance {{$labels.instance}} down"
    description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
    runbook="http://wiki.int.example.com/InstanceDown?service={{$labels.job}}&zone={{$labels.zone}}"
  }

Ideally, we want some escaping magic or features to make it easy to render valid URLs.

The generalized labels would probably only need the ability to be expanded via the usual templating.

I'm for adding that for all labels (there's advanced use cases for it), so I can put that PR together.

We can't make summary/description labels though (and need to be a tad careful with runbook, if you're hitting that case you probably should put it in the description).

Cross-inspirations from https://github.com/prometheus/prometheus/issues/1043 : labels that go via template expansion change a lot, so are not really suitable as proper labels on the ALERT timeseries. So we could have a separate stanza for template expansions that are not labels.

ALERT InstanceDown
  IF up == 0
  FOR 5m
  WITH LABELS {
    severity="page"
  }
  WITH ANNOTATIONS {
    summary="Instance {{$labels.instance}} down"
    description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
    runbook="http://wiki.int.example.com/InstanceDown?service={{$labels.job}}&zone={{$labels.zone}}"
  }

labels that go via template expansion change a lot, so are not really suitable as proper labels on the ALERT timeseries.

Depends how you do it, where it usually comes up they're constant relative to a given alert (mostly for advanced alert routing) - so I'm not worried there.

Where the real problem is if someone does runbook="{{$value}}"

We've got four options there:

  • Don't support it
  • List it as a non-grouping label (still breaks ALERT)
  • Put it in the description
  • Generalise as you suggest

My instinct is that generalising will cause more problems than it will solve though making alert notifications too generic, and that we should leave summary/description as the only special cases. They do map pretty well to the notification methods we have thus far.

  • List it as a non-grouping label (still breaks ALERT)

Just for my understanding: how would a templated non-grouping label break ALERT?

Just for my understanding: how would a templated non-grouping label break ALERT?

It it had labels that weren't constant relative to the alert (e.g. contained {{$value}}) it'd create a new timeseries every eval cycle, which would be very difficult to deal with if you cared about the content of ALERT.

In practice, I think we'd recommend linking to a console (passing required labels) that'd display whatever time-series you want and avoid this problem. Accordingly I think side-stepping this problem is practical, and likely the best approach overall.

But I thought the whole purpose of non-grouping labels is that they
wouldn't be attached to any time series, but just serve as extra metadata
in alerts.
On Sep 1, 2015 15:57, "Brian Brazil" [email protected] wrote:

Just for my understanding: how would a templated non-grouping label break
ALERT?

It it had labels that weren't constant relative to the alert (e.g.
contained {{$value}}) it'd create a new timeseries every eval cycle,
which would be very difficult to deal with if you cared about the content
of ALERT.

In practice, I think we'd recommend linking to a console (passing required
labels) that'd display whatever time-series you want and avoid this
problem. Accordingly I think side-stepping this problem is practical, and
likely the best approach overall.


Reply to this email directly or view it on GitHub
https://github.com/prometheus/prometheus/issues/1002#issuecomment-136730618
.

But I thought the whole purpose of non-grouping labels is that they
wouldn't be attached to any time series, but just serve as extra metadata
in alerts.

You want non-grouping labels attached to alerts and timeseries, as it's how you'll distinguish time-series from different HA prometheis in your long-term storage. These would be typically configured at the prometheus-level (usually as global labels once we fix those) rather than the alert label.

What Beorn is proposing results in a 3rd class of key/value pair that only applies to alerts.

Ok, that's a definition confusion then. I hadn't heard that interpretation of "non-grouping" labels yet. I'm pretty sure @beorn7 also meant what I meant (labels that are somehow associated with alert notifications / displays, but not attached to the time series).

labels that are somehow associated with alert notifications / displays, but not attached to the time series

Right, those aren't labels then though as they don't have any of the other properties we'd associate with labels - they're just key-value metadata.

If we want to add that we need to determine what the use cases are and if it's worth it. Noone has yet proposed a use case to even begin considering adding that as a feature.

If we want to add that we need to determine what the use cases are and if it's worth it. Noone has yet proposed a use case to even begin considering adding that as a feature.

I'm not sure I understand. @beorn7 proposed exactly this kind of label (or just "key/value pair" if that's the better term) in https://github.com/prometheus/prometheus/issues/1002#issuecomment-134963120 for the use case of the various metadata items which are now keywords.

I'm not sure I understand. @beorn7 proposed exactly this kind of label (or just "key/value pair" if that's the better term) in #1002 (comment) for the use case of the various metadata items which are now keywords.

He proposed it, but he didn't provide a use case. Runbook doesn't require this feature, so we're left with summary/descrition which are probably best left as special cases. What is the problem we'd be solving by adding this?

@brian-brazil Ok, got it now what you mean. And you proposed that if someone needs a templated runbook, they could solve that by putting it into the description instead. I don't feel strongly at the moment, guess I'd be ok with that. But maybe others have a stronger opinion :)

Yeah, if you look at say how relabelling is used in SD that's way too generic from a usability standpoint - but when you consider maintenance/sanity it's the least worst option overall. If we're going to add a similar level of genericity to alerts metadata, we should have a very good reason.

The reason to generalize was that already among the few of us, we have widely diverging ideas about which metadata is crucial and which not.

Asking the user to put the runbook link somewhere in the description doesn't help my use case. I want a notification template that creates big and fat red letters "MISSING RUNBOOK" if there is no runbook. I want a way to automatically identify alerts that have no runbook. In most cases, the runbook link will be completely static, so there is no problem putting it into a label. But (and we are running in circles here) I strongly dislike the fact that something like DESCRIPTION and SUMMARY (which are in most cases syntactic sugar over the alert expression) are 1st class meta data but the one field that distinguish a "I have no clue what is alerting me here" from "Yeah, I can start and resolve the outage immediately" is buried somewhere...

I don't really get what is so complicated for the user if the meta-data keys are freely defineable, but I do see how we can accommodate all possible use cases discussed here with it.

I want a notification template that creates big and fat red letters "MISSING RUNBOOK" if there is no runbook.

Thanks for the uses cases. That we can do with notification templates for first-class/label. If we allow regex matching in notification templates, that should work for descriptions too (I'd rather avoid going the full complexity of console templates for notification templates, as that'd be confusing for users to figure out what goes where, but string matching should be safeish to offer).

I want a way to automatically identify alerts that have no runbook.

That's a more interesting use case. Let's presume there's a parser of some form exposed, as it's going to be difficult to do otherwise. First-class keyword and label work out of the box, description you're looking at a regex again.

We should probably add that parser to the roadmap as there's a multitude of use cases, some part of promtool I guess?

I don't really get what is so complicated for the user if the meta-data keys are freely defineable, but I do see how we can accommodate all possible use cases discussed here with it.

If you're creating a notification template, you can no longer simply iterate over all the labels to pull out all relevant information. You'd need to iterate twice, and still special case summary/description as you'll likely want them on the top/bottom respectively.

There is other metadata we're passing currently that's not user controlled, such as a link to the alerting expression, the expression and the active time and which a user may or may not want in their notifications - so that may all need special casing too. This is currently a bit of a mess as it's passed in as a key-value pair, but we presume it has certain members - so all of those should be part of a defined schema.

I adjusted the title to get rid of the misleading reference to non-grouping labels.

Another use case that was already discussed in some depth is to allow notifications that contain little pre-rendered graphs describing the current condition, let's say slack notifications with embedded pictures. The logic that assembles the notification would be in a special slack module, but it would require specific meta data, which could be easily configured in the alerting expression, but we need a way to propagate it.

Another use case that was already discussed in some depth is to allow notifications that contain little pre-rendered graphs describing the current condition, let's say slack notifications with embedded pictures.

I'm wary of that reliability wise, as we need to keep the time to evaluate alerts strictly bounded.
Console templates are also not currently powerful enough to do this.

require specific meta data

Which meta data in particular?

Let's boil this down to the basics:

"Non-grouping labels" was indeed a red herring. As Brian has explained, it's something completely different, in my understanding orthogonal to the feature we are discussing here.

Here we only need to talk about two things alerts have:

  • Labels - which are just normal labels of the ALERT time series with all implications, currently set in the WITH stanza. The only question here is if we want to allow using the template language to set the value of a label (which has the risk of accidentally screwing up the intended time series behavior but might serve specific advanced use cases).
  • Metadata - which is key-value structured (as labels) but only attached at the time an alert is sent to the alert manager. There is currently no way for the user to set arbitrary key-value pairs in the metadata. However, metadata (like labels) should be freely accessible for whatever mechanism is used to assemble a notification. (E.g. if I'm assembling an email notification, I want to include any key or value at any place in the mail.)

The data structures currently already allow arbitrary key-value pairs in the metadata. The big question is: "Do we want to allow users to set arbitrary key-value pairs?" The answer will be derived from weighing the answers to the following two questions:

  1. What are the risks and the overhead of allowing users to set arbitrary key-value pairs?
  2. What are the benefits of allowing users to set arbitrary key-value pairs?
    Once we have answered the "big question", we can discuss if we want to remove the special treatment of DESCRIPTION, SUMMARY, and RUNBOOK and just treat them as any other arbitrary key-value pair.

My take: We can probably discuss (2) at length. I'm very confident that the need for user-specified key-value pairs will come up. It's not the "if" but more the "will the benefits outweigh the risks", so I believe it is more helpful to discuss (1) in more depth, as I see very little risk and overhead in allowing arbitrary key-value pairs. I haven't really understood the concerns voiced so far, however, so a clearer explanation thereof would be appreciated. If we then follow through and generalize DESCRIPTION, SUMMARY, and RUNBOOK, we would arguably even _reduce_ overhead. Those three glorified fields are currently dragged through the whole stack, as syntactic elements in the language, as fields in the data structure, and what not, only to end up pasted into a message string at some point. Since each of those fields will be optional, I don't really see a justification for the special full-stack treatment.

@brian-brazil About my example with inserted little graph pics: This has nothing to do with alert evaluation. It's just an added piece of information, at best effort level. The metadata needed to include a separate expression and a time frame (start and stop) for which to evaluate it (on the Prometheus server that sends the alert). The evaluation of the expression would happen at the time alertmanager renders the notification message (in my example a Slack message, and it would happen best effort, so inability to render would not block delivery). The Slack module of alertmanager would retrieve the evaluation result and plot it. While this example is made up and certainly not thought through, it's inspired by existing tools for Nagios, where similar features were requested by users, and we realized how much simpler such a feature could be implemented in the Prometheus ecosystem.

very confident that the need for user-specified key-value pairs will come up. It's not the "if" but more the "will the benefits outweigh the risks"

I'd like to understand the use cases more, the only one suggested so far is something to do with slack notifications so I'd like to flesh out what these potential use cases are as I'm having difficulty coming up with ones myself.

Since each of those fields will be optional, I don't really see a justification for the special full-stack treatment.

There's very little justification for runbook being special (which is why I want to remove it), however that's not the case for summary/description. Every notification method has something that looks like at least one of those, so having them special allows users to create alerts that produce reasonable notification output for all notification methods out of the box.

If that's genericised, the likely outcome is that users will end up unnecessarily doing over-complicated and over-engineered things with alert notifications; effectively there'll be too much freedom. If we keep summary/description special things will Just Work for the vast majority of users, the more advanced use cases are generally still possible and we don't end up in that place.

I'm proposing having summary/description optional to make it easier for users to initially ramp up, it's also conceivable that for some alerts description won't be useful due to only using notification methods with limited message lengths.

While this example is made up and certainly not thought through, it's inspired by existing tools for Nagios, where similar features were requested by users, and we realized how much simpler such a feature could be implemented in the Prometheus ecosystem.

I think it's something that'd be valuable to support in some fashion.

The evaluation of the expression would happen at the time alertmanager renders the notification message (in my example a Slack message, and it would happen best effort, so inability to render would not block delivery). The Slack module of alertmanager would retrieve the evaluation result and plot it.

The sort of feature doesn't belong in the alertmanager for reliability reasons (and there's a separation of concerns argument too), but we can kick it elsewhere without affecting the core idea.

My initial thoughts on how to do that would be an IMG tag in the description, and the users browser/email client/slack would request that from some suitable server. The only real problem there is the timestamps to use, and prometheus and the alertmanager would both have access to that to pass in as a url parameter. So in principal that could even be done with a normal label. Maybe we should have Prometheus shell out to gnuplot :)

One thing to be careful with this what happens if there's hundreds of these in a single notification, as that could cause an overload for Prometheus.

On Wed, Sep 2, 2015 at 1:20 PM, Brian Brazil [email protected] wrote:

There's very little justification for runbook being special (which is why I want to remove it), however that's not the case for summary/description. Every notification method has something that looks like at least one of those, so having them special allows users to create alerts that produce reasonable notification output for all notification methods out of the box.

We have already concluded that our views fundamentally differ here.
There is no point in repeating that over and over.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

WRT fleshing out the use cases: I'm fine waiting with the generalization decision until we have the use cases more concretely.

We just cannot remove RUNBOOK until then because every notification method needs it.

WRT fleshing out the use cases: I'm fine waiting with the generalization decision until we have the use cases more concretely.

I can agree to that.

We just cannot remove RUNBOOK until then because every notification method needs it.

This is not the case, many notification methods don't implement RUNBOOK and none of them have RUNBOOK or equivalent as a first-class feature. Thus every notification method does not need it. Everything currently possible with RUNBOOK is currently possible with a normal label.

No use cases for RUNBOOK have been presented that require it to be special, it currently appears to be special for the sake of being special.

Notification methods that do not use RUNBOOK are buggy and need to be fixed.
Every notification needs a link to a runbook. That must not be hidden in the description. There must be a way to render it as a proper link at a well defined position.

I stop repeating the same arguments over and over now. My point is made. If the whole world feels we don't need runbook, then go for it, and I'll silently weep in the corner. But if it's only @brian-brazil who desperately wants to remove it now, I think we should keep it until we have a way to cater for my use case in a different way.

Every notification needs a link to a runbook.

This is how you want your notifications to appear, it doesn't apply to everyone.

There must be a way to render it as a proper link at a well defined position.

So this is your core use case. Would notification templates satisfy this?

It's unclear that the current implementation meets this use case, as only in email notifications in which it's in a clear position (below the alertmanger url and description) for the other two it's a random key-value entry.

Really, everything I can say now has already been said. I'm sorry.

Here's Beorn's reasons as part of a discussion on IRC last week on this:

11:38 @beorn7 (1) Yes, I do truly believe that every alert should have a runbook link. This experience stems from two very different organizations, also from feedback from very different people. Yes, there are cases where this is not needed, but that's in smaller orgs which are arguably not the main target groups of Prometheus. And yes again, I do have to acknowledge that not everybody is of that opinion so we won't be able to mandate a runbook entry, but we have already

11:40 @beorn7 (2) As it is difficult to educate writers of alerting rules to always add a runbook link, I emphtically reject the possibility to have the runbook link treated in a syntactically 2nd class position. Of course, I personally would be happy to sink the runbook link somewhere into a label or as part of DESCRIPTION, but I know I'll have an impossible job teaching everybody in a large organization to do the same. I want runbook to be 1st class mainly due to educat

11:44 @beorn7 (3) Yes, currently most notification mechanism don't support the 1st class runbook entry (which is obvious as we only added it recently). I do want this to be changed (but I'm not working on alerting myself.) Arguably, that ever supports my case for having it 1st class, so that writers of new notification mechanism at least consider the option of adding it somewhere for those that use it.

11:45 @beorn7 (4) While I feel strongly about all of the above, I won't enforce my opinion against all of you. But currently, it looks like my strong opinion against Brian's strong opinion, and everbody else doesn't feel strongly either way.

There also appears to be general consensus that we'll have notification templates.

I do truly believe that every alert should have a runbook link.

I believe that's a valid use case, and something we should support in some fashion. Notification templates will offer an additional option, as you could put in the link automatically presuming your setup allows calculating the link from the alertname.

I personally would be happy to sink the runbook link somewhere into a label or as part of DESCRIPTION, but I know I'll have an impossible job teaching everybody in a large organization to do the same. I want runbook to be 1st class mainly due to educat

Yes, currently most notification mechanism don't support the 1st class runbook entry ... Arguably, that ever supports my case for having it 1st class, so that writers of new notification mechanism at least consider the option of adding it somewhere for those that use it.

I see this as a strong reason against. Having this as 1st class requires that not only us as developers have an increased maintenance burden, and in addition that our users will also have this burden once notification templates are implemented.

Adding something just so humans manually have to fill it out, because they don't really want to fill it out in the first place, doesn't seem like a great strategy in terms of your goal of getting people to use runbooks. It seems more likely to me that those who were going to not put in a runbook will continue to not do so, while still causing increased maintenance costs.

This seems more a social problem than a technical one, and this is not an effective technical solution to it. What would work as technical solutions are some form of pre-commit check that alerts have the properties you want (which as discussed, we should add support for to promtool) or if you're adding the link in the alertmanager some tool that checks that all alerts have matching runbooks.

Adding complexity to Prometheus to server as an educational tool that may or may not work doesn't seem like a good tradeoff to me. Education such as talks and presentations would seem more beneficial to the goal. I also note that this goal is very different than what was originally presented as the justification for this feature.

In addition if we ever do have some form of label that isn't used for grouping in alerts we'd still likely treat SUMMARY and DESCRIPTION specially due to how they map reasonably well to notifications. Renaming description back accordingly.

On Thu, Sep 24, 2015 at 2:22 PM, Brian Brazil [email protected]
wrote:

I believe that's a valid use case, and something we should support in some
fashion. Notification templates will offer an additional option, as you
could put in the link automatically presuming your setup allows calculating
the link from the alertname.

That a big presumption. Didn't work at Google. Won't work at SoundCloud. I
want something better.

Yes, currently most notification mechanism don't support the 1st class
runbook entry ... Arguably, that ever supports my case for having it 1st
class, so that writers of new notification mechanism at least consider the
option of adding it somewhere for those that use it.

I see this as a strong reason against. Having this as 1st class requires
that not only us as developers have an increased maintenance burden, and in
addition that our users will also have this burden once notification
templates are implemented.

That's true for every "fixed" field. We already have user feedback asking
why there are three different fields: alertname, summary, description.
_Any_ partitioning is somehow arbitrary, and while the summary-description
partition works well for emails (arguably the _worst_ notification mode for
alerts), it's not a fit for SMS (which only want a short form, i.e. the
summary) or chat messages (which only want the long form, i.e. description).

Adding something just so humans manually have to fill it out, because they
don't really want to fill it out in the first place, doesn't seem like a
great strategy in terms of your goal of getting people to use runbooks. It
seems more likely to me that those who were going to not put in a runbook
will continue to not do so, while still causing increased maintenance costs.

I disagree. But I guess no point in arguing about it.

This seems more a social problem than a technical one, and this is not an
effective technical solution to it. What would work as technical solutions
are some form of pre-commit check that alerts have the properties you want
(which as discussed, we should add support for to promtool) or if you're
adding the link in the alertmanager some tool that checks that all alerts
have matching runbooks.

But then the same is true for description and summary. Every organization
could have their own set of requirements and enforce them through said
mechanisms.

I would like to provide a sane default setting that does not require any
sophisticated additional setups, and that, as expressed multiple times,
includes a runbook.

Nothing really new is happening in this discussion. All arguments are made.

It boils down to two questions:

(1) Do we want to have a fixed set of special fields that are typically
used in notifications, or do we make it all free-form key-value entries in
the payload?

(2) If the answer to (1) is yes, what is that set of special fields? (I
assume we all gree that none of the fields should be mandatory.)

(There is a 3rd question, which we postponed until somebody wants to
implement something that needs it:
(3) If the answer to (1) is yes, do we still want to enable the user to add
key-value entries to the payload?
)

The problem is that the answer to (2) will influence (1), i.e. if somebody
is satisfied with the outcome of (2), they will be happy with having fixed
fields, and vice versa.
I even see a dependency of (1) on (3) because a separate payload section
(by whatever name it finally goes) makes the DESCRIPTION and SUMMARY lines
look weird).

The interdependency of the question makes simple voting difficult. But
perhaps by listing the answers of everybody so far, we can get some clarity
of direction. My try to summarize just the answers without any of the
arguments, as they are all listed in this issue already (will use BR, BB,
JV, FR to mark answers of the four people involved so far - I have
solicited FR's answers in person):

(1) BB: Fixed. JV: Fixed. BR: Happy with fixed if it includes RUNBOOK but
trending towards free-form given answers to (2)/(3). FR: free-form.
(2) BB: description/summary. JV: ?. BR: description/summary/runbook. FR:
fine with both, description/summary with or without runbook.
(3) BB: Only once a non-vaporware use-case shows up. JV: ?. BR: expects to
be required eventually anyway. FR: no (but strongly wants payload,
therefore answer (1)).

Please correct/add where necessary.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales with
Company No. 6343600 | Local Branch Office | AG Charlottenburg | HRB 110657B

To fill out places where there are question marks for me :)

(2) JV: description/summary, but can be convinced that everything free-form is the way to go if we don't hardcode anywhere, neither in the source (alerting rule) nor in the destination (Alertmanager template and notification mechanism). Meaning, I don't want a hardcoded "summary" label that a notification mechanism expects, but have it user-entered on the other end (the alerting rule). Hardcoded fields should be statically verifiable, otherwise it should be all user-configurable / templatable and it's all in the user's responsibility realm.

(3) JV: no current strong opinion / thoughts.

That a big presumption. Didn't work at Google. Won't work at SoundCloud. I
want something better.

At Google our mutual team did exactly that and I'd say it worked quite well, as almost all non-trivial alerts had playbooks without them being specified in the alerting rules.

That was primarily due to the agreed practice of having playbooks for all alerts, rather than any technical aspect (the technical assistance to make this easier to follow came much later).

_Any_ partitioning is somehow arbitrary, and while the summary-description
partition works well for emails (arguably the _worst_ notification mode for
alerts), it's not a fit for SMS (which only want a short form, i.e. the
summary) or chat messages (which only want the long form, i.e. description).

Summary/description is a good fit for SMS - as you only take the summary. Chat messages vary depending on the exact medium (IRC -> summary, Slack/hipchat - summary or summary+description).

I don't think you'd ever use just the description, as then you'd have duplication when you've both present.

But then the same is true for description and summary. Every organization
could have their own set of requirements and enforce them through said
mechanisms.

Giving organisations the option to choose how to configure that would I suspect cause more problems than it solves, as they're broadly notification method specific rather than organisation specific.

As above summary/description map reasonably well to the various notification methods we support as they all have some short and/or long form aspects.

Meaning, I don't want a hardcoded "summary" label that a notification mechanism expects, but have it user-entered on the other end (the alerting rule). Hardcoded fields should be statically verifiable, otherwise it should be all user-configurable / templatable and it's all in the user's responsibility realm.

This is why I'm for summary/description being special. The alternative is nasty from a usability and maintainability perspective.

I assume we all agree that none of the fields should be mandatory.

This appears to be correct per #1043. Alertname is the only thing that absolutely needs to be mandatory.

Please correct/add where necessary.

Seems about right.

BR: expects to be required eventually anyway. FR: no (but strongly wants payload,
therefore answer (1))

Are particular reason you're expecting this to come along? My experience tells me that the description field is quite sufficient for all use cases in this area.

My Google experience was the following: Especially old-hands and some teams in general were very meticulous with the Wiki. Others were not. So scripts were written to find alerts and flag those without wiki pages, with varying success... Other teams preferred sites, or Google docs, ... and some had more generic runbooks covering whole groups of alerts. (I'm not saying it's a good practice, but it's common.) The percentage of pages I got with "wiki not found" and I had no clue where to even start was just too damn high.

But I already said that I'm speaking from the experience from two reasonably large organizations. I will not repeat myself and shut up now. Everything is already answered somewhere above, really.

Only the one: Most cases I have seen use DESCRIPTION as a long form of SUMMARY. Above Brian recommends to not do that but to use DESCRIPTION as an extension to SUMMARY. Seems to be a common misunderstanding, and shows how the partition is not foolproof. I'd argue that it would be way more robust to only have one field then, of which the first sentence (up to the 1st period or newline or something) but not more than x characters are the SUMMARY and the rest is DESCRIPTION (similar to git commit messages).

Anyway, nothing fundamentally new.

The percentage of pages I got with "wiki not found" and I had no clue where to even start was just too damn high.

My issue was that where runbooks did exist (and they usually did), they tended to shed no light on the problem at hand. A big reason I have very strong opinions on many aspects of monitoring is continuously having to deal with situations where all I effectively have to work from is an alert expression and the source code.

If we're going to have a special field runbook isn't the one to look as it doesn't guarantee that it's in any way useful - it's just a box ticking exercise to many that doesn't aid the goal of alert resolution. Something along the lines of the alert wiki template we had would be better, as that was a good structure though even then most weren't great.

I'd argue that it would be way more robust to only have one field then, of which the first sentence

That's an option, though that's how the subject/body split was done in email notifications and it always seemed to a little confusing. It's also messier in our use case as you might want a summary, some labels/metadfata and then the description which any auto-split will mess up.

From the responses, I derive the following sentiments for the various solution I could see now:

| Suggestion | :smile: | :neutral_face: | :cry: |
| --- | --- | --- | --- |
| only free-form key-value payload (*) | FR BR | JV | BB |
| description/summary, no payload | BB JV | FR | BR |
| description/summary/runbook, no payload | | BR FR | BB JV |
| description/summary + payload | JV | | BB BR FR |
| description/summary/runbook + payload | BR | | BB JV FR |

(*) implies no hardcoding of description/summary/runbook anywhere, everything is just parameters used by templates.

In summary, any solution with a mix of free-form key-value pairs and fixed values doesn't get much love (for different reason), so we kind of gravitate towards the opposing ends of the spectrum: either go back to just description/summary without any additional payload items configurable by the user, or go all-in and have only payload items.

The vote seems exactly split. We could try to grade the solutions against each other on a finer scale. I would like to argue, though, that the all-in payload solution would at least be stable, while the description/summary solution would degrade into one 75% hate once we _have_ to introduce user-configurable payload.

while the description/summary solution would degrade into one 75% hate once we have to introduce user-configurable payload.

I'm okay with introducing user-configurable payloads + summary/description once the use case arises. I'd rather not do so prematurely, as in the best case it'll cause confusion about the two types of labels and I'm not convinced yet that we'll end up needing it.

Updated table:

| Suggestion | :smile: | :neutral_face: | :cry: |
| --- | --- | --- | --- |
| only free-form key-value payload () | FR BR | JV | BB |
| description/summary, no payload | BB JV | FR | BR |
| description/summary + payload | JV BB(
*) | | BR FR |
| description/summary/runbook, no payload | | BR FR | BB JV |
| description/summary/runbook + payload | BR | | BB JV FR |

(*) implies no hardcoding of description/summary/runbook anywhere, everything is just parameters used by templates.

(**) once the need for payload is proven.

@brian-brazil

The first use case I already see is taking automatic action based on alerts. Let's not get into how dangerous this is and extremely hard to get right. But internally and externally this has been a very desired use case.

There will be demand for it and I'm confident that there will be cases where smart people can apply this in a safe manner. Custom payload is a very obvious way to pass parameters to a system handling such things.

We are advocating generic solutions that give the user a lot of (potentially dangerous) power everywhere – relabeling being a very extreme example of this. I just find it consistent to continue this pattern in Alertmanager.

The first use case I already see is taking automatic action based on alerts.

Why does that require non-grouping labels as compared to normal labels? I've always seen normal labels used without issue for such use cases.

To me it gives a clearer separation between labels that are describing my
alert and parameters I'm providing as input to another system.
More importantly, we would support templating in payload values. I haven't
seen such a system at a practical scale so I don't know how important that
is – but it might be. For example, summaries and descriptions are also just
input to another system in which we exactly want that.

I'm curious why we would go from "all power to the user, because there
might be a however-unlikely use case" to "no one should need that,
probably".

You don't want the runbook field, Bjoern wants it. You both feel strong
about it – we are going to have payloads if we want to find a solution
not completely dissatisfying either of you.

On Fri, Sep 25, 2015 at 12:38 AM Brian Brazil [email protected]
wrote:

The first use case I already see is taking automatic action based on
alerts.

Why does that require non-grouping labels as compared to normal labels?
I've always seen normal labels used without issue for such use cases.


Reply to this email directly or view it on GitHub
https://github.com/prometheus/prometheus/issues/1002#issuecomment-143068023
.

To me it gives a clearer separation between labels that are describing my
alert and parameters I'm providing as input to another system.

I'm not sure I really see that distinction, they're all input to the alertmanager at the end of the day.

The two technical aspects we need to worry about are what labels are used for grouping (which we'll likely need to configure at a prometheus level, as they'll usually be global labels), and do we want to fully support alerts where labels aren't constant relative to a particular instance of an alert beyond summary/description. Payload only helps with the latter, and there's knock on issues as you then can't route based on them.

More importantly, we would support templating in payload values.

We'll be supporting templating in all the alert labels and fields, it comes up in some advanced alert routing use cases - and sometimes you just want to tweak your alert labels and templates are easier than complex promql label operations. I've got someone working on this from the hackaton a bit back.

It doesn't come up for automatic alert-based actions in my experience. What you need there is a static label (which must be a normal label so it can be easily polled and routed) indicating that this is an alert the automated system needs to handle, and the usual labels provide the rest of the information.

I'm curious why we would go from "all power to the user, because there might be a however-unlikely use case" to "no one should need that, probably".

We are advocating generic solutions that give the user a lot of (potentially dangerous) power everywhere – relabeling being a very extreme example of this. I just find it consistent to continue this pattern in Alertmanager.

We're not advocating generic solutions that require the user to deal with lots of power, we're trying to keep things simple while ensuring that it's possible to handle advanced use cases users will meet as they grow.

Relabelling is a pragmatic design choice. Where every single user effectively has a different use case, we can't sanely do anything other than an extremely generic solution. This is not at all great for usability, but I haven't seen any other good options suggested. It's far better than what'd happen if we'd tried to support all these use cases via configuration options, we'd probably already be buried in spaghetti code and the associated user docs would be impenetrable.

Alerting doesn't have this problem, the domain of use cases is far smaller. What we have (with notification templates and non-grouping labels) should be sufficiently generic to handle all reasonable use cases, while still being usable out of the box for the vast majority of users. Accordingly there's no need to take the usability hit of adding more genericness on top of that. With fully generic payloads, alert notifications could no longer work well out of the box and it'd also be a major hit to hopes of ever providing standard alerts.

I'm quite confident that without payloads we can handle all reasonable use cases, as in all my years I've never come across a use case that'd require it. If I'm wrong we can always add it later.

we are going to have payloads if we want to find a solution not completely dissatisfying either of you.

I've gone over this above, that payloads aren't required for a solution to the runbook problem - normal labels are sufficient. Runbooks don't require the properties that those would have, which is why I want to wait for a good use case that actually requires payload as there's a significant hit to usability inherent to adding it.

If the goal of having runbook 1st class is education, I don't a big difference between

with { 
 severity="page"
 runbook="http://myrunbook"
}

and

with { 
 severity="page"
}
runbook "http://myrunbook"

They both seem about the same to me from the standpoint of teaching others to fill it out. Is there something I'm missing?

Those "labels" have no meaning whatsover regarding what the alert "is".
Updating a runbook link for a firing alert shouldn't spawn an essentially
new alert. Just as changing description and summaries doesn't do that. That
can hold true for lots of other data.

We'll be supporting templating in all the alert labels and fields

Even though I'm open to this I stated concerns about this before. Curious
who decided we are going to do it.

Of course alert notifications could still work out of the box. Just say
"description" payload is a field that is always interpreted by AM in a
certain way. That's what we are doing for a ton of labels, which have very
specific meanings and are even required in some cases.
To me there is no barrier with having it in indented by two spaces and in
curly braces – people can deal with a lot worse from configuring their
targets.
If we then add payloads later (I think someone will have a use case, we can
either reject them or implement), we just have mix which we agree to be a
very odd user experience.

On Fri, Sep 25, 2015 at 2:52 AM Brian Brazil [email protected]
wrote:

To me it gives a clearer separation between labels that are describing my
alert and parameters I'm providing as input to another system.

I'm not sure I really see that distinction, they're all input to the
alertmanager at the end of the day.

The two technical aspects we need to worry about are what labels are used
for grouping (which we'll likely need to configure at a prometheus level,
as they'll usually be global labels), and do we want to fully support
alerts where labels aren't constant relative to a particular instance of an
alert beyond summary/description. Payload only helps with the latter, and
there's knock on issues as you then can't route based on them.

More importantly, we would support templating in payload values.

We'll be supporting templating in all the alert labels and fields, it
comes up in some advanced alert routing use cases - and sometimes you just
want to tweak your alert labels and templates are easier than complex
promql label operations. I've got someone working on this from the hackaton
a bit back.

It doesn't come up for automatic alert-based actions in my experience.
What you need there is a static label (which must be a normal label so it
can be easily polled and routed) indicating that this is an alert the
automated system needs to handle, and the usual labels provide the rest of
the information.

I'm curious why we would go from "all power to the user, because there
might be a however-unlikely use case" to "no one should need that,
probably".

We are advocating generic solutions that give the user a lot of
(potentially dangerous) power everywhere – relabeling being a very extreme
example of this. I just find it consistent to continue this pattern in
Alertmanager.

We're not advocating generic solutions that require the user to deal with
lots of power, we're trying to keep things simple while ensuring that it's
possible to handle advanced use cases users will meet as they grow.

Relabelling is a pragmatic design choice. Where every single user
effectively has a different use case, we can't sanely do anything other
than an extremely generic solution. This is not at all great for usability,
but I haven't seen any other good options suggested. It's far better than
what'd happen if we'd tried to support all these use cases via
configuration options, we'd probably already be buried in spaghetti code
and the associated user docs would be impenetrable.

Alerting doesn't have this problem, the domain of use cases is far
smaller. What we have (with notification templates and non-grouping labels)
should be sufficiently generic to handle all reasonable use cases, while
still being usable out of the box for the vast majority of users.
Accordingly there's no need to take the usability hit of adding more
genericness on top of that. With fully generic payloads, alert
notifications could no longer work well out of the box and it'd also be a
major hit to hopes of ever providing standard alerts.

I'm quite confident that without payloads we can handle all reasonable use
cases, as in all my years I've never come across a use case that'd require
it. If I'm wrong we can always add it later.

we are going to have payloads if we want to find a solution not completely
dissatisfying either of you.

I've gone over this above, that payloads aren't required for a solution to
the runbook problem - normal labels are sufficient. Runbooks don't require
the properties that those would have, which is why I want to wait for a
good use case that actually requires payload as there's a significant hit
to usability inherent to adding it.

If the goal of having runbook 1st class is education, I don't a big
difference between

with {
severity="page"
runbook="http://myrunbook"
}

and

with {
severity="page"
}
runbook "http://myrunbook"

They both seem about the same to me from the standpoint of teaching others
to fill it out. Is there something I'm missing?


Reply to this email directly or view it on GitHub
https://github.com/prometheus/prometheus/issues/1002#issuecomment-143093231
.

Updating a runbook link for a firing alert shouldn't spawn an essentially
new alert. Just as changing description and summaries doesn't do that. That
can hold true for lots of other data.

That's true. I expect such updates to be quite rare though (how often do you rework your runbook layout? Every few years maybe?), and alerts usually aren't firing so it shouldn't be a problem in practice. Other sources of label churn such as changing target label structure will cause problems more often, and that one can't be avoided.

I'd be looking for something to churns in a problematic fashion at least monthly, and I'd expect any use cases to be more on the hour/minute level. If we can find one of those that we don't have another solution for, then I'm good with adding payloads.

One that just came to mind is alert routing based on time, which I think would be better done in the alertmanager itself or $PagerDuty. Actually it might have to be done in alertmanager, hmm...

Even though I'm open to this I stated concerns about this before. Curious
who decided we are going to do it.

I thought we had consensus on that one. I agree with you that we should be running the template parser on these at rule load time.

Of course alert notifications could still work out of the box. Just say
"description" payload is a field that is always interpreted by AM in a
certain way. That's what we are doing for a ton of labels, which have very
specific meanings and are even required in some cases.

If description is fully generic then I agree with @jrv it needs to be fully generic, and we can't specify such. The only special labels will be job and instance, I don't think we have any others.

Let's not look too closely at the current prometheus->alertmanager protocol, it needs cleanup.

we just have mix which we agree to be a very odd user experience.

It's not that odd, it's just tricky to explain to a beginner because it involves concepts well beyond what they should have to care about at that stage.

Thinking on the time case and semantics, what we might end up doing is rather than having a PAYLOAD keyword as part of alerts is to have them as normal labels in the alert statement and then some global configuration option for which labels to get the payload behaviour (non-grouping, don't appear in ALERTS/remote storage/federation, don't affect alert identity in prometheus). This would line up with how the non-grouping labels will have to work.

Yes, Brian, there is something you are missing. It's already spelled out here in the discussion.
But I'm not going to run in loops ever more.

I think the key is the question if there is a use case for what we call "payload" (for sure, it should have a different name in the final syntax of alert expression, but that's a different story we can think about later).

In fact, there have been use-cases mentioned here, even if Brian dismisses or ignores them.

  1. Alertbook links containing templated parts. This is a concrete use-case I worked with reasonably often at Google.
  2. “Enriched” alerts of any kind. I.e. you want some kind of additional information in a Slack notification or email, an embedded image or graph. It's not something somebody would be working on implementing right now, but it was sketched out here at SC after the need was expressed, and there are similar tools for Nagios alerts.
  3. Anything you _could_ put in a label because it does not change for every alert instance but you do not _want_ to put there because it changes occasionally and convoys the wrong semantics as a label.

From my perspective, these use cases are way more concrete than that mysterious "advanced routing" case that requires templating in labels.

To be clear, I'm not against templating in labels, but it's definitely something many users will use to shoot themselves in the foot, which on the other hand will be way harder with user-defined payload. Or, FWIW, a 1st class runbook entry... So I don't really understand the resistance against the user-defined payload or the runbook entry while at the same time going with templated labels without remorse for a pretty esoteric use-case that could even be implemented with normal relabeling...

As concluded before, a 1st class runbook entry has already been ruled out, so we don't really need to discuss pros and cons of that anymore. It's between (1) go fully in on generic payload or (2) go back to the old state (description/summary) with the option of adding user defined payload when needed.

I see the use-case for a user-defined payload already proven (especially as we won't have a dedicated payload field, see item (1) above). So we _will_ introduce user-defined payload eventually, I'm very sure of that. So the real decision is between line 1 and 3 in the solutions table I have created yesterday. Between those two, the generic approach has more love.

On Fri, Sep 25, 2015 at 12:41 PM, Brian Brazil [email protected]
wrote:

Thinking on the time case and semantics, what we might end up doing is
rather than having a PAYLOAD keyword as part of alerts is to have them as
normal labels in the alert statement and then some global configuration
option for which labels to get the payload behaviour (non-grouping, don't
appear in ALERTS/remote storage/federation, don't affect alert identity
in prometheus). This would line up with how the non-grouping labels will
have to work.

That sounds like a good idea to think about. It would hide the arcane
decision from the beginners, but would give advanced users the change.
Everything would be in the WITH clause, which satisfies my desire for clean
syntax (and not marking a runbook link as an "odd one out"). We could have
a reasonable default for the global setting.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales with
Company No. 6343600 | Local Branch Office | AG Charlottenburg | HRB 110657B

Yes, Brian, there is something you are missing. It's already spelled out here in the discussion.

I'm sorry if I'm missing it, you're presenting many arguments and lots of points of varying relevance. I have no idea what exactly you're referring to here.

Alertbook links containing templated parts. This is a concrete use-case I worked with reasonably often at Google.

From my experience at Google, we don't need payload to handle this as things worked fine without it.

“Enriched” alerts of any kind. I.e. you want some kind of additional information in a Slack notification or email, an embedded image or graph.

Similarly I suspect we don't need payload to handle this. Nothing went very far in this direction though as IRC doesn't go graphs.

Anything you could put in a label because it does not change for every alert instance but you do not want to put there because it changes occasionally and convoys the wrong semantics as a label.

Hmm, I wonder if there's an anti-pattern of putting too many bits of metadata in an alert similar to having inappropriate labels on a target? I guess we'll find out in time.

To be clear, I'm not against templating in labels, but it's definitely something many users will use to shoot themselves in the foot, which on the other hand will be way harder with user-defined payload.

Agreed that many will shoot themselves in the foot - though I expect usage to be uncommon. I expect even more to shoot themselves in the foot with a payload due to not understanding the distinction between the two classes of labels. The idea of the alertmanager routing will be complex enough already, without adding the concept of all three classes of labels to it.

So I don't really understand the resistance against the user-defined payload or the runbook entry while at the same time going with templated labels without remorse for a pretty esoteric use-case that could even be implemented with normal relabeling...

I'm against payload as it's seems to be unnecessary complexity at the moment. I'm against runbook as it doesn't fit conceptually - I could come up with several related fields that I could use the same justification for as you're using for runbook and some of which I consider more important than runbook. None of these have the technical justification that summary/description do.

While it's in theory possible to do the same as templating with promql (I think?), you're entering Turing tarpit territory for some of the simpler use cases (the advanced use cases are simple things like for loops to build a list of people to email/page), as string manipulation is not what promql is designed for.

I see the use-case for a user-defined payload already proven

I dismiss your use cases as are they are currently vague. If you can provide a solid use case I'll change my mind, but right now I have strong evidence from experience that this feature isn't needed and no use cases, so I think it's a stretch to say it's proven.

That sounds like a good idea to think about. It would hide the arcane
decision from the beginners, but would give advanced users the change.

Yeah, I'm liking it for that reason.

Everything would be in the WITH clause, which satisfies my desire for clean
syntax (and not marking a runbook link as an "odd one out"). We could have
a reasonable default for the global setting.

Everything except description and summary would be in the with clause. As explained above these map well to notifications, and it won't be possible to generate good notifications out of the box without them.

On Fri, Sep 25, 2015 at 2:01 PM, Brian Brazil [email protected]
wrote:

Everything except description and summary would be in the with clause. As
explained above these map well to notifications, and it won't be possible
to generate good notifications out of the box without them.

I would still strongly prefer to have description and summary in the WITH
clause, too. Otherwise, we are back at the beginning with runbook. To mock
the style of our discussion: "As explained above, RUNBOOK maps well to
notifications, and it won't be possible to generate good notifications out
of the box without it."

I guess, should we agree on the "everything in WITH" approach, we would
then have the discussions which keys should be on the default "go to
payload" list, which needed to be at least "description, summary, runbook"
IMHO, but we would again have "magic" strings hardcoded in the default.
Which is my main concern about the "only WITH" approach.

WRT use-cases for payload: Once more, we have to agree to disagree. I
disagree with your assessment of the use-cases and continue to find them
proven and valid. However, I don't see any point in repeating the same
arguments already provided or to iterate even deeper in explaining why we
consider which arguments more or less relevant or valid. It looks we are
not able to convince each other.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales with
Company No. 6343600 | Local Branch Office | AG Charlottenburg | HRB 110657B

"As explained above, RUNBOOK maps well to
notifications, and it won't be possible to generate good notifications out
of the box without it."

That's not a good analogy, no notification method we currently support has something that looks like a runbook field.
It might map well within your organisation, but as with relabelling that doesn't mean it gets explicit support if you can handle the use-case via the more generic methods that are provided.

we would again have "magic" strings hardcoded in the default.
Which is my main concern about the "only WITH" approach.

I share that concern, to the extent that I believe that once you work through the implications that it's impractical from a usability standpoint. The minimum to avoid that is that summary and description must be kept separate.

I disagree with you assessment of the use-cases and continue to find them
proven and valid.

Can you provide a fleshed out example of where this would be required?

On Fri, Sep 25, 2015 at 3:29 PM, Brian Brazil [email protected]
wrote:

"As explained above, RUNBOOK maps well to
notifications, and it won't be possible to generate good notifications out
of the box without it."

That's not a good analogy, no notification method we currently support has
something that looks like a runbook field.

I already said that this is a bug to be fixed.
But I wasn't planning to start the loop again. I tried to mock our style of
discussion. As a response, you continued with the style of discussions I
tried to mock.

I share that concern, to the extent that I believe that once you work
through the implications that it's impractical from a usability standpoint.
The minimum to avoid that is that summary and description must be kept
separate.

I have an idea how to deal with it in a neat way, but I think we should
discuss separately as it doesn't really touch the two decisions we have to
make right now: Do we need user-defined payload in the first place? Do we
want description and summary to be special enough to be hardcoded
throughout the stack?

I disagree with you assessment of the use-cases and continue to find them
proven and valid.

Can you provide a fleshed out example of where this would be required?

But I did. I described a real life use-case I found extremely relevant. You
declared it as irrelevant. I disagree with that assessment. Yes. No. Yes.
No. Back in the hamster wheel.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales with
Company No. 6343600 | Local Branch Office | AG Charlottenburg | HRB 110657B

I already said that this is a bug to be fixed.

That most implementations of notification methods don't currently render runbook is not that I'm referring to.

That no notification providers have a field called "runbook" or "playbook" or similar is what I mean here (e.g. email has subject and body; hipchat has color, message, notify and message_format). Put another way runbooks aren't an abstraction that helps us to provide good defaults for the various notification providers, whereas summary/description do (mapping to subject, body and message above).

If a significant number of notification providers did have a field where you'd think "that'd be a natural place to put a runbook" then you'd have a good argument on these grounds.

Do we need user-defined payload in the first place?

Maybe, and the global config method is probably the best way to do it if/when it comes up.

Do we want description and summary to be special enough to be hardcoded
throughout the stack?

Yes.

I described a real life use-case I found extremely relevant.

Which use case was that?

For real-life use cases I see the following:

  • "I want a notification template that creates big and fat red letters "MISSING RUNBOOK" if there is no runbook." which does not require payload as it's a notification template feature.
  • "I want a way to automatically identify alerts that have no runbook." which does not require payload, it's more a offline tooling/AST parsing issue than a runtime one.

Stretching a little in terms of solidness "Alertbook links containing templated parts." also doesn't require payload as labels and/or description cover that depending on the exact details.

I was talking about use-cases for user-defined payload. For me and Fabian it's obvious that we'll need it. Julius seems to be undecided, Brian doesn't see a concrete use-case yet.

Thanks for explaining what you meant when referring to the runbook field in notification mechanism.

First of all: Obviously, if I ran pagerduty, there would be a big and fat runbook field. Good parts of Prometheus is about finally making things right that had been done wrong all the time. But yeah, if I'm the only one in the Prometheus core team thinking every alert must have a runbook, I'll have a hard time.

More important, though, is that it's not really a direct mapping. In fact, we are always processing summary and description in some way, adding other text or the (non-user provided) payload, concatenating or dropping. We essentially have a level of templating in between, currently not configurable templating, but we'll introduce that. What I don't want is to drag that templating into an omnipotent DESCRIPTION field, where it has to be repeated for every alert. We certainly need some level of templating in the alert expression itself, but that should be alert specific and not repeat from alert to alert all over.

I was talking about use-cases for user-defined payload. For me and Fabian it's obvious that we'll need it. Julius seems to be undecided, Brian doesn't see a concrete use-case yet.

From my standpoint I think that if we're going to add it we should first add alert notification templates, wait a few months for people to get used to their power, and then add the feature.

Adding it before it's needed is risky, and in the best case would result in users tying themselves in knots by applying more of the generic options we offer than they need. Every feature we have is a potential liability in that it's something a user can mess up or fixate on, and most of them will give up before they get as far as our IRC channel.

Good parts of Prometheus is about finally making things right that had been done wrong all the time. But yeah, if I'm the only one in the Prometheus core team thinking every alert must have a runbook, I'll have a hard time.

I'm strongly in favour of pointing users down the right path, but we need to balance that so we don't scare away or irritate users. There's no point doing all the right things in such a way that there's so much cognitive load that users balk, as then we have have failed in our goal of encouraging users to do the right thing.

Having a useful runbook is a good practice and we should encourage that, I don't think an alert expression field as currently implemented is a productive way to go about that. It adds friction to creating alerts which is already tricky enough as-is due to the power of promql and the data model. With the runbook field either the user already cares and is happy that it seems to help, or they don't care and are frustrated by having to spend time thinking about the unnecessary field - converts to using runbooks are going to be very rare via this approach.
This is why I think even description and summary should be optional (and I'm mulling over your idea of merging them too). We want users focused on the expression and writing useful alerts when writing their first 10-20 alerts, and giving them an easy way to get a bit of text attached as that's also common. It's not the time for the HA options and oncall theory, that all comes later.

I'd be thinking more along the lines that our alert notification examples would include "Runbook: http://mywiki/{{.Alertname}}" and use that as a selling point of how powerful the alertmanager is while also hooking users into the idea of runbooks by saving them having to configure it per-alert. It's a bit of a bait-and-switch - they never wanted runbooks, but it's so cool and easy to do that they'll apply it. Now the question isn't "what is this field, why would I fill it out?" but rather "why doesn't this link work", and that's a much stronger place to work from - and the user has a working prometheus+alertmanager setup so they're more likely to stick around and pick up more best practices.

Just having the field isn't enough, it needs to be part of a broader narrative about how Prometheus is powerful/easy to use and thus makes our user's lives easier. Hmm, that'd make a nice blog post.

More important, though, is that it's not really a direct mapping. In fact, we are always processing summary and description in some way, adding other text or the (non-user provided) payload, concatenating or dropping. We essentially have a level of templating in between, currently not configurable templating, but we'll introduce that.

Yeah, this will all be clearer once we have the notification templating.

What I don't want is to drag that templating into an omnipotent DESCRIPTION field, where it has to be repeated for every alert.

That's a valid concern. I don't consider it a problem in the runbook case as that should be managed in a notification template rather than having humans copy&paste into every alert - but the other cases you're hinting at are more interesting. Notification templates will let you not have to repeat lots of stuff for every alert if the repetition would be in the description.
Where I see potential use cases arising is more where there's automated processing going on, and I'm reserving judgement there until we have notification templates and I see users starting to drop non-trivial amounts of metadata into description for which we don't have a better solution (my suspicion is that from a general design standpoint such use cases are better handled by polling prometheus than getting notifications from the alertmanager).

We certainly need some level of templating in the alert expression itself, but that should be alert specific and not repeat from alert to alert all over.

Yeah, if you're copying the similar stuff from alert to alert then there's probably a better way to do that. Likely by moving whatever it is into the notification template.

Many words have been spilled, I shall add some more. Here are my thoughts on this:

I would prefer a "not so special" collection of arbitrary fields to be passed to Alertmanager (or any other webhook system integrators might want). In this sense, summary, description, and runbook can and should be fields among many.

I feel very strongly that these are _not labels_. That would be horribly confusing, conceptually and in the UI. A label is a dimension, with a finite set of values, for aggregation and routing. A summary or a description are information attached to a particular occurence of an event, not something that groups or spans events. For reference, Kubernetes has _labels_ for grouping and identification (as Prometheus), and _annotations_ to attach arbitrary data to objects, under more or less well known keys (but optional to consume).

However, given a complete blank slate, what should AlertManager show? To reasonably display alerts in the UI, AM needs to know _something_ about the structure of what it's getting. It could simply expect a fixed set of annotations, but then we have gained nothing and made the expectation oblique on the Prometheus side. We would fall into the UNIX trap of providing too little structure and making sense of it ad hoc and inconsistently.

Should AlertManager expect nothing? But we want a new user to get up and running with a reasonable setup without hassle.

As a "middle way" that preserves full flexibility for advanced setups, works out of the box, but has no implicit dependencies, I propose the following:

AlertManager allows templating bits and pieces of display (for example, a "title", "short" and "long" display option defined as Go templates). They are used in the UI and notification templates (to reduce the need to duplicate). This is kind of in line with the consoles in Prometheus. It _ships_ with a set of simple templates for these fields, which _do_ expect certain annotations (say, subject and body), similar to how the consoles expect certain metrics and labels.

On the Prometheus side, no annotation is treated specially, but all examples and documentation include these expected-by-default fields, and link to the AlertManager documentation for the templating, which lays out the dependencies of the default templates.

This makes the integration _structured_ and _flexible_ at the same time – out of the box, by taking and modifying Prometheus examples, you will have a reasonably working setup immediately. It is clearly documented what the assumptions of this setup are. Advanced installations (like SoundCloud) have the option to extend or completely change which fields are expected and what they mean. We could even contribute a blog post that demonstrates how to do this in practice with the example of runbooks.

This seems like a good analysis, and a good approach for experts who are familiar with the challenges of managing this type of system.

Something to keep in mind is that that the features Prometheus offers and that the Alertmanager will offer are far beyond most other things out there in terms of power and flexibility. This is our strength, but also our weakness. We should take care when doing things that make the learning curve steeper.

Thus I worry that the approach you propose is far too generic to expose to someone new to the system who just wants to send an alert. A potential user used to something like Nagios and Graphite is going to have challenges with comprehending Prometheus generally, and doesn't care about high minded notions like annotations vs labels. You're proposing to link them off to the relevant reference docs. This would cause confusion for beginner users as we point them towards the full complexity of the alerting system, and lead some of them to shoot their foot off.

I don't think that's at all a good tradeoff, and in the worst case could end up with us having a reputation of having an alerting system that's too arcane to use.
As a matter of practicality we need something summary/descriptionish so that we don't have to expose users to what is unnecessary complexity for most use cases.

To give you an idea of the way I'm thinking: With #1043 I'm hoping to be to remove all mention of summary and description from beginner examples. This will make things even easier to get off the ground with reasonable notifications than it is today. This doesn't work with your proposal as now summary and description are now effectively mandatory again in docs.

Every line of config and every concept we have to include in examples and docs reduces adoption (think Java's hello world vs Python, one of those takes a semester to understand). We're a new project in a very active space, we're not in a position where we can risk losing potential users for the sake of formal correctness alone.

So with #1043, someone new to Prometheus won't see either summary or description. With that premise, why would not having these two fields as first class attributes change anything in adoption for beginners? Beginners would not get into contact with these fields in either way. As soon as they want to customize alerts further, they'll have to consult the documentation anyway. I don't see why the specific syntax used (first class attributes vs. just labels) will make any difference or make things anymore confusing.

:+1: to treat all three fields the same.

With that premise, why would not having these two fields as first class attributes change anything in adoption for beginners?

With Matthias's proposal they would see have to see those attributes in all examples, so that does change something in adoption.

If we ignore that bit, it causes a problem when they to to the next stage beyond simple alerts where they want to tweak the notification a little bit. Now rather than having a field that likely covers their need, they've instead got theory to learn and choices to make. More concretely rather than a user thinking "I'll add it to description" they now think more along the lines of "should I put this in description, or a label or in payload and what do I name them and how do I lay out my notification templates".

Similarly if they want to do something simple with routing (which is a common initial use case) they have to learn that only labels work for that, and we'll start to get questions and confusion around why payload doesn't as syntactically they're identical in the alert expression. Most beginner alerts wouldn't even have a "WITH" clause, so users would think that "PAYLOAD" was the place to put labels for routing as it looks like a place to put labels. Trying to prevent that requires more docs, which means more reading and confusion, which means less adoption.

As soon as they want to customize alerts further, they'll have to consult the documentation anyway. I don't see why the specific syntax used (first class attributes vs. just labels) will make any difference or make things anymore confusing.

Yes, the question is at what stage we require them to consult the docs and how complex those docs are. With Matthias's proposal they have to read them much sooner and they're more complex. I'd like to push out the need to read the full docs as far as possible and try to limit how may concepts docs have to explain, particularly where they're only really relevant to the more advanced use cases.

If you look at target relabelling, we're already seeing these type of problems due to the system being too generic and difficult to learn for a new user. While I still believe that is the least-worst decision in that case, let's not introduce another one of those into the system without being sure it'll overall lead to an increased number of happy users. What's being proposed here is a feature that's mainly of use to advanced users, we shouldn't be forcing beginner and intermediate users to have to deal with complexity that doesn't apply to their use case when we've a reasonable way to avoid that.

I'll admit that I haven't read this whole thread, but I've read a large portion of it. Let me offer my perspective as a naive, basic user.

When I started looking for an alerting system for work, I went through several options before settling on Prometheus. Nagios has a lot of complexity. It was hard just to get installed. Then I went to Icinga2. That was easier to setup and install, but still not great with my companies restrictive access. It then took another day to figure out that I'd still have to open a bunch of ports to get metrics out. Prometheus was dead simple to install. It requires very little effort to configure.

When I started to add alerts, the docs were very easy to follow. There are still gaps in the docs, but experience easily filled in those. However, there is something lacking. I think that is what we're covering here.

As most enterprise user, I live in a world that has both old and ancient technology. Some of it written in-house. This gives me some more complex cases with respect to the information these systems need.

Our alerting/paging system requires a long complex string to be sent as an SNMP trap or through a tcp connection. I'm currently passing this through the description, but that makes my emails look cryptic. Perhaps I'm just not architecting it correctly. This string is populated by Prometheus, but I could feasibly do this in my webhook receiver.

Either way, I feel like another option for additional payload is missing. When I first saw the runbook attribute, I had a good idea what it was, but it wasn't clear how to use it. Payload makes more sense to me assuming it's optional and the internal attributes are flexible. It also needs to support templating.

Thanks for the insight, this sounds like a concrete use case.

I'm currently passing this through the description, but that makes my emails look cryptic. Perhaps I'm just not architecting it correctly. This string is populated by Prometheus, but I could feasibly do this in my webhook receiver.

That's the way to do it currently, and as long as you're talking non-binary strings I think this is something we should probably cover within Prometheus. Once we have notification templates there'll be ways to clean up the emails (regexes), but that's obviously a bit messy.

What sort of information is in this string? I want to verify that this isn't something that will be covered by notification templates (roughly speaking, if you're not using query you will probably be able to handle this entirely in the alertmanager).

It also needs to support templating.

I think everyone is onboard with that.

Presuming this works out as a use case, the debate then is around a) how we expose this to users configuration-wise and b) keeping summary/description first class.

I haven't been able to get back to my work laptop to pull the actual
string, but I inject the application name and instance into the string. I
also add a custom message and error name. It also includes the type of
event and some general application information. The SNMP trap string is
just a standard format. Let me know if you want to see the actual format of
each.
On Sep 26, 2015 13:19, "Brian Brazil" [email protected] wrote:

Thanks for the insight, this sounds like a concrete use case.

I'm currently passing this through the description, but that makes my
emails look cryptic. Perhaps I'm just not architecting it correctly. This
string is populated by Prometheus, but I could feasibly do this in my
webhook receiver.

That's the way to do it currently, and as long as you're talking
non-binary strings I think this is something we should probably cover
within Prometheus. Once we have notification templates there'll be ways to
clean up the emails (regexes), but that's obviously a bit messy.

What sort of information is in this string? I want to verify that this
isn't something that will be covered by notification templates (roughly
speaking, if you're not using query you will probably be able to handle
this entirely in the alertmanager).

It also needs to support templating.

I think everyone is onboard with that.

Presuming this works out as a use case, the debate then is around a) how
we expose this to users configuration-wise and b) keeping
summary/description first class.


Reply to this email directly or view it on GitHub
https://github.com/prometheus/prometheus/issues/1002#issuecomment-143477451
.

@brian-brazil, here's the format for my alerts:

The event format is the word "EVENT" followed by key=value pairs separated by "~%~".  The keys you will be sending are:
    msg_class=DST_Trap
    app_host=<servername>
    app_name=<your app name>
    app_object=<reference object for alert>
    app_severity=<severity: one of CRITICAL, MAJOR, MINOR, WARNING, INFO, OK>
    app_client=<client if applicable>
    app_eventcode=<event code>
    app_eventmsg=<event message>

And this is the pertinent part of my alert rule:

DESCRIPTION "EVENT msg_class=DST_Trap~%~app_host={{$labels.instance}}~%~app_name={{$labels.job}}~%~app_object={{$labels.task}}~%~app_severity=CRITICAL~%~app_client=NA~%~app_eventcode=ANALYTICS_CALC~%~app_eventmsg=Error in analytics calculations"

I would prefer a method of passing this string as extra data as this same description is sent to the alerting system and through email and Slack which are not part of the alerting system. Alternatively, I'd be very happy with a way of passing other metadata information that can be added to the body of an email or as extra comments on Slack.

I may also need to send this string to handle any situations where I need to use SNMP instead of TCP:

1.3.6.1.4.1.2329.1.30.1 string "{{$labels.instance}}" 1.3.6.1.4.1.2329.1.30.2 string "{{$labels.job}}" 1.3.6.1.4.1.2329.1.30.3 string "{{$labels.task}}" 1.3.6.1.4.1.2329.1.30.4 string "CRITICAL" 1.3.6.1.4.1.2329.1.30.5 string "NA" 1.3.6.1.4.1.2329.1.30.6 string "ANALYTICS_CALC" 1.3.6.1.4.1.2329.1.30.7 string "Error in analytics calculations"

I'm not sure yet how I'll input the numbers, but again this doesn't look nice to a human and I'd rather pass more meaningful data. Adding more lines to the description section will just make it hard to parse out the piece I need for our alerting system.

The inevitability of user-defined payload looks more convincing than ever...
We can still discuss how to implement it syntactically to make it a nice and non-confusing as possible (I have some ideas), but please assume it will happen while discussing if description and summary should be hard-coded and keywords in the syntax.

The decision is irrelevant for the super-naive user who doesn't use description or summary at all.

For the description and/or summary using user, it's the difference between

  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."

and

  PAYLOAD {
    summary = "Instance {{$labels.instance}} down"
    description = "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
  }

with all the gotchas about detecting typos on different levels. (Note that PAYLOAD is probably not a good word, we need something better, or implement other ideas, but I don't want to add those to this already quite convoluted discussion.)

The concern I have is that for those who actually need payload, it's between

  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
  PAYLOAD {
    foo = "some template"
    bar = "another template"
  }

and

  PAYLOAD {
    summary = "Instance {{$labels.instance}} down"
    description = "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
    foo = "some template"
    bar = "another template"
  }

The former syntax suggests that summary/description are special, but they are not special for those who need their own payload anyway.

We essentially need to weigh the degree of confusion for new/naive users by the first comparison against the degree of confusion for ambitious/advanced users by the second comparison.

I personally see the former as negligible but the latter as quite confusing and warty.

Hmm, most of what you're doing there belongs in the alertmanager rather than Prometheus I'd say as you're mainly formatting a custom notification string rather than something specific to that alert. The only wrinkles there are app_eventcode, which would be plausible as a label, and app_eventmsg which sounds like the summary. I _think_ that notification templates will cover your use case.

The SNMP one is more interesting, my suspicion is that that's so custom that it'd have to go via the webhook and the rest per the TCP string.

There's the other option of

SUMMARY "Instance {{$labels.instance}} down"
DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
WITH  {
  foo = "some template"
  bar = "another template"
}

and configuration the the global prometheus level that foo and bar are special (which we'll have for non-grouping labels too, as those can't be configured per-alert).

The former syntax suggests that summary/description are special, but they are not special for those who need their own payload anyway.

I'm not sure about that, I'd still expect alerts using this to also go out via notifiers using summary/description in the usual fashion. Payloads are about additional information.

After 85 comments, many considerations and opinions, I now propose the following solution.

I encourage everyone to simply state :+1: or :-1: so this can function as a vote for a final conclusion. Pros and cons have been discussed extensively above.

  • The RUNBOOK field will be removed as requested.
  • The SUMMARY and DESCRIPTION fields will be removed.
  • The WITH clause will be renamed into LABELS.
  • The ANNOTATIONS clause is added to allow free-form key/value data that are not considered in the identity of an alert.
  • Values in LABELS and ANNOTATIONS are template-processed.

Example:

ALERT InstanceDown
  IF up == 0 FOR 10m
  LABELS {
    severity = "critical"
  }
  ANNOTATIONS {
    summary = "instance {{$labels.instance}} down"
    description = "instance {{$labels.instance}} has not been reachable for more than 10 minutes"
  }

The new Alertmanager will have configurable handling of annotation fields with reasonable defaults, e.g. summary and description.
The goal is no more configuration than previously for simple setups.

:+1: Yes. I've been leaning both ways for a while and unfortunately it won't be possible to make everyone happy on this issue, but this now looks like the best way forward for me.

The ANNOTATIONS clause is added to allow free-form key/value data that are is considered in the identity of an alert.

I presume you mean NOT part of the identity of the alert here, otherwise it'd be the same as labels.

For the sake of closing this off :+1:

This implies several breaking changes, which is okay as we're talking the alertmanager. If we're going to do this it's we'll also have to redesign the json that goes between prometheus and the alertmanager (basically switch what's currently outside and inside the "payload" field) and hopefully avoid having to break that interface again.

You are correct, fixed.

:+1:

I was thinking for quite some time about the mental barrier to understand the difference between labels and annotations, and I had some ideas how to hide it to the newcomer, but in the end, I came to the conclusion that clearly separating it is ultimately the best solution. A well worded explanation seems possible ("shows up in your alert timeseries" and "is part of the identity of the alert" are after all quite obvious concepts).

And yes, the JSON format has to be updated, too, but that was obvious for a while...

Yep, this change should go hand in hand with the Alertmanager rewrite, which needs a breaking of the transfer format anyways. So with the Prometheus version that introduces this, people will have to upgrade to the new Alertmanager in lockstep.

And: \o/ for reaching a decision!

Yep, this change should go hand in hand with the Alertmanager rewrite, which needs a breaking of the transfer format anyways.

Actually I think we do this one sooner, what I'd be tying this to more is the notifications templates.

If we do that, we will break Prometheus<->AM twice instead of once. Not so happy about that, I mean, it's not that urgent or is it?

I'm not sure why we'd need to break it twice, I'm not forseeing any fundamental structural changes to the interaction that'd require breaking changes, it's more a cleanup of the existing API. It'd also be good not to block notification templates on the rewrite.

From #1128 we see that users are already having problems with the lack of notification templates and working around them which is causing other issues, so I'd like to get notification templates in at least in inside the next month.

With some effort we can have a new AM by then that has the new features we want and the persistence of the current one.

I see the notification stuff as orthogonal to the new AM, the only wrinkle is the API changes this issue requires.

AIUI the new AM will require other API changes besides those fields - @fabxc will know more. If that's true, it would create a second breaking change to the transfer format.

A small compatibility layer in the AM is probably best then. No point in a breaking change in notification templates too.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Was this page helpful?
0 / 5 - 0 ratings