Alertmanager: Feature Request: Auto-expire silences upon alert resolution

Created on 25 Oct 2017 · 6 Comments · Source: prometheus/alertmanager

Currently, silences can auto-expire after a set time duration from when the silence was created.

However, a good estimate for the duration is not always known in advance (e.g. "will this issue take 30 minutes or 3 hours to fix?").

I feel users would benefit from having the option to expire a silence once the underlying alerts suppressed by it resolve. That is, give the user a choice between time-based expiration (as is done today) and expiration once all the underlying alerts have resolved. This way a user can say "I am working on the issue and don't need any further notifications, and when I'm done putting out the fire, I don't also need to remember to go back and remove any silences I created for it".
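
A minimal sketch, in Python, of the decision an external tool could make to approximate this behaviour. It assumes the silence and alert objects have the shape returned by Alertmanager's v2 HTTP API (`GET /api/v2/silences` and `GET /api/v2/alerts`), where each alert lists the IDs of the silences suppressing it under `status.silencedBy`. The function name `silences_to_expire` is hypothetical and not part of Alertmanager.

```python
from typing import Iterable, List


def silences_to_expire(silences: Iterable[dict], alerts: Iterable[dict]) -> List[str]:
    """Return IDs of active silences that no longer suppress any alert.

    Hypothetical helper: both arguments are assumed to be the decoded JSON
    lists from the Alertmanager v2 API (silences and alerts respectively).
    """
    # Collect every silence ID that still suppresses at least one alert.
    in_use = set()
    for alert in alerts:
        in_use.update(alert.get("status", {}).get("silencedBy", []))

    # Keep only silences that are currently active and unused.
    return [
        silence["id"]
        for silence in silences
        if silence.get("status", {}).get("state") == "active"
        and silence["id"] not in in_use
    ]


if __name__ == "__main__":
    example_silences = [
        {"id": "s1", "status": {"state": "active"}},
        {"id": "s2", "status": {"state": "active"}},
        {"id": "s3", "status": {"state": "expired"}},
    ]
    example_alerts = [
        {"status": {"state": "suppressed", "silencedBy": ["s1"]}},
    ]
    print(silences_to_expire(example_silences, example_alerts))
```

The returned IDs could then be expired with `DELETE /api/v2/silence/{silenceID}`. Under this rule a silence created ahead of time, before any alert matches it, would also be selected, so a real tool would need some way to opt silences in.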

All 6 comments

> and when I'm done putting out the fire, I don't also need to remember to go back and remove any silences I created for it.

The existing time duration support for silences covers this. I'd usually set silences for a few hours initially, so even if I forget to remove the silence it won't hang around for long.

In general, the fact that alerts have stopped firing does not indicate that the issue is resolved; only human investigation can establish that. Doing as you suggest would result in spamming the oncall if an alert flapped.

Another use case could be: an alert triggered, and the resolution will end up taking days (e.g. someone from a different group will replace hardware in the next few days, at which point the service will come back online and should be monitored again).

> Doing as you suggest would result in spamming the oncall if an alert flapped.

The time-based approach seems susceptible to this as well, e.g.:

> I'd usually set silences for a few hours initially

And let's say you're deep into troubleshooting the issue. You've got a lot going on, and your focus is not "let me remember to context switch out to extend the time duration of my silence". So it expires and then you're alerted again (unnecessarily).

Just to add another data point, I picked this link at random (I have never used the product): Sensu silencing. It seems to support both:

  • Expiration after a specified number of seconds
  • Expiration after check returns to OK state (resolves)

So I don't think it is a totally crazy idea to have both options.
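
To make the "both options" idea concrete, here is a small Python sketch of a silence policy with two expiry modes. Everything in it (`ExpireMode`, `SilencePolicy`, `should_expire`) is hypothetical and exists in neither Alertmanager nor Sensu. It also assumes that an on-resolve silence still carries an end time as an upper bound.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class ExpireMode(Enum):
    AFTER_DURATION = "after_duration"  # today's behaviour: expire at ends_at
    ON_RESOLVE = "on_resolve"          # proposed: expire once nothing is suppressed


@dataclass
class SilencePolicy:
    mode: ExpireMode
    ends_at: datetime  # assumed upper bound that applies in both modes


def should_expire(policy: SilencePolicy, now: datetime, suppressed_alerts: int) -> bool:
    """Decide whether a silence should be expired at time `now`."""
    # The deadline always wins, regardless of mode.
    if now >= policy.ends_at:
        return True
    # In on-resolve mode, also expire as soon as no alert is being suppressed.
    return policy.mode is ExpireMode.ON_RESOLVE and suppressed_alerts == 0


if __name__ == "__main__":
    now = datetime(2017, 10, 25, 12, 0, tzinfo=timezone.utc)
    policy = SilencePolicy(ExpireMode.ON_RESOLVE, now + timedelta(hours=3))
    print(should_expire(policy, now, suppressed_alerts=2))
    print(should_expire(policy, now, suppressed_alerts=0))
```

This sketch does not address the flapping concern raised earlier in the thread: an alert that resolves briefly would expire the silence immediately.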

This doesn't make sense in terms of how Prometheus alerting is meant to be used, and additionally wouldn't work with AM clustering.

There is a third party implementation: https://github.com/prymitive/kthxbye

@roidelapluie awesome
