Alertmanager: Expired silences are not garbage collected after -data.retention has passed.

Created on 15 Sep 2017 · 16 Comments · Source: prometheus/alertmanager

I'm using the API to create a silence, and afterwards delete it.
The deleted silence goes to expired.
So potentially I could have hundreds of expired silences.
Is there a way to delete expired silences, or some other way to handle this case?
I'm on version 0.7.1.
Thanks,
Zarko

Labels: component/silences, kind/bug

All 16 comments

Expired silences will be automatically garbage collected after ~14 days.
The reason is that it's good to still see them afterwards. For example to
analyse why certain alerts were not sent or to re-create a silence.

Thanks! Didn't know that.

@beorn7 you reported that soundcloud has several thousand expired silences -- can you check to see if they're being GC'd? The retention is set by -data.retention, with a default value of 5 days.

@fabxc Looking at the code it looks like 5 days is the default time, did I miss something that sets it to 14 days?

No, I was honestly just guessing – has been a while since writing that code.

Not a problem, just wanted to make sure I wasn't crazy

The flag -data.retention does indeed have a default value of 120h (== 5d). We're running AM 0.8.0 with the default flags, and we have accumulated 4341 expired silences by now, most of which definitely expired more than 5d ago. Or 14d, FWIW.
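
The arithmetic behind that expectation can be sketched as follows. This is a minimal illustration, not Alertmanager's code: the `silence` type, the `gcDeadline` and `gcEligible` helpers, and the rule "eligible once `EndsAt + retention` has passed" are assumptions made for the example. Only the 120h default comes from the thread.

```go
package main

import (
	"fmt"
	"time"
)

// retention mirrors the default of -data.retention mentioned above (120h).
var retention = 120 * time.Hour

// silence is a hypothetical, stripped-down record used only for this sketch.
type silence struct {
	ID     string
	EndsAt time.Time
}

// gcDeadline assumes a silence may be dropped once the retention period
// has elapsed after the silence ended.
func gcDeadline(s silence, retention time.Duration) time.Time {
	return s.EndsAt.Add(retention)
}

// gcEligible reports whether the assumed deadline has passed at time now.
func gcEligible(s silence, retention time.Duration, now time.Time) bool {
	return now.After(gcDeadline(s, retention))
}

func main() {
	ended := time.Date(2017, time.September, 15, 10, 0, 0, 0, time.UTC)
	s := silence{ID: "example", EndsAt: ended}

	// Check eligibility a few days after the silence ended.
	for _, days := range []int{4, 6, 14} {
		now := ended.Add(time.Duration(days) * 24 * time.Hour)
		fmt.Printf("%d days after expiry: eligible=%v\n", days, gcEligible(s, retention, now))
	}
}
```

Under this model, any silence that ended more than 5 days ago should already be gone, which is why thousands of older expired silences point at a GC problem rather than at the retention setting.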

Hey, is it okay if I take a stab at this one?

I think that would be great, @josedonizetti . @stuartnelson3 might be able to give more context.

I ran a few tests on this, and:

1) Standalone test

All the tests I did here worked perfectly; the silences are always GC'd. One point caught my attention, though: if for some reason one of the silences has a zero-value expiry time, it will break the GC loop and stop the Maintenance goroutine that runs it. The error is logged, but no more GC will run after that.

https://github.com/prometheus/alertmanager/blob/master/silence/silence.go#L286
https://github.com/prometheus/alertmanager/blob/master/silence/silence.go#L235

I was thinking that instead of stopping the Maintenance routine for good when an error appears, we should log the error but keep GCing. Basically, change L286 to log the error and continue. What do you think?
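
The "log the error, but keep GCing" behaviour suggested above can be sketched like this. It is a simplified illustration under assumptions, not the code in `silence/silence.go`: `runMaintenance`, its `ticks` channel, the `gc` callback, and `errZeroExpiry` are hypothetical names, and the loop is driven by a channel instead of a real ticker so that it terminates.

```go
package main

import (
	"errors"
	"fmt"
)

// errZeroExpiry is a hypothetical error standing in for a failed GC run.
var errZeroExpiry = errors.New("silence has a zero expiry time")

// runMaintenance calls gc once per tick. A failed run is logged and the loop
// carries on, so a single bad entry cannot stop future collections.
// It returns how many runs succeeded and how many failed.
func runMaintenance(ticks <-chan struct{}, gc func() error, logf func(string, ...interface{})) (ok, failed int) {
	for range ticks {
		// The work for one tick lives in its own function: returning an
		// error leaves that function only, not the surrounding loop.
		err := func() error {
			return gc()
		}()
		if err != nil {
			logf("silence GC failed: %v", err)
			failed++
			continue
		}
		ok++
	}
	return ok, failed
}

func main() {
	// Three queued ticks stand in for three timer firings.
	ticks := make(chan struct{}, 3)
	for i := 0; i < 3; i++ {
		ticks <- struct{}{}
	}
	close(ticks)

	calls := 0
	gc := func() error {
		calls++
		if calls == 2 {
			return errZeroExpiry // the second run fails
		}
		return nil
	}
	logf := func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	}

	ok, failed := runMaintenance(ticks, gc, logf)
	fmt.Println("ok:", ok, "failed:", failed)
}
```

In this sketch the third tick still triggers a GC run even though the second one returned an error.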

2) HA test

Three instances up (A, B, C). I sent a few alerts, silenced them, and expired the silences. Instance A runs GC and cleans up the expired silences, but a few seconds later receives the state back from B. Then C runs GC and cleans them up, but then receives them back from A. This behavior keeps looping forever.

Gonna work on a PR for 2.
What do you think about my suggestion for 1?

  1. If there is an error, it's contained within an anonymous function, so returning just exits that single function. It returns the error, which is then logged (https://github.com/prometheus/alertmanager/blob/master/silence/silence.go#L253-L262), and execution remains within the for-select loop. If a GC() fails, it will still be retried.

Now, whether we can ever have a silence with zero ExpiresAt time is not something I know off the top of my head (it doesn't look like this is checked by the API). This makes me think that it should NOT happen, and we need to make sure that incoming creation requests are validated and fail if no expires at is present.

  2. I assume you meant to say "silences". Looking at the current code, the gossip'd silences from node B should only replace the silences on the receiving node A if the gossip'd silence has a more recent updatedAt time. https://github.com/prometheus/alertmanager/blob/master/silence/silence.go#L860-L863

Which version of Alertmanager were you running when testing this? Within the last week or so we caught one (of several) race conditions within Alertmanager, and someone was nice enough to submit a PR.
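
The merge rule described in point 2, and why it may not by itself stop a collected silence from reappearing, can be sketched as below. This is an illustrative model, not the linked code: the `entry` and `node` types and the `merge` and `gc` methods are hypothetical, and the model assumes a gossiped entry is accepted whenever the receiver has no local copy of it.

```go
package main

import (
	"fmt"
	"time"
)

// t0 is an arbitrary reference time for the example.
var t0 = time.Date(2017, time.September, 15, 10, 0, 0, 0, time.UTC)

// entry is a hypothetical gossiped silence record.
type entry struct {
	ID        string
	UpdatedAt time.Time
	ExpiresAt time.Time // assumed GC deadline
}

// node holds one instance's local view of the silences.
type node map[string]entry

// merge accepts an incoming entry if it is unknown locally or carries a more
// recent UpdatedAt than the local copy. It reports whether state changed.
func (n node) merge(in entry) bool {
	prev, ok := n[in.ID]
	if !ok || prev.UpdatedAt.Before(in.UpdatedAt) {
		n[in.ID] = in
		return true
	}
	return false
}

// gc deletes entries whose deadline has passed and returns how many it removed.
func (n node) gc(now time.Time) int {
	removed := 0
	for id, e := range n {
		if !e.ExpiresAt.After(now) {
			delete(n, id)
			removed++
		}
	}
	return removed
}

func main() {
	old := entry{ID: "s1", UpdatedAt: t0, ExpiresAt: t0.Add(time.Hour)}

	a, b := node{}, node{}
	a.merge(old)
	b.merge(old)

	// A copy that is not newer never replaces the local one.
	fmt.Println("identical copy accepted:", a.merge(old))

	// A collects the silence, then B (which has not run GC yet) gossips it back.
	now := t0.Add(2 * time.Hour)
	fmt.Println("removed by GC on A:", a.gc(now))
	fmt.Println("re-added from B:", a.merge(b["s1"]))
}
```

Under these assumptions the updatedAt comparison only protects entries the receiver still holds. Once A has deleted its copy, there is nothing to compare against and B's copy is accepted again, which matches the loop seen in the HA test.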

@stuartnelson3

  1. True. I missed that!
  2. Gonna extend the discussion on the PR https://github.com/prometheus/alertmanager/pull/999

I think #999 nailed it. Things look totally reasonable now in our 5-node Alertmanager cluster at SoundCloud. I declare this fixed. Please re-open if you have contrary evidence.

Thanks again, @josedonizetti !

yay \o/

I have set --data.retention=1h, but my expired silences are still in memory after 1 hour.
Which piece of code affects the expired silences?
Looking forward to your reply! Thanks a lot!

It makes more sense to ask questions like this on the prometheus-users mailing list rather than in a GitHub issue. On the mailing list, more people are available to potentially respond to your question, and the whole community can benefit from the answers provided.
