Alertmanager: Incorrect needsUpdate in DedupStage for notifications results in spammy emails

Created on 6 Jan 2017 · 8 Comments · Source: prometheus/alertmanager

Alertmanager currently sends out notifications when it thinks the alert content has changed within a group and within the group_interval. It does so by comparing the hash of a group of alerts (firing and resolved alerts).

In my test case using

    scrape_interval: 5s
    group_wait: 30s
    group_interval: 2m
    repeat_interval: 24h

and reproducible via

    curl -XPOST -d'[{"labels":{"alertname":"a1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
    curl -XPOST -d'[{"labels":{"alertname":"a2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
    curl -XPOST -d'[{"labels":{"alertname":"a3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
    curl -XPOST -d'[{"labels":{"alertname":"b1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
    curl -XPOST -d'[{"labels":{"alertname":"b2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
    curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; echo "done... now looping b3 for keepAlive"
    while true; do  curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date  ; sleep 15; done
    echo "done"

It is causing 4 notifications (where I think it should only be sending 2).

This comes from the _needsUpdate_ function in notify.go :

    if !bytes.Equal(entry.GroupHash, hash) {
        return true, nil
    }

I guess the hash will not be sufficient to determine if an alert should be sent.

If all alerts (including the resolved ones) within a group are hashed it could explain the 4 notifications:

My results :
Notification 1

 a1[7a091fb][active]

Notification 2

a2[4ad4244][active]
a3[7732e17][active]
b1[99488d8][active]
b2[4a429ad][active]
b3[3a21090][active]
a1[7a091fb][resolved]

Notification 3

b3[3a21090][active]
a2[4ad4244][resolved]
a3[7732e17][resolved]
b1[99488d8][resolved]
b2[4a429ad][resolved]

Notification 4

b3[3a21090][active]

So from Alertmanager's perspective, the content has changed each time, so it sends out 4 notifications. (On the 5th run it again sees b3[3a21090][active] and does not send a notification.)

Expected behaviour :

Actual behaviour :

area/usability component/notify kind/bug

All 8 comments

Thanks for the detailed report and basically debugging the problem already.
I agree that the behavior you describe is probably the desirable one.

Looks like we have to reconsider the needsUpdate function a fair bit. I'll report back when I've figured something out.

Thx! Looking forward to hearing back from you... I'm always available to test and give feedback (unless I suddenly acquire some supernatural Go skills over the coming days).

Keep up the good work !

Also had the following scenario where 4 notifications were sent out containing the alerts below.

| Time (UTC) | alert1 | alert2 | alert3 | alert4 | alert5 | alert6 |
| ------------- |:------:|:------:|:------:|:------:|:------:|:------:|
| 19:28 | Firing | Firing | Firing | Firing | Resolved | Resolved |
| 19:38 | Firing | Firing | Firing | Firing | Resolved | absent |
| 19:43 | Firing | Firing | Firing | Firing | absent | absent |
| 20:08 | Firing | Resolved | Firing | Firing | absent | absent |

With send_resolved: false, I think only the first one should have been sent, since every notification after it contains only firing alerts that have already been notified (with repeat_interval: 24h).

If my understanding is correct, neither group_interval nor repeat_interval applies here, as we're dealing with alerts that have already been notified.
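As a rough illustration of that expectation, the sketch below replays the table with resolved and absent alerts dropped (send_resolved: false) and only notifies when an alert fires that has not been notified before. `hasNewFiring` and `replay` are hypothetical helpers, and the sketch deliberately ignores repeat_interval, on the assumption that 24h never elapses within this 40-minute window.

```go
package main

import "fmt"

// hasNewFiring reports whether any currently firing alert has not been
// included in an earlier notification.
func hasNewFiring(notified map[string]bool, firing []string) bool {
	for _, name := range firing {
		if !notified[name] {
			return true
		}
	}
	return false
}

// replay returns the times at which a notification would be sent if only
// previously unnotified firing alerts could trigger one.
func replay(times []string, firing [][]string) []string {
	notified := map[string]bool{}
	var sent []string
	for i, f := range firing {
		if hasNewFiring(notified, f) {
			sent = append(sent, times[i])
			for _, name := range f {
				notified[name] = true
			}
		}
	}
	return sent
}

func main() {
	times := []string{"19:28", "19:38", "19:43", "20:08"}
	// Firing alerts per row of the table; resolved and absent ones are dropped.
	firing := [][]string{
		{"alert1", "alert2", "alert3", "alert4"},
		{"alert1", "alert2", "alert3", "alert4"},
		{"alert1", "alert2", "alert3", "alert4"},
		{"alert1", "alert3", "alert4"},
	}
	fmt.Println(replay(times, firing)) // prints [19:28]
}
```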

So I found some semantic errors when going through the code. We can get rid of some of the noise directly; other fixes require alterations to the snapshot format of sent notifications.
That's generally not a problem aside from some potential extra noise after upgrading – but I have to think about the performance implications, as we potentially have to store a lot more data.

In the end, a trick we added in the HA version, which we thought would help us gain performance, cannot handle the cases you described.

Could we store one hash for the firing alerts and another for the resolved?

So @fabxc and I analyzed the problem and figured we will need the full list of active and resolved alerts, so that we can check subsets when evaluating whether we need to resend or not. Hashes will not be enough, as we need to know whether the currently firing alerts are a subset of the previously sent alerts. The same goes for the resolved alerts. Therefore we need two lists of alerts: one for resolved and one for active.

This will increase the payload gossiped through the mesh network. However, we could potentially tweak this with bloom filters if necessary. We have not validated that bloom filters would actually work; I will play it through tomorrow, but either way we should benchmark first before adding that optimization.

Probably wise to add some metrics for bandwidth use, as that's going to be the primary limiting factor for the AM I expect.

Fixed by #703
