Icinga2: Notification behaviour after Downtime ends

Created on 29 Dec 2017 · 10 Comments · Source: Icinga/icinga2

Expected Behavior

I schedule a fixed downtime for a service.
The service goes CRITICAL within the downtime.
The service is still CRITICAL when the downtime ends.
I expect to be notified right when the downtime ends.

Current Behavior

When the downtime ends, the notification for one contact fires right away, but the notification for a second contact is delayed.

Possible Solution

Experiments show that the interval setting of the notification object is the key: the contact that gets the notification right after the downtime ends has a notification interval of 0 (zero).
The other contact has an interval of 600 seconds (10m) and gets the notification 10 minutes after the hard state change, which happened during the downtime.

Steps to Reproduce (for bugs)

Please take a look at the attached screenshot which shows the history of such a behaviour:

  • Schedule downtime of 5 minutes for a (passive) service
  • Trigger hard state change to CRITICAL after a couple of seconds in the downtime
  • After downtime has ended, pstiffel is notified right away (the attached notification object has interval of 0)
  • 5 minutes after the downtime has ended (and 10 minutes after the hard state change), jmueller is notified (the attached notification object has interval of 10m)
    [Screenshot: 2017-12-29_10h12_28]

Context

IMHO, the notification should happen immediately after the downtime has ended, no matter which interval was set.
I believe I saw the same behaviour with timeperiods other than 24x7, e.g. with a notification period of 9to17 and an outage that happens before that time. The contact attached to the notification object with interval 0 is notified right when the notification period starts, while the contact attached to a notification object with a non-zero interval is delayed until the next regular interval after the outage.

The background: Our contact with interval 0 is a ticket system which should only receive one notification, while our staff should be re-informed every hour.

Your Environment

  • icinga2-Version r2.8.0-1
  • Clustered master setup with two nodes
  • Debian 8
Labels: area/notifications, enhancement, needs-sponsoring

All 10 comments

Contacts/Users don't have a notification interval; that is defined inside the notification object.

Can you share a sample configuration in order to reproduce the issue?

Cheers,
Michael

Ok, here we go:

I applied the following changes to a vanilla icinga 2.8 installation:

conf.d/services.conf

apply Service "dummy" {
  import "generic-service"
  check_command = "dummy"
  max_check_attempts = 1
  assign where host.name == NodeName
}

conf.d/users.conf

object User "UserA" {
  import "generic-user"
  display_name = "nur eine Benachrichtigung" // German: "only one notification"
  email = "root@localhost"
}

object User "UserB" {
  import "generic-user"
  display_name = "Benachrichtigung jede Stunde" // German: "notification every hour"
  email = "root@localhost"
}

conf.d/notifications.conf

// "einmalige-Mail" is German for "one-time mail": interval 0 means no re-notification
apply Notification "einmalige-Mail" to Service {
  import "mail-service-notification"
  users = [ "UserA" ]
  interval = 0
  assign where match(service.name, "dummy")
}

// "stuendliche-Mail" is German for "hourly mail": re-notify every hour
apply Notification "stuendliche-Mail" to Service {
  import "mail-service-notification"
  users = [ "UserB" ]
  interval = 1h
  assign where match(service.name, "dummy")
}

conf.d/templates.conf

template User "generic-user" {
        states = [ Up, Down, OK, Warning, Critical, Unknown ]
        types = [ Problem, Acknowledgement, Recovery ]
}

Here's how to reproduce the problem:

  • disable active checks on the dummy service
  • create fixed downtime of 5min on the dummy service
  • submit CRITICAL state to dummy service
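
The three steps above can also be driven through the Icinga 2 REST API. The following is a minimal sketch that only builds the requests and does not send them. The host name, the base URL on port 5665, the author and the comment text are assumptions for illustration, and the helper name `reproduction_requests` is hypothetical.

```python
import time

# Assumption: the API listens on the default port of the local master.
API_BASE = "https://localhost:5665/v1"


def reproduction_requests(host, service="dummy", downtime_seconds=300, now=None):
    """Build (method, url, json_body) tuples for the three reproduction steps.

    Nothing is sent here; pass the tuples to any HTTP client together with
    API user credentials and the header "Accept: application/json".
    """
    now = int(time.time()) if now is None else int(now)
    svc_filter = f'host.name=="{host}" && service.name=="{service}"'
    return [
        # 1. Disable active checks on the dummy service.
        (
            "POST",
            f"{API_BASE}/objects/services/{host}!{service}",
            {"attrs": {"enable_active_checks": False}},
        ),
        # 2. Schedule a fixed downtime of `downtime_seconds` starting now.
        (
            "POST",
            f"{API_BASE}/actions/schedule-downtime",
            {
                "type": "Service",
                "filter": svc_filter,
                "author": "icingaadmin",          # assumption
                "comment": "downtime repro test",  # assumption
                "start_time": now,
                "end_time": now + downtime_seconds,
                "fixed": True,
            },
        ),
        # 3. Submit a passive CRITICAL result (exit status 2).
        (
            "POST",
            f"{API_BASE}/actions/process-check-result",
            {
                "type": "Service",
                "filter": svc_filter,
                "exit_status": 2,
                "plugin_output": "CRITICAL - submitted for downtime test",
            },
        ),
    ]


if __name__ == "__main__":
    for method, url, body in reproduction_requests("icinga-master"):
        print(method, url, body)
```

Building the requests separately from sending them keeps the sketch independent of a particular HTTP library and of the TLS setup of the test instance.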

Result:

  • after the downtime has ended, UserA is notified about the CRITICAL service immediately
  • UserB gets the notification after a considerable delay (in the attached screenshot, it is 53 minutes later)
  • icingaadmin gets notified via default notification object after 23 minutes

Conclusion:
IMHO, all users should be notified immediately after the downtime has ended.
In our production environment, I believe the same problem occurs when you use notification timeperiods other than 24x7 and the outage happens outside of the timeperiod. Here, the notification object with interval=0 fires immediately when the notification period starts, and the other notification objects with interval != 0 fire later. I will reproduce that in the test environment.

[Screenshot: 2018-01-12_09h19_52]

Ok, understood. The main request is to ignore the notification interval once a downtime has ended. Right now, the next notification time is calculated like this:

notification -> suppressed by downtime
+10m for next_notification

downtime ends after 5m

5m later, the next notification is sent for the problem
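
The arithmetic above can be written down as a small model. This is only a sketch of the timing described in this thread, not Icinga 2's actual implementation. The function name is hypothetical, and the handling of an interval that elapses while the downtime is still active (push the timer forward by another interval) is an assumption.

```python
from datetime import datetime, timedelta


def first_notification_time(state_change, downtime_end, interval):
    """Simplified model of when the first problem notification goes out.

    state_change: time of the hard state change (inside the downtime)
    downtime_end: time the fixed downtime ends
    interval:     notification interval of the Notification object
    """
    if interval <= timedelta(0):
        # interval = 0: there is no re-notification timer, so the problem
        # is reported as soon as the downtime no longer suppresses it.
        return max(state_change, downtime_end)

    # The suppressed notification still schedules next_notification.
    next_notification = state_change + interval
    # Assumption: attempts that fall inside the downtime are suppressed
    # again and pushed forward by another interval.
    while next_notification < downtime_end:
        next_notification += interval
    return next_notification


if __name__ == "__main__":
    start = datetime(2017, 12, 29, 10, 0, 0)
    change = start + timedelta(seconds=10)   # CRITICAL shortly after start
    end = start + timedelta(minutes=5)       # 5 minute fixed downtime
    for label, interval in [("interval 0", timedelta(0)),
                            ("interval 10m", timedelta(minutes=10))]:
        sent = first_notification_time(change, end, interval)
        print(label, "sent", sent - end, "after the downtime ended")
```

With a 5 minute downtime and a 10 minute interval, the model places the notification roughly 5 minutes after the downtime ends, which matches the behaviour reported in the issue.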

Changing this could break existing setups. I'd like to hear from others what they think. Or see a possible patch to adjust the behaviour and fully test it.

Just a quick addendum: I saw the same behaviour when an outage happens outside of a notification period. When the notification period starts, UserA with interval=0 gets a notification immediately, and the user with interval=60m gets the notification later, apparently following the same formula that dnsmichi has shown before.

I cannot imagine why someone would not want to be informed of an outage immediately when a downtime ends or a notification period starts, so count my vote for a change of that behaviour.

Sure, I hear you. I'm not sure how this can be implemented yet though.

I noticed in my setup the same behavior and I agree with @edpstiffel that a notification should be sent right after the downtime.

BUMP

Our intended setup relies heavily on what @edpstiffel is describing being the case. Consider the following scenario:
You monitor the software update state for ~500 hosts and Icinga notifications are sent directly to the ticket system. For this to work reliably, without spamming our ticket system every now and then, we have defined a downtime specific to the update checks, so that they only run once a week (a full day). With the current behaviour, if a host gets updates during said downtime, no notification will be sent when the downtime is over, since the check interval is 24h.
That being said, I understand people might be relying on the current behaviour for their setups, so maybe finding some middle ground (e.g. a setting to toggle this behaviour) would satisfy all of us.

Same issue, or rather feature request, here: our print servers are rebooted every night, and during that time they are in a downtime. If a service fails during the downtime, in this case because of the reboot, and does not come back up, we get no notification. Yes, it reached the critical state during the downtime, and yes, the state did not change when the downtime ended.
Maybe as a workaround we will reset the service to "OK" after the downtime via the API from our ticket system.

+1

Could this be solved by adding some sort of queue where all notifications that occurred during a downtime (or while outside of a notification timeperiod) are collected? After the downtime ends, they get deduplicated and checked to see if they still apply. If yes, the notifications get sent immediately.
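
To make the queue idea concrete, here is a minimal sketch of how it might look. It is purely illustrative and not how Icinga 2 works or was later changed. The class name, the string state values and the rule for "still applies" (the object is currently not OK) are all assumptions.

```python
class SuppressedNotificationQueue:
    """Collects notifications suppressed by a downtime or timeperiod."""

    def __init__(self):
        # Keyed by (checkable, notification) so repeated suppressions of
        # the same notification are deduplicated; the latest state wins.
        self._pending = {}

    def suppress(self, checkable, notification, state):
        """Record a notification that could not be sent right now."""
        self._pending[(checkable, notification)] = state

    def flush(self, current_states):
        """Call when the downtime ends or the timeperiod starts.

        current_states maps checkable name to its current state.
        Returns the notifications to send immediately and empties the queue.
        """
        to_send = []
        for (checkable, notification) in self._pending:
            current = current_states.get(checkable, "OK")
            # Assumption: a notification still applies if the object
            # is still in a problem state.
            if current != "OK":
                to_send.append((checkable, notification, current))
        self._pending.clear()
        return to_send


if __name__ == "__main__":
    queue = SuppressedNotificationQueue()
    queue.suppress("host!dummy", "einmalige-Mail", "CRITICAL")
    queue.suppress("host!dummy", "stuendliche-Mail", "CRITICAL")
    queue.suppress("host!dummy", "stuendliche-Mail", "CRITICAL")  # duplicate
    queue.suppress("host!other", "stuendliche-Mail", "WARNING")   # recovers
    print(queue.flush({"host!dummy": "CRITICAL", "host!other": "OK"}))
```

In this sketch both notification objects for the dummy service would fire at the end of the downtime regardless of their interval, while the recovered service produces nothing.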

This is a sponsored feature request, thanks for granting us the time to implement it.

ref/IP/14729
