Icinga2: Set the notification mode times.begin is not 0, the first notification has a delay

Created on 7 Sep 2017  路  5Comments  路  Source: Icinga/icinga2

Expected Behavior


Current Behavior


There are three kinds of notifications for each of my monitoring items:

mail-notification:

object Notification "mail-notification" {
    import "mail-service-notification"

    host_name = "l-jyftest1.ops.beta.cn0"
    interval = 300.000000
    service_name = "check_load"
    times = {
        begin = 0.000000
    }
    users = [ "mika.liu", "yufeng.ji", "kai.duan", "liangliang.pan", "zhangj.zhang", "jiwei.zhao", "yinan.gao", "yankun.zhu" ]
    version = 1504145701.727930
    zone = "master"
}

push-notification:

object Notification "mail-notification" {
    import "push-service-notification"

    host_name = "l-jyftest1.ops.beta.cn0"
    interval = 300.000000
    service_name = "check_load"
    times = {
        begin = 0.000000
    }
    users = [ "qunaropspush" ]
    vars["userlist"] = "mika.liu,yufeng.ji,kai.duan,liangliang.pan,zhangj.zhang,jiwei.zhao,yinan.gao,yankun.zhu"
    version = 1504145701.562974
    zone = "master"
}

ivr-notification:

object Notification "ivr-notification" {
    import "ivr-service-notification"

    host_name = "l-jyftest1.ops.beta.cn0"
    interval = 300.000000
    service_name = "check_load"
    times = {
        begin = 900.000000
    }
    users = [ "qunaropsivr" ]
    vars["userlist"] = "mika.liu,yufeng.ji,kai.duan,liangliang.pan,zhangj.zhang,jiwei.zhao,yinan.gao,yankun.zhu"
    version = 1504145701.873894
    zone = "master"
}

The current situation has been found, ivr-notification when the first notification, there are some delays, than the "times.begin" set the time later.

Possible Solution


Now set the "times.begin = 900" will probably be delayed for more than 1 minute, if there is no way to solve, we may be in the normal begin time: "times.begin = 900" - 60s

Steps to Reproduce (for bugs)


as the picture shows:

image

Theoretically speaking, "ivr-notification" The first notice should be (19:39:51 + 15min) = 19:54:51;
but now "ivr-notification" The first notice is "19:56:10";

delay time: (19:56:10 - 19:54:51) = 1min 19s;

Context


Your Environment

  • Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: r2.7.0-1)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-514.26.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown
  • Operating System and version:
CentOS Linux release 7.3.1611 (Core)
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog gelf graphite ido-pgsql influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
  • Config validation (icinga2 daemon -C):
icinga2 daemon -C
information/cli: Icinga application loader (version: r2.7.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/Zone: The Zone object 'zone-cn0' has more than two endpoints. Due to a known issue this type of configuration is strongly discouraged and may cause Icinga to use excessive amounts of CPU time.
information/ApiListener: My API identity: l-cn0qicingam1.ops.cn2
information/ConfigItem: Instantiated 5 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 3 Zones.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 5 Endpoints.
information/ConfigItem: Instantiated 1 UserGroup.
information/ConfigItem: Instantiated 41614 Notifications.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 274 CheckCommands.
information/ConfigItem: Instantiated 124 Downtimes.
information/ConfigItem: Instantiated 2 HostGroups.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 1 EventCommand.
information/ConfigItem: Instantiated 5581 Hosts.
information/ConfigItem: Instantiated 10458 Users.
information/ConfigItem: Instantiated 3 TimePeriods.
information/ConfigItem: Instantiated 51407 Services.
information/ConfigItem: Instantiated 4 ServiceGroups.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
/*
 * Generated by Icinga 2 node setup commands
 * on 2017-01-09 17:12:48 +0800
 */

object Endpoint "l-cn0qicingam1.ops.cn2" {
    host = "10.90.0.81"
    log_duration = 5m
}

object Endpoint "l-cn0qicingam2.ops.cn2" {
    host = "10.90.0.82"
    log_duration = 5m
}

object Endpoint "l-cn0qicingan1.ops.cn2" {
    host = "10.90.0.85"
    log_duration = 5m
}

object Endpoint "l-cn0qicingan2.ops.cn2" {
    host = "10.90.0.86"
    log_duration = 5m
}

object Endpoint "l-cn0qicingan3.ops.cn2" {
    host = "10.90.0.84"
    log_duration = 5m
}

object Zone "master" {
    endpoints = [ "l-cn0qicingam1.ops.cn2", "l-cn0qicingam2.ops.cn2" ]
}

object Zone "zone-cn0" {
    endpoints = [ "l-cn0qicingan1.ops.cn2", "l-cn0qicingan2.ops.cn2", "l-cn0qicingan3.ops.cn2" ]
    parent = "master"
}

/* sync global commands */
object Zone "global-templates" {
        global = true
}
arenotifications bug

Most helpful comment

I have researched this issue a bit more and just submitted a pull request for some changes which I think better address the problem.

The last hard state change time isn't necessarily the best moment to base the begin and end times on since a notification setup for, say, warning and critical states, will have those times pushed back if the service flips between critical and warning before they are reached (this could delay notifications indefinitely and lead to some uncomfortable questions from admins who never got notified of the problem).

In the pull request, I implemented a field to track the actual trigger time of the notification (when it first goes into an applicable hard problem state). That trigger time is reset on recovery or on changing to a non-applicable state. It's a small change in the behaviour of the system, so it should be carefully reviewed, but I believe most users would expect and want the Notifications to act like this, especially if they want a specific layered escalation system for handling problems.

All 5 comments

Please fetch the runtime information for the ivr-notification from the REST API /v1/objects/notifications/ at several points in time:

  • when the service is OK
  • when the service is NOT-OK and in in its first hard state
  • after the notification was actually sent

I am seeing the same issue with initial notification delays (later than "begin" time). I used the REST API to monitor the notification objects and I believe I understand at least what it's doing.

  • when a service is OK, the notification is still calculating a next_notification time based on the notification interval.

  • when a service has a problem, IF the next_notification at that moment comes AFTER the begin time, then the next_notification is updated according to the begin time, and the notification goes out on time, as expected.

  • However, IF the next_notification comes BEFORE the begin time, then the next_notification is not changed, until that "next_notification" time arrives...at which point it will be changed to that moment + the begin time.

So, if begin time is 10 minutes, the interval is 15 minutes, and a service goes critical 14 minutes before its currently listed "next_notification"...then the notification goes out at 10 minutes.

If that same service is 5 minutes from the "next" notification when it goes critical, then the notification goes out at 15 minutes instead of 10 minutes. (5 remaining minutes + the 10 minutes "begin" time).

I dug into the source code a bit and believe I have found where the problem is located. In lib/icinga/notification.cpp , in the function: void Notification::BeginExecuteNotification(...), starting around line 293 or so, there are these instructions for setting up the next notification based on times.begin :

/* we need to adjust the next notification time
* to now + begin delaying the first notification
*/

double nextProposedNotification = now + timesBegin + 1.0;
if (GetNextNotification() > nextProposedNotification)
    SetNextNotification(nextProposedNotification);

I have tweaked this a bit and could at least make it work with either of a couple of changes. I could remove the conditional entirely:

/* we need to adjust the next notification time
* to now + begin delaying the first notification
*/

double nextProposedNotification = now + timesBegin + 1.0;
SetNextNotification(nextProposedNotification);

Or I could offset the begin time against the last state change, instead of "now" (which could be "later", if the condition prevented the SetNextNotification on the first try ).

/* we need to adjust the next notification time
* to now + begin delaying the first notification  (well, maybe not "now" after all)
*/

double nextProposedNotification = checkable->GetLastHardStateChange() + timesBegin + 1.0;
if (GetNextNotification() > nextProposedNotification)
    SetNextNotification(nextProposedNotification);

I am not sure why the condition (GetNextNotification() > nextProposedNotification) is taken into consideration, but suspect there is a good reason for that. (re-notifications for previous events?)

In any case, though, shouldn't the "begin" time of a notification be referencing the object's hard state change time, instead of "now" ?

I have researched this issue a bit more and just submitted a pull request for some changes which I think better address the problem.

The last hard state change time isn't necessarily the best moment to base the begin and end times on since a notification setup for, say, warning and critical states, will have those times pushed back if the service flips between critical and warning before they are reached (this could delay notifications indefinitely and lead to some uncomfortable questions from admins who never got notified of the problem).

In the pull request, I implemented a field to track the actual trigger time of the notification (when it first goes into an applicable hard problem state). That trigger time is reset on recovery or on changing to a non-applicable state. It's a small change in the behaviour of the system, so it should be carefully reviewed, but I believe most users would expect and want the Notifications to act like this, especially if they want a specific layered escalation system for handling problems.

refs #5890

Was this page helpful?
0 / 5 - 0 ratings