Icinga2: Icinga2 sends notification during reload (despite downtime) and does not log it

Created on 19 Dec 2018  路  19Comments  路  Source: Icinga/icinga2

This is similar to #6057. It is even the exact same Icinga2 setup 馃槈

Expected Behavior

  1. Icinga2 should not send a notification for a service that is in downtime (and active checks disabled) during a reload.
  2. When Icinga2 sends a notification it should at least log it.

Current Behavior

We have two masters (master01 and master02) which are responsible for the notifications in our cluster.

  • Service has 7 configured contacts
  • Service is critical
  • All 7 contacts receive notifications
  • Service is in downtime (configured via API)
  • Active checks are disabled (configured via API)
  • Nothing changes for 30 days and Icinga2 reloads every 30 minutes during this time. I do not think that the number of days is relevant but it shows that it works most of the time.
  • Icinga2 reloads
# master01
...
[2018-12-15 20:26:05 +0100] information/ExternalCommandListener: 'command' stopped.
[2018-12-15 20:26:19 +0100] information/FileLogger: 'main-log' started.
[2018-12-15 20:26:22 +0100] information/ApiListener: 'api' started.
...

# master02
[2018-12-15 20:26:23 +0100] information/ApiListener: Restarting after configuration change.
  • 20:26:43 master02 sends a notification to one contact
    This should not happen because the service is still in downtime (and also the active checks are disabled).
    I guess it only happened for one contact because the other notification objects seem to be "scheduled" on master01.
    Icinga2 does not log this notification, but our notification handler does. The notification is also not visible in Icingaweb2.
  • 20:31:10 we acknowledge the alert
  • All 7 contacts receive a notification. This seems to be correct because all contacts received the critical alert 30 days ago.

Possible Solution

I guess that it is again some issue with the startup order or a race condition. But I have no specific idea.

Steps to Reproduce (for bugs)

It seems to happen at random times. I do not even know how often. Currently I only have this one example and no idea how to reproduce it.

Context

We have 2 masters and 6 satellites (4 zones, 2x2 and 2x1 satellites) and deploy our configuration via Puppet on master01 at xx:22. (In this case at 20:22.) Usually Icinga2 selects master02 as the IDO-DB master.

Your Environment

  • Version used (icinga2 --version): r2.10.2-1
  • Operating System and version: Ubuntu 16.04
  • Enabled features (icinga2 feature list): api checker command ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): 2.6.1
  • Config validation (icinga2 daemon -C):
[2018-12-19 17:40:51 +0100] information/cli: Icinga application loader (version: r2.10.2-1)
[2018-12-19 17:40:51 +0100] information/cli: Loading configuration file(s).
[2018-12-19 17:40:53 +0100] information/ConfigItem: Committing config item(s).
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Please read the upgrading documentation for v2.8: https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2018-12-19 17:40:53 +0100] information/ApiListener: My API identity: icinga2.live.master01
[2018-12-19 17:41:03 +0100] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 30.5333/s (1832/min 1832/5min 1832/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 636 ScheduledDowntimes.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 63109 Services.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 2523 Hosts.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 52704 Dependencies.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 49410 Notifications.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 339 HostGroups.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8603 Downtimes.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 522 Comments.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 6 Zones.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8 Endpoints.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8 ApiUsers.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 331 Users.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 14 CheckCommands.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 201 UserGroups.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ServiceGroup.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-12-19 17:41:10 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-12-19 17:41:10 +0100] information/cli: Finished validating the configuration file(s).

Let me know if you think that a detailed configuration of the zones end endpoints would be helpful or if I can do something to help.

arenotifications bug reNC

Most helpful comment

@ekeih Sorry. All we got so far was that the setup in question hit the same problem. I'm still evaluating but I already escaleted the ticket internally so it might well be that someone else will come back to you for details.

All 19 comments

ref/NC/597231

@widhalmt Any chance that the internal ticket contains any useful information to reproduce this issue or something else? Let me know if I can do anything to help with this problem.

@ekeih Sorry. All we got so far was that the setup in question hit the same problem. I'm still evaluating but I already escaleted the ticket internally so it might well be that someone else will come back to you for details.

Thanks for the info! 馃檪

Whiteboxtested v2.10.2 with the following result:

NotificationComponent#NotificationTimerHandler() considers Checkable#IsInDowntime() which looks for downtimes added via Checkable#RegisterDowntime() in Downtime#Start() on downtime activation. The present downtimes always register themselves successfully as the checkables are loaded before the downtimes and the downtimes are activated after loading. In addition NotificationComponents are activated after downtimes.

(Thanks to @Crunsher for consultation.)

TL;DR: The code seems clean. CC @lippserd

@ekeih Please could you provide debug logs around the time the notification was send? And: Are you sure that Icinga 2 calls the notification script?

We also see this behavior in our setup too.

@SimonHoenscheid Do you have some debug logs and information about your setup?

Please could you provide debug logs around the time the notification was send?

Currently not. Last time we used the debug log it grew extremely fast. So I am not able to enable it for several days in production.
I will try to gather more information in our testing environment but it's quite hard without a way to actively reproduce it.

And: Are you sure that Icinga 2 calls the notification script?

Yes, I am 100% sure that it is called by Icinga2.

Found that issue, because we had the same problem in the past.
During a reload the notification script is sometimes triggered even though a downtime is set for a service.
This happens not for every service and it's not reproducible.

[2019-01-25 18:40:41 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 3, rate: 2495.82/s (149749/min 780477/5min 2344675/15min);
[2019-01-25 18:40:51 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 1, rate: 2505.15/s (150309/min 780335/5min 2344155/15min);
[2019-01-25 18:41:01 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 4, rate: 2514.7/s (150882/min 780287/5min 2347299/15min);
[2019-01-25 18:41:02 +0100] information/Application: Reload requested, letting new process take over.
[2019-01-25 18:41:02 +0100] information/ApiListener: 'api' stopped.
[2019-01-25 18:41:02 +0100] information/CheckerComponent: 'checker' stopped.
[2019-01-25 18:41:03 +0100] information/ExternalCommandListener: 'command' stopped.
[2019-01-25 18:42:01 +0100] information/FileLogger: 'main-log' started.
[2019-01-25 18:42:10 +0100] information/WorkQueue: #10 (DaemonCommand::Run) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2019-01-25 18:42:13 +0100] information/ApiListener: 'api' started.
[2019-01-25 18:42:13 +0100] information/ApiListener: Copying 3 zone configuration files for zone 'xx.yy.zz.de' to '/var/lib/icinga2/api/zones/xx.yy.zz.de'.
[2019-01-25 18:42:13 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/xx.yy.zz.de' (0 Bytes). Received timestamp '2019-01-25 18:42:13 +0100' (1548438133.171498), Current timestamp '2019-01-25 11:50:50 +0100' (1548413450.769627).
[2019-01-25 18:42:13 +0100] information/ApiListener: Copying 12 zone configuration files for zone 'director-global' to '/var/lib/icinga2/api/zones/director-global'.

Historie in Icingweb2 is empty for that timestamp ...

grafik

... SMS-Gateway shows Icinga has triggered the notification.

grafik

Systemd logs something like this

Jan 25 18:41:02 localhost.localdomain systemd[1]: icinga2.service: Supervising process 32212 which is not our child. We'll most likely not notice when it exits.

We're running a single master with several zones. So I think it's not a cluster problem ...

Another idea ... Is it possible, that the newly spawn Icinga2 process will send this notification while starting up? Because ACKs sent via API are shown in Icingaweb2:

grafik

The notification script automatically sends an ACK to the API for the triggered service:

if ( $response->is_success ) {
  print $response->decoded_content . "\n";
  my $retval = $response->decoded_content;
  $post_data = "\"type\": \"Host\", \"filter\": \"host.name==\\\"$icinga2_host_name\\\"\", \"author\": \"Notification-Manager\", \"comment\": \"Es wurde erfolgreich eine SMS an $smsgw_pager ($icinga2_user_display_name) versendet. Status der SMS: $retval.\", \"notify\": false, \"sticky\": true";
  my $command = "curl -u '$icinga2_api_user:$icinga2_api_password' -H 'Accept: application/json' -X POST -k '$endpoint_url' -d '{ $post_data}'";
  exec($command);
} else {
  my $retval = $response->decoded_content;
  my $retmsg = $response->status_line;
  $post_data = "\"type\": \"Host\", \"filter\": \"host.name==\\\"$icinga2_host_name\\\"\", \"author\": \"Notification-Manager\", \"comment\": \"Es trat ein Fehler beim versenden der SMS an $smsgw_pager ($icinga2_user_display_name) auf. Status der SMS: $retval ($retmsg).\", \"notify\": false, \"sticky\": true";
  my $command = "curl -u '$icinga2_api_user:$icinga2_api_password' -H 'Accept: application/json' -X POST -k '$endpoint_url' -d '{ $post_data}'";
  exec($command);

  print $response->decoded_content . "\n";
  die $response->status_line . "\n";
}

So the API accepts and process connections and events, but logs nothing to /var/log/icinga2/icinga2.log.

The reload on 25th at ~ 18:40 was triggered by Director:

grafik

As I understand the state of Icinga2 is written to
/var/lib/icinga2/icinga2.state during a restart.

So ... Which Icinga2-Core receives the API-Call during reload in that case? The old or the new one? Why is an API-Call persisten, but nothing is logged or written to ido?
Any ideas?

Update:

After digging through the notificationcomponent code as @Al2Klimov described, I found, that the following isn't never logged to the /var/log/icinga2/icinga2.log in our environments:

void NotificationComponent::Stop(bool runtimeRemoved)
{
    Log(LogInformation, "NotificationComponent")
        << "'" << GetName() << "' stopped.";

    ObjectImpl<NotificationComponent>::Stop(runtimeRemoved);
}

With a

systemctl restart icinga2

only the startup of the component is logged:

[2019-01-29 15:11:45 +0100] information/NotificationComponent: 'notification' started.

A reload with systemd or director also brings up the NotificationComponent message in the logfile.
At January, 25th, the NotificationComponent was also started about a minute after the "notification" was triggered.

[2019-01-25 18:42:14 +0100] information/NotificationComponent: 'notification' started.

An "NotificationComponent ... stopped" event, which should be thrown during shutdown as I understand, wasn't also logged in that case.

I assume something goes wrong during shutdown, which leads to a trigger that sends notifications out.

During reload the daemon starts a copy of itself and then shuts down. There's a small period of time during which any of them could receive the request. At the moment log messages about API requests may be lost, but #6827 would fix that.

@SimonHoenscheid Do you have some debug logs and information about your setup?

I can sit down next week and put everything together. regarding debug logs, I need to have a look of the impacts this will have on our system.

During a reload the notification script is sometimes triggered even though a downtime is set for a service.

We see the same issue:
screen shot 2019-02-11 at 09 41 47
At 03:08 there was a Icinga2 reload, but nothing was logged about a notification:

[2019-02-11 03:09:50 +0100] information/FileLogger: 'main-log' started.
[2019-02-11 03:09:56 +0100] information/ApiListener: 'api' started.
[2019-02-11 03:09:57 +0100] information/ExternalCommandListener: 'command' started.
[2019-02-11 03:09:57 +0100] information/GelfWriter: 'gelf' started.
[2019-02-11 03:09:57 +0100] information/InfluxdbWriter: 'influxdb' started.
[2019-02-11 03:09:57 +0100] information/CheckerComponent: 'checker' started.
[2019-02-11 03:09:57 +0100] information/NotificationComponent: 'notification' started.
[2019-02-11 03:09:57 +0100] information/DbConnection: 'ido-mysql' started.

what is even more strange:

  • graylog logged a notification at 02:04

This does not make any sense, right? when the service was critical, graylog logged the notification but it was not actually triggered. during the reload at 03:08 the / a notification was triggered but not logged to graylog?

@ekeih You could auto-trucate the debug log by a cron script not to run out of disk space.

Same here. If any addtional Logs or Information are required, please feel free to ask.

Thanks

@MaBauMeBad I think the main reason that there is no progress is that nobody provided debug logs. I currently do not have the resources to do so.
So if you can provide debug logs of this issue, this would probably help ;)

Hi all,
just wanted to confirm this. Had it right now. Scenario:
Scheduled Downtimes of several Hosts, which are off. Edited the Icinga2 Config of other hosts, reloaded the config and got notifications of the downtimed hosts.

Cheers,
Marcus

I consider this fixed with 2.11 from the many patches we've implemented in this region.

Was this page helpful?
0 / 5 - 0 ratings