We push new configurations on an hourly base into icinga. After this a config validation and a icinga2 reload are performed. The script, which does this runs every 11th minute. Our environment contains a master zone and 7 satellite zones. The masters and slaves in each zone are installed as ha-pairs of two machines. We have 17k hosts and 73k services.
We recently noticed a few, but nasty side effects from those reloads:
Sometimes the check attempt count seems to be set back during a reload:
https://i.imgur.com/RcJfd8S.png
Shortly after a reload there's also a point in time where notifications are not being sent:
https://monitoring-portal.org/uploads/default/original/2X/1/1a2e69717db5d8e2382e64a6eb2ec8d44208b324.png
I allready discuss this matter at monitoring-portal where michi dropped an interesting hint at post #5.
Reloads should be statefull.
reloads seem to reset the check attempt count.
also notifications are not being sent shortly after a reload.
unknown
both problems are a big problem for us. because they undermine the trust in the monitoring solution :~
icinga2 --version):icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.1-1)
Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 7.4 (Maipo)
Kernel: Linux
Kernel version: 3.10.0-693.21.1.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: unknown
RHEL 7.4
icinga2 feature list):Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus mainlog opentsdb statusdata
Enabled features: api checker ido-mysql notification perfdata syslog
Icinga Web 2 Version
2.6.0
Git Commit
cfe6c7b06587189b3ef688183cacd32594db071a
Git Commit Datum
2018-07-19
Copyright
漏 2013-2018 Das Icinga Projekt
Geladene Module
Name Version
businessprocess 2.1.0
monitoring 2.6.0
pnp 1.1.0
icinga2 daemon -C):[2018-09-04 18:24:32 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-09-04 18:24:32 +0200] information/cli: Loading configuration file(s).
[2018-09-04 18:24:33 +0200] information/ConfigItem: Committing config item(s).
[2018-09-04 18:24:33 +0200] information/ApiListener: My API identity: dxzmicinga01
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 73387 Services.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 42 HostGroups.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 81900 Notifications.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17060 Hosts.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 12 Zones.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 17 Endpoints.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 4 ApiUsers.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 249 CheckCommands.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-09-04 18:24:39 +0200] information/ConfigItem: Instantiated 1 User.
[2018-09-04 18:24:39 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-09-04 18:24:39 +0200] information/cli: Finished validating the configuration file(s).
Ensure that the icinga2.state file is intact and read during reloads, as otherwise you're loosing runtime data.
In terms of the notification 0 user problem, you should investigate further but not as part of this issue.
Thanks for the input. The file is intact and readable. I'm using the command "systemctl reload icinga2". How can I ensure that the icinga2.state file is read? I searched the docs and the internet but sadly found no clue.
bump
No need for bumping issues, we'll reply once there's time to.
Once the state file is read, the log will write something like this: Restored X objects. Loaded Y new objects without state..
Sorry for the bumping.
I can't find the relating log entries anywhere in the syslog or in the debuglog. I temporarely enabled the mainlog feature, did a reload (systemctl reload icinga2) and looked there too. Nothing.
To have a reference I looked for those log entries in other github issues. Should be something like this:
information/ConfigObject: Restoring program state from file '/var/lib/icinga2/icinga2.state'
I have never seen them before in our environment.
This is the syslog output during a reload:
Sep 19 08:24:12 icingamaster01 systemd: Removed slice User Slice of root.
Sep 19 08:24:12 icingamaster01 systemd: Stopping User Slice of root.
Sep 19 08:24:13 icingamaster01 xinetd[1899]: EXIT: csync2 status=0 pid=53087 duration=11(sec)
Sep 19 08:24:18 icingamaster01 safe-reload: Validating config files: Done
Sep 19 08:24:18 icingamaster01 safe-reload: Reloading Icinga 2: Done
Sep 19 08:24:18 icingamaster01 systemd: Reloaded Icinga host/service/network monitoring system.
Sep 19 08:24:26 icingamaster01 systemd: icinga2.service: Supervising process 53355 which is not our child. We'll most likely not notice when it exits.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone1' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone2' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'global-templates' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone3' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone4' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'linux-commands' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'master' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'windows-commands' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone5' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone6' because we have an authoritative version of the zone's config.
Sep 19 08:24:34 icingamaster01 icinga2: Ignoring config update for zone 'zone7' because we have an authoritative version of the zone's config.
The content of /lib/systemd/system/icinga2.service looks like this:
[Unit]
Description=Icinga host/service/network monitoring system
After=syslog.target network-online.target postgresql.service mariadb.service carbon-cache.service carbon-relay.service
[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/icinga2
ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/icinga2
ExecStart=/usr/sbin/icinga2 daemon -e ${ICINGA2_ERROR_LOG}
PIDFile=/run/icinga2/icinga2.pid
ExecReload=/usr/lib/icinga2/safe-reload /etc/sysconfig/icinga2
TimeoutStartSec=30m
# Systemd >228 enforces a lower process number for services.
# Depending on the distribution and Systemd version, this must
# be explicitly raised. Packages will set the needed values
# into /etc/systemd/system/icinga2.service.d/limits.conf
#
# Please check the troubleshooting documentation for further details.
# The values below can be used as examples for customized service files.
#TasksMax=infinity
#LimitNPROC=62883
[Install]
WantedBy=multi-user.target
Is there a chance that the icinga2.state file is being ignored?
We have a similar issue with reloads. We have a check that monitors if check results of all services are fresh by using the internal state of the objects (e.g. service objects last_check attribute). after reloads / restarts these values are old (up to 5min, from the time where the last Dumping program state to file '/var/lib/icinga2/icinga2.state' appeared in the log).
So I highly suspect that it's not a problem with reading the state file, more it seems that the state file is not updated on a clean shutdown (reload).
If this is true and I got this right this would mean Icinga2 currently looses up to 5min of all state during every restart / reload.
I can verify your observation. The icinga2.state file is not updated before/during a reload.
I think we have a similar issue.
We have three different kinds of notifications:
This is what happened:
I am still looking for a way to reproduce this in a reliable way. It would be also interesting to be able to check how often something like this happens. So if anyone has an idea I am happy to hear it :)
Some additional context that could be useful:
This happened with Icinga2 2.9.1. (In the meantime we updated to 2.9.2.)
We have a HA setup with two masters and 3x2 satellites. Puppet generates the config on master01 which synchronizes the config to master02. When both masters are up Icinga2 selects master02 to write into the IDODB.
According to the Icinga2 log it dumps the current state every 5 minutes, but not immediately before the reload, with the following line:
information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
We do not have the following line anywhere in our logs:
information/ConfigObject: Restoring program state from file '/var/lib/icinga2/icinga2.state'
@ekeih it's important to understand that the data (state of a check, last check result, ack, etc.) in the idodb (icingaweb2) is not necessarily the same as internally in icinga2. icinga2 does not read from the idodb, it only exports the state to it. after a restart, icinga2 depends on the state from the state file. which, if my observation is true, is not updated during a restart but periodically only.
if it's true that icinga2 looses the state during reloads (is reset back to the state where the state file was updated, which happens every 5min afaik) then you'll see the correct number of check attempts / hard state, etc. in icingaweb2 but icinga2 internally will not know that the service was already in a hard state for example.
What I don't know is whether IDO will remove e.g. an ACK when it is gone in icinga2's internal state (because of the issue we suspect here). This could partly explain the situation you described.
I'd like to get an idea what's going on, so it is essential to know about the specialities on your platform. This includes the following:
icinga2 --version outputicinga2 daemon -C output).watch -n 1 'ls -la /var/lib/icinga2/icinga2.state'
does the timestamp change during reload?
systemctl reload icinga2
Are there any temporary files written with a newer timestamp, e.g. icinga2.state.23Wckf?
watch -n 1 'ls -la /var/lib/icinga2/icinga2.state*'
Can you strace the processes, and look specifically for file write operations done during the reload?
pidof icinga2
<pick one>
strace -o strace.icinga2.log -e trace=open -p <PID>
icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.1-1)
Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 7.4 (Maipo)
Kernel: Linux
Kernel version: 3.10.0-693.21.1.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: unknown
We use a HA-Setup, which consists of 2 physical Machines.
sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: permissive
Mode from config file: permissive
Policy MLS status: enabled
Policy deny_unknown status: allowed
icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus mainlog opentsdb statusdata
Enabled features: api checker ido-mysql notification perfdata syslog
Everything works fine so far. I found nothing unusual.
icinga2 daemon -C
[2018-10-12 11:22:47 +0200] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-10-12 11:22:47 +0200] information/cli: Loading configuration file(s).
[2018-10-12 11:22:48 +0200] information/ConfigItem: Committing config item(s).
[2018-10-12 11:22:49 +0200] information/ApiListener: My API identity: dxzmicinga01
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 74463 Services.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 54 HostGroups.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 90291 Notifications.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 17485 Hosts.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 76 Comments.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 12 Zones.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 17 Endpoints.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 5 ApiUsers.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 255 CheckCommands.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-10-12 11:22:55 +0200] information/ConfigItem: Instantiated 1 User.
[2018-10-12 11:22:55 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-12 11:22:55 +0200] information/cli: Finished validating the configuration file(s).
A reload takes about 20 seconds.
watch -n 1 'ls -la /var/lib/icinga2/icinga2.state'
no timestamp-changes
watch -n 1 'ls -la /var/lib/icinga2/icinga2.state*'
no temp-files
strace -o strace.icinga2.log -e trace=open -p
+++ exited with 0 +++
Interesting, thanks. I'm not able to reproduce this locally, I can see that the temporary state file is written and renamed at the exact reload time, including an open call in dtrace. Likely it is a race condition when the child process with the validation takes too much time, and kills the parent process for some reason.
I'd still like to hear the setups from @marcofl and @ekeih to get a better picture.
# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.2-1)
Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
System information:
Platform: Ubuntu
Platform version: 16.04.5 LTS (Xenial Xerus)
Kernel: Linux
Kernel version: 4.4.0-137-generic
Architecture: x86_64
Build information:
Compiler: GNU 5.3.1
Build host: c1410dac9c1d
#聽icinga2 daemon -C
[2018-10-15 09:49:21 +0200] information/cli: Icinga application loader (version: r2.9.2-1)
[2018-10-15 09:49:21 +0200] information/cli: Loading configuration file(s).
[2018-10-15 09:49:26 +0200] information/ConfigItem: Committing config item(s).
[2018-10-15 09:49:26 +0200] information/ApiListener: My API identity: master1.monitoring.vshn.net
[2018-10-15 09:49:36 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 30.6333/s (1838/min 1838/5min 1838/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #5 (InfluxdbWriter, influxdb) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:36 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 59932 Services.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 4 ServiceGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 7689 ScheduledDowntimes.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 40 HostGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 6 NotificationCommands.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 26429 Notifications.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 971 Hosts.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 15614 Downtimes.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 152 Comments.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 896 Zones.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 895 Endpoints.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 6 ApiUsers.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 7 UserGroups.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 320 CheckCommands.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 2 TimePeriods.
[2018-10-15 09:49:40 +0200] information/ConfigItem: Instantiated 14 Users.
[2018-10-15 09:49:40 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-15 09:49:40 +0200] information/cli: Finished validating the configuration file(s).
I think the problem could be that puppet (https://github.com/Icinga/puppet-icinga2/blob/master/manifests/service.pp#L36, https://github.com/Icinga/puppet-icinga2/blob/master/manifests/params.pp#L98) uses reload and not restart. maybe the state file update on reload is broken? because I see the file being updated during a systemctl restart icinga2 but NOT on a systemctl reload icinga2
This would also explain, why we see this issue after changing from our fork of the old icinga2 puppet module to the new one.
Every 1.0s: ls -la /var/lib/icinga2/icinga2.state* Mon Oct 15 09:53:51 2018
-rw------- 1 nagios nagios 135209033 Oct 15 09:53 /var/lib/icinga2/icinga2.state
and nothing about this in the log (just the periodic writes)
[2018-10-15 09:30:53 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:36:01 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:43:12 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-10-15 09:48:20 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
Indeed. With systemctl restart icinga2 it works as it should. The state-file is written during the restart.
root@master01:/# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.9.2-1)
Copyright (c) 2012-2018 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
System information:
Platform: Ubuntu
Platform version: 16.04.5 LTS (Xenial Xerus)
Kernel: Linux
Kernel version: 2.6.32-042stab133.2
Architecture: x86_64
Build information:
Compiler: GNU 5.3.1
Build host: c1410dac9c1d
System health, are there any security limits, limits on I/O
No
Which features are enabled and running on the same box, do they take more CPU / IO than usual?
All components are running in virtuozzo containers. We have two masters, 3x2 satellites and a separate MySQL Master and a Slave. The host systems and the containers do not show any performance issues.
How many objects are involved here (icinga2 daemon -C output).
root@master01:/# icinga2 daemon -C
[2018-10-15 10:45:56 +0200] information/cli: Icinga application loader (version: r2.9.2-1)
[2018-10-15 10:45:56 +0200] information/cli: Loading configuration file(s).
[2018-10-15 10:45:57 +0200] information/ConfigItem: Committing config item(s).
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-10-15 10:45:57 +0200] warning/ApiListener: Please read the upgrading documentation for v2.8: https://www.icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2018-10-15 10:45:57 +0200] information/ApiListener: My API identity: foobar.live.master01
[2018-10-15 10:46:07 +0200] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 45.4667/s (2728/min 2728/5min 2728/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:07 +0200] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 65651 Services.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ServiceGroup.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 605 ScheduledDowntimes.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 355 HostGroups.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 49649 Notifications.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 2569 Hosts.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 6761 Downtimes.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 54952 Dependencies.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 395 Comments.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 6 Zones.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 8 Endpoints.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 7 ApiUsers.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 210 UserGroups.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 14 CheckCommands.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-10-15 10:46:13 +0200] information/ConfigItem: Instantiated 327 Users.
[2018-10-15 10:46:13 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-10-15 10:46:13 +0200] information/cli: Finished validating the configuration file(s).
root@master01:/# time systemctl reload icinga2
real 0m17.984s
user 0m0.000s
sys 0m0.003s
does the timestamp change during reload?
Not during a reload, but during a restart.
Are there any temporary files written with a newer timestamp, e.g. icinga2.state.23Wckf?
Also not during a reload, but during a restart.
So as @marcofl and @K0nne already pointed out a restart may work better than a reload. But interestingly Icinga2 log did not log Dumping program state to file or Restoring program state from file during the restart.
# during a reload
root@master01:/# strace -e open -p 26885
strace: Process 26885 attached
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=27670, si_uid=0} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=27817, si_uid=107} ---
+++ killed by SIGABRT +++
# during a restart
root@master01:/# strace -e open -p 27817
strace: Process 27817 attached
open("/var/lib/icinga2/icinga2.state.UiOfHM", O_RDWR|O_CREAT|O_EXCL, 0600) = 13
open("/var/lib/icinga2/icinga2.state.UiOfHM", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 47
open("/var/lib/icinga2/modified-attributes.conf.GisxY1", O_RDWR|O_CREAT|O_EXCL, 0600) = 13
open("/var/lib/icinga2/modified-attributes.conf.GisxY1", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 47
+++ exited with 0 +++
Update: I just realized that my strace during a reload prints out +++ killed by SIGABRT +++. Maybe this is a clue?
Thanks for the details, I wanted to get an overview if this is distribution agnostic, or maybe related to special signal handling with Systemd. One of our community members opened #6689 which includes a fix, similar to what I already had in mind from my analysis. Likely last week was too much stress, today I've found a reliable way to reproduce the issue.
You can find the steps to reproduce and test protocol in #6689, a PR is coming soon.
Please test the snapshot packages - I'll trigger them for el7 and Ubuntu 16 and let you know once available.
Packages are available, please report from your tests :-)
I can confirm that with the version 2.10.0+18.gc0398ed the state file is changed during a reload. What I don't see is the new log message implemented in https://github.com/Icinga/icinga2/pull/6691/files
(master-test-monitoring_v2)root@master1:~# grep -i reload /var/log/icinga2/icinga2.log
[2018-10-17 09:28:05 +0200] information/Application: Got reload command: Starting new instance.
[2018-10-17 09:28:06 +0200] information/Application: Reload requested, letting new process take over.
[2018-10-17 09:29:28 +0200] information/Application: Got reload command: Starting new instance.
[2018-10-17 09:29:29 +0200] information/Application: Reload requested, letting new process take over.
Maybe the logs are flushed too late and then exit() kills the stream prior to writing it. Did you test the thing with sending an acknowledgement during the reload time too?
Our issue with having old check results internally in icinga2 after a reload seems to be gone with this. Also I did things like disable active checks for a service -> reload -> active check still disabled -> enable active check -> reload -> waiting for next check -> new check results was there. also acked a critical service and the ack was not gone after a reload.
I tend to say that this fixed the issue. But maybe other reporters of this issue can confirm this too please.
would you say this snapshot (2.10.0+18.gc0398ed) is okay to use in production until a release?
Ok, thanks for testing :)
The snapshot packages only contain a regression fix for the API, and Windows build fix. Nothing critical, although I wouldn't recommend snapshot packages in production. When there's more feedback we're planning with 2.10.1 soon.
Got a similar test response with #6624, so I am inclined to say that it is fixed. I know that @K0nne won't be able to test this in production.
Hi, sorry for my late response - I was out of office the last two days.
According to another ticket we can excepct 2.10.1 later today. I will upgrade our stage environment to 2.10.1 and report back.
Thanks to everyone for this fix! :)
Most helpful comment
I think the problem could be that puppet (https://github.com/Icinga/puppet-icinga2/blob/master/manifests/service.pp#L36, https://github.com/Icinga/puppet-icinga2/blob/master/manifests/params.pp#L98) uses
reloadand notrestart. maybe the state file update on reload is broken? because I see the file being updated during asystemctl restart icinga2but NOT on asystemctl reload icinga2This would also explain, why we see this issue after changing from our fork of the old icinga2 puppet module to the new one.
and nothing about this in the log (just the periodic writes)