Occasionally icinga constantly recreates same downtime over and over again, and barely does anything else. Issue goes away after reloads.
Not sure how to reproduce.
A downtime gets created once.
[2019-05-16 23:24:05 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
Include as many relevant details about the environment you experienced the problem in
icinga2 --version):icinga2 feature list):Config validation (icinga2 daemon -C):
...
[2019-05-23 14:55:33 -0400] information/cli: Finished validating the configuration file(s).
If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
object Zone "global-templates" {
global = true
}
object Zone "master" {
endpoints = ["master1", "master2"]
}
object Zone "zone1" {
endpoints = ["sat1", "sat2"]
parent = "master"
}
object Zone "zone2" {
endpoints = ["sat3", "sat4"]
parent = "master"
}
Note here master[12]/sat[1234] are simplified hostnames just to show the configuration.
Add any other context about the problem here.
We first saw this issue on 2.10.2, was hoping 2.10.4 has a fix but it doesn't. The behavior is slightly different though. In 2.10.2, issue happened on one master and two endpoints in the same zone at the same time. With 2.10.4, so far it happend twice, both on two endpoints in the same zone only, not on master.
How's that Downtime being added initially - via API action/Icinga Web 2 or generated from a ScheduledDowntime object?
Hi, thanks for looking into this.
We schedule downtimes from API on one of the master. One condition that could trigger this, was I found the downtime being recreated happened to have been scheduled on a master, when the master just started reloading, but the parent process was still processing everything. We have a large number of hosts/services, usually it takes a few minutes for the parent to finish up before the children processes take over.
Hi,
We have seen the same issue with Comments. From time to time our satellites go into spin and recreates comments about 200K times an hour. It is only two comments that are recreated over and over again. It looks like the config master disconnects from the satellite:
[2019-06-11 09:46:03 +0000] warning/TlsStream: TLS stream was disconnected.
[2019-06-11 09:46:03 +0000] warning/JsonRpcConnection: API client disconnected for identity 'eu-west-1-icinga-01.hosted.axis.com'
[2019-06-11 09:46:03 +0000] warning/ApiListener: Removing API client for endpoint 'eu-west-1-icinga-01.hosted.axis.com'. 0 API clients left.
And then the two comments are recreated:
[2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!f4a48c8c-d6c4-4f2c-9fb6-0f1
3d5545762' of type 'Comment'.
[2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!a1e4a26e-3e6b-45df-9fa6-5
13447d29d90' of type 'Comment'.
Brgds
This may be related to another bug. Are you using the IDO and see those replicated Downtimes/Comments there?
Hi,
We use IDO on the config master and the comments are only stored once in the database.
Did a restart of Icinga2 on the config master and got this on the satellites:
[2019-06-18 06:34:07 +0000] information/ApiListener: Applying config update from endpoint 'eu-west-1-icinga-01.hosted.axis.com' of zone 'master'.
[2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//.timestamp
[2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/global-templates' (17 Bytes). Received timestamp
'2019-06-18 06:33:55 +0000' (1560839635.996240), Current timestamp '2019-06-13 02:00:43 +0000' (1560391243.082262).
[2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/us-west-2//.timestamp
[2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/us-west-2' (17 Bytes). Received timestamp '2019-06-18 06:33:55 +0000' (1560839635.993111), Current timestamp '2019-06-13 02:00:42 +0000' (1560391242.987072).
After this the create bonanza starts and continues until Icinga is restarted on the satellite.
I've started debuglog on one of our satellites and hopefully I can trigger the error while debuglog is running.
I restart the config master and the debuglog on satellite 1 shows :
[2019-06-18 06:52:18 +0000] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'SATELLITE2'
[2019-06-18 06:52:18 +0000] notice/ApiListener: Received update for object: {"config":"object Comment \"c1a064e3-9d59-4973-a018-bc1a7c8f2164\" ignore_on_error {\n\tauthor = \"admin\"\n\tentry_time = 1549964140.574786\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"HOSTNAME\"\n\tpersistent = false\n\tservice_name = \"reboot-required\"\n\ttext = \"Reboot\"\n\tversion = 1549964140.574807\n\tzone = \"us-west-2\"\n}\n","modified_attributes":{},"name":"HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164","original_attributes":[],"type":"Comment","version":1549964140.5748069286}
[2019-06-18 06:52:18 +0000] notice/ApiListener: Discarding config update for object 'HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164': Object version 1549964140.574807 is more recent than the received version 1549964140.574807.
This repeats for two comments until Icinga2 is restarted. So it looks like the error occurs when the object update comes from satellite 2
Investigate on whether your _api package is broken on the affected endpoints, this is the internal storage for runtime created objects including downtimes/comments. Fixes for this are included in the upcoming 2.11 with snapshot packages available for testing.
You can also analyse this on your own with using this guide.
It looks alright when I follow the guide.
I've also stopped icinga on the satellite, removed packages and zones from /var/lib/icinga2/api and started icinga again to make sure that it gets a completely fresh sync from the config master.
I will continue to try and trigger the error and see if I can collect any more information.
@omattsson which Icinga version are you using?
@dnsmichi We run 2.10.3-1 in production at the moment.
Hm, ok. And there are no time difference between the instances, e.g. ntp is configured and the local ntp checks don't say that the offset has gone nuts?
They use the same NTP server and the offset on satellite 1 is -0.0005632638931 secs and satellite 2 is 0.007149338722 secs. The master has an offset of 0.0002931952477 secs. I consider that to be an okay offset at least.
Ok, thanks. Since you're not the original issue author, I'm just going the detail collect route again here. I'm wondering about the "comment object route" here, like which endpoint is the source for the comment creation and which other endpoints do the routing of that message, each creating the comment ... that would be awesome if extracted from the debug logs.
@jhou4 are you using ScheduledDowntime objects which generate such runtime Downtime objects? I've now spotted a similar case where one of my masters had a local host object with services, generated a downtime from an SD, and the first master couldn't apply the object because there was a missing object reference.
[2019-06-19 16:53:58 +0200] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'master2'
[2019-06-19 16:53:58 +0200] notice/ApiListener: Received update for object: {"config":"object Downtime \"08998dc7-c0e4-46fa-a382-39b438097c75\" ignore_on_error {\n\tauthor = \"icingaadmin\"\n\tauthoritative_zone = \"master\"\n\tcomment = \"Scheduled downtime for backup\"\n\tconfig_owner = \"master2!load!backup-downtime\"\n\tduration = 0.000000\n\tend_time = 1560992400.000000\n\tentry_time = 1560955436.475581\n\tfixed = true\n\thost_name = \"master2\"\n\tscheduled_by = \"master2!load!backup-downtime\"\n\tservice_name = \"load\"\n\tstart_time = 1560988800.000000\n\ttriggered_by = \"\"\n\tversion = 1560955436.475701\n}\n","modified_attributes":{},"name":"master2!load!08998dc7-c0e4-46fa-a382-39b438097c75","original_attributes":[],"type":"Downtime","version":1560955436.475701}
[2019-06-19 16:53:58 +0200] notice/ConfigCompiler: Compiling config file: var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf
[2019-06-19 16:53:58 +0200] debug/configitem: Committing 1 new items.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Spawning WorkQueue threads for 'ConfigObjectUtility::CreateObject'
[2019-06-19 16:53:58 +0200] debug/ConfigItem: Commit called for ConfigItem Type=Downtime, Name=08998dc7-c0e4-46fa-a382-39b438097c75
[2019-06-19 16:53:58 +0200] notice/ConfigObject: Ignoring config object '08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime' due to errors: Error: Validation failed for object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'; Attribute 'host_name': Object 'master2' of type 'Host' does not exist.
Location: in var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf: 1:0-1:69
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(1): object Downtime "08998dc7-c0e4-46fa-a382-39b438097c75" ignore_on_error {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(2): author = "icingaadmin"
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(3): authoritative_zone = "master"
[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items of type 'Downtime'.
[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items.
[2019-06-19 16:53:58 +0200] information/ConfigObjectUtility: Created and activated object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Stopped WorkQueue threads for 'ConfigObjectUtility::CreateObject'
That would also explain the always same error message, since the object never is really created on the endpoint.
In our environment most downtimes are scheduled from schedule-downtime API call. Occasionally user could create adhoc downtime from Icingaweb. We do not use ScheduledDowntime(there is only a sample ScheduledDowntime rule in downtimes.conf, but I verified it did not create any object).
Ok, then a different thought - is the object where the downtime is created for, really existing on the first master/satellite endpoint which seems to log a loop of created objects?
Your initial log shows a host object, which might be defined not inside the master zone, can you share the output of
icinga2 object list --type Host --name appbig-rr-931
on both masters?
Unless configuration was out of sync somehow, I would assume that the object existed on masters/satellites. Didn't verify that particular host object last time when I looked at it. Will keep it in mind and do that next time.
Here are the outputs from two masters.
Config master:
iob1-cas-icinga01 /etc/icinga2/zones.d/global-templates/base-checks # icinga2 object list --type Host --name appbig-rr-931
Object 'appbig-rr-931' of type 'Host':
% declared in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114751:1-114751:27
Second master:
inj1-cas-icinga01 /var/log/icinga2 # icinga2 object list --type Host --name appbig-rr-931
Object 'appbig-rr-931' of type 'Host':
% declared in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114751:1-114751:27
We had a new occurrence last night. It was auto-fixed(we put in a script to monitor and auto-fix the condition). So I couldn't tell if configuration was out of sync. But there were some extra errors, which hopefully can help to identify the cuase:
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] critical/config: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(1): object Downtime "492f19a6-f587-485b-9f4c-faa1b322169c" ignore_on_error {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(2): author = "icinga2adm"
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(3): comment = "OS REBOOT"
[2019-06-24 22:57:16 -0400] critical/config: 1 error
[2019-06-24 22:57:16 -0400] critical/ApiListener: Could not create object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c':
[2019-06-24 22:57:16 -0400] critical/ApiListener: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
[2019-06-24 22:57:16 -0400] warning/PluginCheckTask: Check command for object 'nj-dfm3!check_ntp_proc' (PID: 67154, arguments: '/usr/lib64/nagios/plugins/check_nrpe' '-H' 'nj-dfm3' '-c' 'check_ntp_proc' '-t' '30') terminated with exit code 255, output: connect to address 10.126.187.94 port 5666: Connection refused
connect to host nj-dfm3 port 5666: Connection refused
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
...
Above logs were from one of the satellites. The same downtime was also being created repeatedly on two masters and the other satellite in the same zone at the same time. Below was the configure master(where the downtime was initially created via api), it started 10 minutes after the downtime was created:
iob1-cas-icinga01 /var/log/icinga2 # grep p203 icinga2.log-20190625 |more
[2019-06-24 22:47:33 -0400] information/ConfigObjectUtility: Created and activated object 'p203!c86ed88e-2dee-4773-8151-e0c68d9c34df' of type 'Downtime'.
[2019-06-24 22:47:35 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!53e5662c-aa4f-470a-a67e-8bb72a88916b' of type 'Downtime'.
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:13 -0400] information/ConfigObjectUtility: Created and activated object 'p203!45aed60e-cd3c-446a-90e4-9aac8e179c37' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
...
Correction, the one created at 22:47:35 was actually different from the repeating one starting at 22:57:15.
Could it be /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/$problematicDowntime.conf gets deleted and recreated after every load?
I'm looking at the code but don't understand how this happens after the Object Registration.
The issue goes away after a reload("systemctl reload icinga2" command).
@jhou4 can you please collect all the debug logs from both instances when this exact situation happens again? (or, if you have them already).
Here's a safe place to upload, no-one except you and us can see the files. Thanks. https://nextcloud.icinga.com/index.php/s/eLDRTLyKLeA2mEG
It may be a little challenging to collect debug logs due to the fast rate it grows.
But I can try logrotate. Will do some testing. Just want to confirm, you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?
It looks like we can't rotate debug.log without restarting icinga2 service. The disk space I have available can hold a few hours' logs. That means I will have to restart icinga2 every few hours, which is not very practical, especially considering this issue only happens once or twice a month.
Any suggestions on how to collect the debug logs?
you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?
All of them, if possible.
It looks like we can't rotate debug.log without restarting icinga2 service.
There's a special signal which is sent by logrotated - Icinga then knows to release the file handles while logrotate already moved the files.
/bin/kill -USR1 $(pidof icinga2) 2> /dev/null || true
Btw, anyone who experiences this issue, is invited to upload the debug logs to the nextcloud URL. The more logs are there, the better the error can be analysed.
That works. Thanks.
I set it up on all our Icinga instances. Now just need to wait until it happens again.
We got lucky, the issue happened at 01:28 this AM. I uploaded two debug logs(one from the config master, the other one was from a satellite in the affected zone) for now due to the size. The uploaded files are showing a "!" sign after finishing. Not sure if it is normal. Please check.
Also, let me know if you need logs from other icinga nodes.
Thanks.
Didn't work unfortunately. How big is the size of the compressed logs?
The compressed files were 1.3G/0.5G. I tried again but still got the same thing, it took a while, then gave the "!" sign but didn't say what was wrong though.
Ok, we had a limit there, it is now raised and should work. Sorry for the trouble.
I just uploaded the file again, result was still the same so I am not sure if it worked. Please check. Thanks.
Didn't work, and we re-tested our nextcloud instance with a dummy file of 1.9 GB. Seems your connection causes trouble here. Do you have the chance to put it somewhere safe and send me the URL to michael.[email protected]?
Company network may have firewall blocking the upload indeed. I am looking into this.
Sorry for the delay. I am still trying to get an OK for sharing the data.
With comments (acknowledgement) only the set-acknowledgement and clear-acknowledgement events are logged. They are relevant to the Checkable they belong to but not the comment.
The issue is either with the config sync or the events sent during runtime. I'll investigate under which circumstances we are sending an UpdateObject event, maybe we are sending one when we shouldn't which then easily could be faulty and confusing.
I ran into the bug and was able to extract some information with the help of @dnsmichi
Master Master setup, henceforth named M1 and M2. M1 is the config master. Both instances are 2.10.5 + a few debug log messages patched in. They are not under much load and have ntpd active. M2 crashed with a segfault (my mistake) previously to the bug occurring.
[2019-07-17 11:04:11 +0200] information/ApiListener: New client connection from [127.0.0.1]:35686 (no client certificate)
[2019-07-17 11:04:11 +0200] information/HttpServerConnection: Request: POST /v1/actions/acknowledge-problem?host=horst (from [127.0.0.1]:35686), user: root)
[2019-07-17 11:04:11 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying 'event::SetAcknowledgement' message. Write log is 1
[2019-07-17 11:04:11 +0200] information/HttpServerConnection: HTTP client disconnected (from [127.0.0.1]:35686)
[2019-07-17 11:04:11 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740a9600' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740aaee0' message from '0x7f2b74001870' to 'master' Want log: 0 need log: 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740012b0' message from '0x7f2b74001910' to 'master' Want log: 1 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "event::ClearAcknowledgement"
params = {
host = "horst"
}
ts = 1563354586.692563
}' message from '0x7f3da8001280' to 'master' Want log: 1 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
....
M2:
Starts with a single Create followed by Delete before switching to two creates
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::UpdateObject"
params = {
config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time =
0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
modified_attributes = {
}
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
original_attributes = [ ]
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.626437
}' message from '0' to 'master' Want log: 0 need log: 1
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::DeleteObject"
params = {
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.693694
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::UpdateObject"
params = {
config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
modified_attributes = {
}
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
original_attributes = [ ]
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.713520
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::UpdateObject"
params = {
config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
modified_attributes = {
}
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
original_attributes = [ ]
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.718544
}' message from '0x7f3da8001640' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
jsonrpc = "2.0"
method = "config::DeleteObject"
params = {
name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
type = "Comment"
version = 1563354251.591259
}
ts = 1563354586.718717
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
...
Note that the version stays the same but the timestamp changes.
The issue resolved when we tried to attach gdb on M1 to the process and thereby stopped execution:
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:42:54 +0200] information/Application: We jumped forward in time: 151.2 seconds
[2019-07-17 11:42:54 +0200] warning/JsonRpcConnection: API client disconnected for identity 'icinga2-centos7-dev.vagrant.demo.icinga.com'
[2019-07-17 11:42:54 +0200] warning/ApiListener: Removing API client for endpoint 'icinga2-centos7-dev.vagrant.demo.icinga.com'. 0 API clients left.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.
While this was happening the file in question was deleted and an created all the time (M2):
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
ls: cannot access lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf: No such file or directory
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf
While I don't think datetime offset between the servers has anything to do with this, this is certainly a timing issue. The version (which did not change) is the most important change indicator, the check for it may be problematic.
Other than that the existence of two CreateObject event is notable. One has the instance it self as origin, the other that of the remote. It would be interesting to know if they are the same ones as on the other master
I will play around with the versions a bit, maybe I'm able to force the bug this way. If so it will go a long way to finally put this issue to rest.
I am glad to see that we are getting closer to the root cause of this issue. Sorry I couldn't provide the full debug logs. But maybe I can extract some info from the logs if that helps. Just let me know. Thanks.
Another run in with this + more log info:
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3430480957}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
(0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'
[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3733160496}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Discarding config update for object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47': Object version 1563370041.845422 is more recent than the received version 1563370041.845422.
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3735001087}' message from '0x7f76b0023380' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3739030361}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
(0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'
[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.
It's flipping between activation and deactiviation, triggering the activeChanged event as Michi suspected. In this case I had the following patches compiled in in hopes of forcing this (there is no reliable way to reproduce this yet)
+++ b/lib/remote/configobjectutility.cpp
@@ -89,6 +89,8 @@ String ConfigObjectUtility::CreateObjectConfig(const Type::Ptr& type, const Stri
/* update the version for config sync */
allAttrs->Set("version", Utility::GetTime());
+ allAttrs->Set("version", Utility::GetTime()+60);
+
+++ b/lib/remote/apilistener-configsync.cpp
@@ -191,7 +195,7 @@ Value ApiListener::ConfigUpdateObjectAPIHandler(const MessageOrigin::Ptr& origin
}
/* keep the object version in sync with the sender */
- object->SetVersion(objVersion, false, origin);
+ object->SetVersion(objVersion+2, false, origin);
And hank you @jhou4 for your help so far. I think we have a good idea of where this is coming from but since we still can't reproduce this as needed, keep an eye out for this bug. Anything regarding the circumstances of this is interesting, eg. does it also happen during runtime or shutdown? Have other objects that Downtimes and Comments been affected etc.
The main question is - why would SetVersion() generate a Delete event. That's not yet clear to me.
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
C
Trace this event down. Maybe it is related to the reload handling.
Since it's only Downtimes and Comments (Ack comments in the cases I observed) it is also possible this happens as intended. Icinga starts, Host is checked and up again, Acknowledgement deleted.
Logs should prove that ;-)

In this case I had two acknowledgements on the Host, so double the fun. This also makes me think it's not something specific to that one object but tied to the ClearAcks event.
My workflow to semi reliably reproduce this:
object Host "horst" { check_command = "dummy"; check_interval=10s }M2 (restarted a bunch):
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::ClearAcknowledgement","params":{"host":"horst"},"ts":1563375558.5422809124}' message from '0x7fe470000fb0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5436320305}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","type":"Comment","version":1563375351.7715198994},"ts":1563375558.5441069603}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::SetNextCheck","params":{"host":"horst","next_check":1563375568.0342299938},"ts":1563375558.5444951057}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::CheckResult","params":{"cr":{"active":true,"check_source":"dfws","command":null,"execution_end":1563375556.9986350536,"execution_start":1563375556.9986350536,"exit_status":0.0,"output":"Check was successful.","performance_data":[],"schedule_end":1563375556.9986488819,"schedule_start":1563375556.998208046,"state":0.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":2.0,"state_type":0.0}},"host":"horst"},"ts":1563375558.5446588993}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
(0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5641889572}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5649349689}' message from '0x7fe470001db0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
(0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.
M1 (left running):
[2019-07-17 16:59:18 +0200] warning/ApiListener: Ignoring config update for zone 'master' because we have an authoritative version of the zone's config.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_tim
e = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n"
,"modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
(0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.5496549606}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.550219059}' message from '0x7f5fa0017fe0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e': Object version 1563375351.771520 is more recent than the receiv
ed version 1563375351.771520.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
(0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5557129383}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5573120117}' message from '0x7f5fa0017ae0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c': Object version 1563375558.211277 is more recent than the received version 1563375558.211277.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Downtime \"ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379\" ignore_on_error {\n\tauthor = \"baum\"\n\tcomment = \"tree\"\n\tconfig_owner = \"\"\n\tduration = 0.000000\n\tend_time = 1563370435.000000\n\tentry_time = 1563370075.939388\n\tfixed = true\n\thost_name = \"horst\"\n\tscheduled_by = \"\"\n\tstart_time = 1563370075.000000\n\ttriggered_by = \"\"\n\tversion = 1563370135.939455\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379","original_attributes":[],"type":"Downtime","version":1563370135.9394550323}
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379': Object version 1563370135.939455 is more recent than the received version 1563370135.939455.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5586109161}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
(0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'
@Crunsher - In the past with version 2.10.2, this issue(when it happened) always starts shortly after a reload. With version 2.10.4, it started on the primary master or satellites without an obvious reload(although I am not able to verify if the second master actually crashed right before that).
I have only seen this behavior with downtimes, not other objects.
@Crunsher I don't exactly see what the logs are telling here. Our main focus should be on why a Comment/Downtime would be deleted thus following the create and delete again. Obviously it has to happen in the same moment, with a delay (stop or gdb attached process) it stops immediately. This is why I do think that it is related to the object version and activation handling in this specific scope.
We should also try whether the version-patch works to trigger the Downtime loop.
@jhou4 how is it going with the debug logs on your side?
The removal of the Acknowledgement is the trigger, something similar probably has to happen for downtimes. I had a downtime while testing this but only the Ack comment would be duplicated and only when the ClearAcknowledgement Event was involved.
Since this issue is about downtimes, we should try to reproduce it with this type as well. This points to specific object activation/deactivation and our handlers with receiving/denying config updates based on their version.
@dnsmichi There is some internal concern with sharing the logs. Very likely we will have to scrub the logs before uploading. I am still trying to get it approved. Sorry about that.
Reproducing this with Downtimes has turned out more complicated. Analogous to Comments I think this is tied to the automatic deletion, in this case the Expire handler.
With Downtimes it runs on a timer and not on demand, this makes getting to that cirtical point harder to predict.
Something I did observe is:
I think this is close to the issue but it's missing an ingredient X, possibly high load or a unfinished startup.
Cancelling a downtime via remove-downtime should possibly have the same impact. Otherwise you might just inject some code which automatically expires the downtime after X seconds to have this predictable in your tests.
I can now very reliably reproduce this with Acknowledgements.
The following happens:
You get where I'm going with this.
See also, this handy diagram:

This is further supported by making the replay log on M1 read only suppressing the bug. What makes it weird is that Downtimes do not have an event like that.
Final remarks before I sign off for the weekend (So I remember what's going on on Monday):
Might be related: #7094
We just uploaded debug logs from two masters and two satellites.
The files have about 10 minutes' logs around the time when the issue started. Please take a look.
Sorry for the delay.
Thanks. I've shared it with the responsible team members, @Crunsher will come to you with more details asap.
The issue has gotten very long, so here is a summary.
Downtimes and Comments would be created repeatedly, spamming the log and IDO history tables.
Only zones with more than one endpoint are affected. Further the Comments and Downtimes are removed/expired. I have seen this issue with versions of icinga 2.10.x upwards.
The easiest way to reproduce this by patching the code to force a version mismatch and have the following Host in a zone with two Endpoints
object Host "horst" {
check_command = "dummy"
check_interval=10s
max_check_attempts = 1
}
Then, with both instances running, have the Host go down via the API and acknowledge the problem:
curl -k -s -u root:icinga -H 'Accept: application/json' \
-X POST 'https://localhost:5665/v1/actions/process-check-result' -d \
'{"type": "Host", "filter": "host.name==\"horst\"", "exit_status": 1, "plugin_output": "Host is not available."}'
curl -k -s -u root:icinga -H 'Accept: application/json' \
-X POST 'https://localhost:5665/v1/actions/acknowledge-problem' -d \
'{"type": "Host", "filter": "host.name==\"horst\"", "author": "admin", "comment": "Ok"}'
Now start restarting one of the masters until the problem occurs. The aim is to restart just when the other master checks the Host, recovers and clears the acknowledgement.
The object version is a timestamp used to ignore changes to objects which we have a newer version. The patch is only there to trigger the beginning of the loop. Using a timestamp is a good first filter for objects, but considering things may go wrong under high load a newer timestamp might not mean the object is actually different.
This does not mean the issue is with the version per se, the objects bouncing around are in the case of @jhou4 the same through the log.
So far my money was on what I described here but the log from @jhou4 does not match this. I am missing two log messages especially:
notice/ApiListener: Processing config update for object... for updates from the other master and notice/Downtime: Removed downtime ... by RemoveDowntime triggered by the DowntimeExpireHandler.
I have multiple logs from different sources, all with the same result but all slightly different. This problem is not new, but it is observed by more users and more frequently now. While my handmade config loop can be reproduced with the current master, it may be worth the shot to try out the snapshop packages - if only to give us more informative logs ;)
As for our Object versioning, I have been talking to the IcingaDB team if it could be possible to move the checksum creation and comparison from the specific feature into the objects themself to be used by the ApiListener ConfigSync.
Working on this with @bobapple this week who could create a scenario where to reliably reproduce this.
Spoiler: It is not the version, but OnActiveChanged() with missing message origin object creation, in a HA master setup where the JSON-RPC queues are massively overloaded to move the Create/Delete (from expiry or cancel) after the first master's events.

We decided to move this into 2.11 as it affects customer environments testing the RC, and we now have a reliable reproducer.
@bobapple tested my patch quite extensively in the past 24 hours.
Downtime and Comment creation were specific cluster events and handlers prior to 2.4. With the addition of runtime create config objects, we've unified this behaviour into a general cluster message with the config::UpdateObject method.
This sync is located in https://github.com/Icinga/icinga2/blob/master/lib/remote/apilistener-configsync.cpp
Object events include the following and are triggered from OnActiveChanged and OnVersionChanged being the ConfigObject attributes active and version.
SetActive() is called during config object activation after compiling the config item into a config object. SetVersion() is called whenever an object was modified, e.g. through a POST request against /v1/objects or when synced throughout the cluster.
Depending on the object being active, or !active and marked as deleted, specific functions are called.
They prepare the cluster messages being relayed into the current zone.
First checks whether the target zone should really receive the message.
Reads the config content from the _api package storage. Adds modified attributes if any.
The message is either relayed into the zone, or directly sent to a connected endpoint via JsonRpcConnection.
First checks whether the target zone should really receive the message.
Creates delete message including the object's version.
The message is either relayed into the zone, or directly sent to a connected endpoint via JsonRpcConnection.
These are registered on config::UpdateObject and config::DeleteObject message types.
https://icinga.com/docs/icinga2/snapshot/doc/19-technical-concepts/#configupdateobject
https://icinga.com/docs/icinga2/snapshot/doc/19-technical-concepts/#configdeleteobject
One fundamental part of the cluster message routing is the MessageOrigin object. This is created when a new JSON-RPC message is received. It contains
These attributes are checked in message receive api handlers for security access. E.g. whether a message origin is from a child zone which is not allowed, etc.
Whenever such a message is processed on the client, it may trigger additional cluster events which are sent back to other endpoints. Therefore it is key to always pass the MessageOrigin origin when processing these messages locally.
Example:
sender for nowOnNewCheckResult() also calls a registered cluster handler which forwards the CheckResult to other cluster members.
Without any origin details, this CheckResult would be relayed to the sender endpoint again.
Which processes the message, ProcessCheckResult(), OnNewCheckResult(), sends back and so on.
That creates a loop which our cluster protocol needs to prevent at all cost.
RelayMessageOne() takes care of the routing. This involves fetching the targetZone for this message and its endpoints.
e1 is zone master, e2 and e3 are zone members.
Message is sent from e2 or e3:
!isMaster == true
targetEndpoint e1 is zone master -> send the message
targetEndpoint e3 is not zone master -> skip it, avoid routing loops
Message is sent from e1:
!isMaster == false -> send the messages to e2 and e3 being the zone routing master.
With passing the origin the following condition prevents sending a message back to sender:
if (origin && origin->FromClient && targetEndpoint == origin->FromClient->GetEndpoint()) {
This message then simply gets skipped for this specific Endpoint and is never sent.
CreateObject() and DeleteObject() don't forward the origin from received cluster messages.
Both function calls result in OnActiveChanged() being called again. This results in sending the cluster message again.
No origin - back to sender, being the first master which created the downtime/comment.
m1 -> create downtime
m2 -> receive downtime, OnActiveChanged -> send back to m1
m1 -> has additional guards with object->GetVersion() comparison and also checks for existance
Whenever the JSON-RPC messages exchanged between the two master nodes are many, and the json_rpc queues are full - you can monitor this via the icinga check as well as /v1/status REST API endpoint - something weird happens.
You'll get the idea also shown in in this picture.

Always add the MessageOrigin when processing config::* cluster events.
The culprit is that OnActiveChanged() is called without the origin. This needs some dancing inside the code to get there - which is described below.
Value ApiListener::ConfigDeleteObjectAPIHandler(const MessageOrigin::Ptr& origin, const Dictionary::Ptr& params)
if (!ConfigObjectUtility::DeleteObject(object, true, errors, nullptr)) {
Value ApiListener::ConfigUpdateObjectAPIHandler(const MessageOrigin::Ptr& origin, const Dictionary::Ptr& params)
Either via
if (!ConfigObjectUtility::CreateObject(ptype, objName, config, errors, nullptr)) {
or
object->SetVersion(objVersion, false, origin);
object->SetVersion(objVersion, false, origin);
-> OnVersionChanged is not triggered at all, since it contains the origin.
OnActiveChanged -> debug//lib/base/configobject-ti.cpp via mkclass compiler
NotifyActive() -> configobject.cpp
lib/config/configitem.cpp
bool ConfigItem::ActivateItems(WorkQueue& upq, const std::vector
void ConfigObject::StopObjects() -> Onshutdown, not deleted.
bool ConfigObjectUtility::DeleteObjectHelper(const ConfigObject::Ptr& object, bool cascade,
const Array::Ptr& errors, const Array::Ptr& diagnosticInformation)
try {
/* mark this object for cluster delete event */
object->SetExtension("ConfigObjectDeleted", true);
/* triggers signal for DB IDO and other interfaces */
object->Deactivate(true);
NotifyActive() is called without origin cookie.
This needs to be wrapped via
object->Activate(runtimeCreated, cookie) -> ActivateItems -> CreateObject
object->Deactivate(runtimeCreated, cookie)
This bug exists since v2.4.0, so nearly 4 years.
Related issues and tickets:
ref/NC/618973
ref/NC/628495
ref/NC/509037
ref/NC/506213
ref/NC/573719
@jhou4 Since you're on RHEL, I've triggered snapshot packages in our package pipeline.
https://git.icinga.com/packaging/rpm-icinga2/pipelines/4916
Once built, please test the snapshot packages.
We tested 2.11.0 rpms on Redhat 7, and ran into some major issues:
I suspect those were from the significant rework done in version 2.11.0. We do have our own workaround for this downtime bug in place. So we will probably wait some time before upgrading.
If you can reproduce these problems, please collect all valuable details about your environment and create separate issues for each problem, please.
Unfortunately I have a Dev icinga cluster but it doesn't have the same workload as prod so I can't re-produce the problem easily. I searched existing issues with 2.11, think the problem I ran into are similar to a few others already opened. So I will just keep checking those. Thanks.
Okidoki, that's just fine. I only want to ensure that you're not running into something we do not know, and later on, you hit exactly that even when expecting it being fixed.
Most helpful comment
7423 adds more logging and fixes the problem with ignored downtime objects (non-existing hosts) with always logging the creation. This ensures that troubleshooting and debugging things like this becomes easier @widhalmt @mwaldmueller