Icinga2: Same downtime being created repeatedly in a cluster loop

Created on 23 May 2019  路  67Comments  路  Source: Icinga/icinga2

Describe the bug

Occasionally icinga constantly recreates same downtime over and over again, and barely does anything else. Issue goes away after reloads.

To Reproduce

Not sure how to reproduce.

Expected behavior

A downtime gets created once.

Screenshots

[2019-05-16 23:24:05 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:06 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.
[2019-05-16 23:24:07 -0400] information/ConfigObjectUtility: Created and activated object 'appbig-rr-931!7240b8e2-91aa-4b6a-b4db-800800e5ffd0' of type 'Downtime'.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version):
    r2.10.4-1, but this happened on 2.10.2 as well.
  • Operating System and version:
    Red Hat Enterprise Linux Server release 7.3 (Maipo)
  • Enabled features (icinga2 feature list):
    Enabled features: api checker mainlog
  • Icinga Web 2 version and modules (System - About):
    2.6.2
    monitoring | 2.6.2
    setup | 2.6.2
  • Config validation (icinga2 daemon -C):
    ...
    [2019-05-23 14:55:33 -0400] information/cli: Finished validating the configuration file(s).

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
    object Zone "global-templates" {
    global = true
    }

object Zone "master" {
endpoints = ["master1", "master2"]
}

object Zone "zone1" {
endpoints = ["sat1", "sat2"]
parent = "master"
}

object Zone "zone2" {
endpoints = ["sat3", "sat4"]
parent = "master"
}
Note here master[12]/sat[1234] are simplified hostnames just to show the configuration.

Additional context

Add any other context about the problem here.
We first saw this issue on 2.10.2, was hoping 2.10.4 has a fix but it doesn't. The behavior is slightly different though. In 2.10.2, issue happened on one master and two endpoints in the same zone at the same time. With 2.10.4, so far it happend twice, both on two endpoints in the same zone only, not on master.

areruntime bug queuimportant reNC

Most helpful comment

7423 adds more logging and fixes the problem with ignored downtime objects (non-existing hosts) with always logging the creation. This ensures that troubleshooting and debugging things like this becomes easier @widhalmt @mwaldmueller

All 67 comments

How's that Downtime being added initially - via API action/Icinga Web 2 or generated from a ScheduledDowntime object?

Hi, thanks for looking into this.
We schedule downtimes from API on one of the master. One condition that could trigger this, was I found the downtime being recreated happened to have been scheduled on a master, when the master just started reloading, but the parent process was still processing everything. We have a large number of hosts/services, usually it takes a few minutes for the parent to finish up before the children processes take over.

Hi,

We have seen the same issue with Comments. From time to time our satellites go into spin and recreates comments about 200K times an hour. It is only two comments that are recreated over and over again. It looks like the config master disconnects from the satellite:
[2019-06-11 09:46:03 +0000] warning/TlsStream: TLS stream was disconnected.
[2019-06-11 09:46:03 +0000] warning/JsonRpcConnection: API client disconnected for identity 'eu-west-1-icinga-01.hosted.axis.com'
[2019-06-11 09:46:03 +0000] warning/ApiListener: Removing API client for endpoint 'eu-west-1-icinga-01.hosted.axis.com'. 0 API clients left.

And then the two comments are recreated:
[2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!f4a48c8c-d6c4-4f2c-9fb6-0f1
3d5545762' of type 'Comment'.
[2019-06-11 09:46:08 +0000] information/ConfigObjectUtility: Created and activated object 'HOSTNAME!reboot-required!a1e4a26e-3e6b-45df-9fa6-5
13447d29d90' of type 'Comment'.

Brgds

This may be related to another bug. Are you using the IDO and see those replicated Downtimes/Comments there?

Hi,

We use IDO on the config master and the comments are only stored once in the database.

Did a restart of Icinga2 on the config master and got this on the satellites:

[2019-06-18 06:34:07 +0000] information/ApiListener: Applying config update from endpoint 'eu-west-1-icinga-01.hosted.axis.com' of zone 'master'.
[2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//.timestamp
[2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/global-templates' (17 Bytes). Received timestamp
'2019-06-18 06:33:55 +0000' (1560839635.996240), Current timestamp '2019-06-13 02:00:43 +0000' (1560391243.082262).
[2019-06-18 06:34:07 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/us-west-2//.timestamp
[2019-06-18 06:34:07 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/us-west-2' (17 Bytes). Received timestamp '2019-06-18 06:33:55 +0000' (1560839635.993111), Current timestamp '2019-06-13 02:00:42 +0000' (1560391242.987072).

After this the create bonanza starts and continues until Icinga is restarted on the satellite.

I've started debuglog on one of our satellites and hopefully I can trigger the error while debuglog is running.

I restart the config master and the debuglog on satellite 1 shows :
[2019-06-18 06:52:18 +0000] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'SATELLITE2'
[2019-06-18 06:52:18 +0000] notice/ApiListener: Received update for object: {"config":"object Comment \"c1a064e3-9d59-4973-a018-bc1a7c8f2164\" ignore_on_error {\n\tauthor = \"admin\"\n\tentry_time = 1549964140.574786\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"HOSTNAME\"\n\tpersistent = false\n\tservice_name = \"reboot-required\"\n\ttext = \"Reboot\"\n\tversion = 1549964140.574807\n\tzone = \"us-west-2\"\n}\n","modified_attributes":{},"name":"HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164","original_attributes":[],"type":"Comment","version":1549964140.5748069286}
[2019-06-18 06:52:18 +0000] notice/ApiListener: Discarding config update for object 'HOSTNAME!reboot-required!c1a064e3-9d59-4973-a018-bc1a7c8f2164': Object version 1549964140.574807 is more recent than the received version 1549964140.574807.

This repeats for two comments until Icinga2 is restarted. So it looks like the error occurs when the object update comes from satellite 2

Investigate on whether your _api package is broken on the affected endpoints, this is the internal storage for runtime created objects including downtimes/comments. Fixes for this are included in the upcoming 2.11 with snapshot packages available for testing.

You can also analyse this on your own with using this guide.

It looks alright when I follow the guide.
I've also stopped icinga on the satellite, removed packages and zones from /var/lib/icinga2/api and started icinga again to make sure that it gets a completely fresh sync from the config master.
I will continue to try and trigger the error and see if I can collect any more information.

@omattsson which Icinga version are you using?

@dnsmichi We run 2.10.3-1 in production at the moment.

Hm, ok. And there are no time difference between the instances, e.g. ntp is configured and the local ntp checks don't say that the offset has gone nuts?

They use the same NTP server and the offset on satellite 1 is -0.0005632638931 secs and satellite 2 is 0.007149338722 secs. The master has an offset of 0.0002931952477 secs. I consider that to be an okay offset at least.

Ok, thanks. Since you're not the original issue author, I'm just going the detail collect route again here. I'm wondering about the "comment object route" here, like which endpoint is the source for the comment creation and which other endpoints do the routing of that message, each creating the comment ... that would be awesome if extracted from the debug logs.

@jhou4 are you using ScheduledDowntime objects which generate such runtime Downtime objects? I've now spotted a similar case where one of my masters had a local host object with services, generated a downtime from an SD, and the first master couldn't apply the object because there was a missing object reference.

[2019-06-19 16:53:58 +0200] notice/JsonRpcConnection: Received 'config::UpdateObject' message from 'master2'
[2019-06-19 16:53:58 +0200] notice/ApiListener: Received update for object: {"config":"object Downtime \"08998dc7-c0e4-46fa-a382-39b438097c75\" ignore_on_error {\n\tauthor = \"icingaadmin\"\n\tauthoritative_zone = \"master\"\n\tcomment = \"Scheduled downtime for backup\"\n\tconfig_owner = \"master2!load!backup-downtime\"\n\tduration = 0.000000\n\tend_time = 1560992400.000000\n\tentry_time = 1560955436.475581\n\tfixed = true\n\thost_name = \"master2\"\n\tscheduled_by = \"master2!load!backup-downtime\"\n\tservice_name = \"load\"\n\tstart_time = 1560988800.000000\n\ttriggered_by = \"\"\n\tversion = 1560955436.475701\n}\n","modified_attributes":{},"name":"master2!load!08998dc7-c0e4-46fa-a382-39b438097c75","original_attributes":[],"type":"Downtime","version":1560955436.475701}
[2019-06-19 16:53:58 +0200] notice/ConfigCompiler: Compiling config file: var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf
[2019-06-19 16:53:58 +0200] debug/configitem: Committing 1 new items.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Spawning WorkQueue threads for 'ConfigObjectUtility::CreateObject'
[2019-06-19 16:53:58 +0200] debug/ConfigItem: Commit called for ConfigItem Type=Downtime, Name=08998dc7-c0e4-46fa-a382-39b438097c75
[2019-06-19 16:53:58 +0200] notice/ConfigObject: Ignoring config object '08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime' due to errors: Error: Validation failed for object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'; Attribute 'host_name': Object 'master2' of type 'Host' does not exist.
Location: in var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf: 1:0-1:69
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(1): object Downtime "08998dc7-c0e4-46fa-a382-39b438097c75" ignore_on_error {
                                                                                                                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(2):  author = "icingaadmin"
var-a/lib/icinga2/api/packages/_api/b01e7361-78b3-425d-9c35-7cd214fd2c7d/conf.d/downtimes/master2!load!08998dc7-c0e4-46fa-a382-39b438097c75.conf(3):  authoritative_zone = "master"

[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items of type 'Downtime'.
[2019-06-19 16:53:58 +0200] debug/configitem: Committed 1 items.
[2019-06-19 16:53:58 +0200] information/ConfigObjectUtility: Created and activated object 'master2!load!08998dc7-c0e4-46fa-a382-39b438097c75' of type 'Downtime'.
[2019-06-19 16:53:58 +0200] notice/WorkQueue: Stopped WorkQueue threads for 'ConfigObjectUtility::CreateObject'

That would also explain the always same error message, since the object never is really created on the endpoint.

In our environment most downtimes are scheduled from schedule-downtime API call. Occasionally user could create adhoc downtime from Icingaweb. We do not use ScheduledDowntime(there is only a sample ScheduledDowntime rule in downtimes.conf, but I verified it did not create any object).

Ok, then a different thought - is the object where the downtime is created for, really existing on the first master/satellite endpoint which seems to log a loop of created objects?

Your initial log shows a host object, which might be defined not inside the master zone, can you share the output of

icinga2 object list --type Host --name appbig-rr-931

on both masters?

Unless configuration was out of sync somehow, I would assume that the object existed on masters/satellites. Didn't verify that particular host object last time when I looked at it. Will keep it in mind and do that next time.

Here are the outputs from two masters.
Config master:
iob1-cas-icinga01 /etc/icinga2/zones.d/global-templates/base-checks # icinga2 object list --type Host --name appbig-rr-931
Object 'appbig-rr-931' of type 'Host':
% declared in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114751:1-114751:27

  • __name = "appbig-rr-931"
  • action_url = ""
  • address = "appbig-rr-931"
    % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114754:3-114754:27
  • address6 = ""
  • check_command = "bbhostalive"
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 6:3-6:31
  • check_interval = 60
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 3:3-3:21
  • check_period = ""
  • check_timeout = null
  • command_endpoint = ""
  • display_name = "appbig-rr-931"
    % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114753:3-114753:32
  • enable_active_checks = true
  • enable_event_handler = true
  • enable_flapping = false
  • enable_notifications = true
  • enable_passive_checks = true
  • enable_perfdata = true
  • event_command = ""
  • flapping_threshold = 0
  • flapping_threshold_high = 30
  • flapping_threshold_low = 25
  • groups = [ ]
  • icon_image = ""
  • icon_image_alt = ""
  • max_check_attempts = 4
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 2:3-2:24
  • name = "appbig-rr-931"
  • notes = ""
  • notes_url = ""
  • package = "_etc"
  • retry_interval = 15
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 4:3-4:22
  • source_location

    • first_column = 1

    • first_line = 114751

    • last_column = 27

    • last_line = 114751

    • path = "/etc/icinga2/zones.d/nj/hosts.conf"

  • templates = [ "appbig-rr-931", "generic-host" ]
    % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114751:1-114751:27
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 1:0-1:27
  • type = "Host"
  • vars

    • architecture = "SPARC"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114756:3-114756:29

    • bbcpu_groups = [ "BASS", "BBCPU", "BBRC", "BBXFER", "BPKG", "BRCV", "CHMERA", "DBINI", "DEV", "DEVBIG", "DVMN", "LIBS", "MBUS", "PDCLD", "PRCCHK", "PRQSEX", "RIDGE", "RMDBX", "RTCPUX", "SN2", "SN2A", "SPARC", "TZODB", "SDCLD", "BRDEV", "TSMON", "AUTDLV", "TSPGM", "SPRCSA", "EMA", "SYMNPD", "RSP", "BAGNT", "RPLUSD", "DPRMCM", "API", "DEVQ1", "TSTPR", "RTCPU", "RMQUE", "ORCLC", "DB2C", "CRTEST", "TST2ND", "RSEDPL", "DTICDB", "FUPCSH", "TKRMON", "AUTCRM", "DEVTLS", "FSLINK", "PNWTSK", "ASPLKF", "PROX2", "PARMCM", "DMP", "RMQDT", "RMTEST", "RMTTKH", "RT0RCV", "LVM", "BASQD", "DEVSUN", "ZJOBDV", "PRCLT", "LOGIN", "RD0RCV", "DLOGIN", "BADNS", "FIX", "STPW", "TIER4", "TKHASH", "ZJOB", "NFPVDV", "LABELS", "C2LREP", "SNDMSG", "SETPR", "PHQL", "UCBR", "MTRAGT", "CUSTCM", "CTM20", "CTMA", "BMQ", "FMSVC", "PRQSSALT", "MQCLNT", "UUCHP", "AUTORC", "PROCAUTHSL2", "FEAT", "APPPC" ]

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114766:3-114766:883

    • cluster = "APPBIG"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114760:3-114760:25

    • cluster_parent = "APPPC"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114761:3-114761:31

    • datacenter_region = "RIDGE ROAD"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114762:3-114762:39

    • icingazone = "nj"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114759:3-114759:24

    • ignore_services = [ ]

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114765:3-114765:27

    • platform = "solaris2"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114763:3-114763:28

    • policy_group = "na"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114764:3-114764:26

    • rhst = "APPBIG-RR-931"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114755:3-114755:29

    • stage = "SN2"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114757:3-114757:20

    • substage = "SN2A"

      % = modified in '/etc/icinga2/zones.d/nj/hosts.conf', lines 114758:3-114758:24

  • volatile = false
  • zone = "nj"

Second master:
inj1-cas-icinga01 /var/log/icinga2 # icinga2 object list --type Host --name appbig-rr-931
Object 'appbig-rr-931' of type 'Host':
% declared in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114751:1-114751:27

  • __name = "appbig-rr-931"
  • action_url = ""
  • address = "appbig-rr-931"
    % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114754:3-114754:27
  • address6 = ""
  • check_command = "bbhostalive"
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/templates.conf', lines 6:3-6:31
  • check_interval = 60
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/templates.conf', lines 3:3-3:21
  • check_period = ""
  • check_timeout = null
  • command_endpoint = ""
  • display_name = "appbig-rr-931"
    % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114753:3-114753:32
  • enable_active_checks = true
  • enable_event_handler = true
  • enable_flapping = false
  • enable_notifications = true
  • enable_passive_checks = true
  • enable_perfdata = true
  • event_command = ""
  • flapping_threshold = 0
  • flapping_threshold_high = 30
  • flapping_threshold_low = 25
  • groups = [ ]
  • icon_image = ""
  • icon_image_alt = ""
  • max_check_attempts = 4
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/templates.conf', lines 2:3-2:24
  • name = "appbig-rr-931"
  • notes = ""
  • notes_url = ""
  • package = "_cluster"
  • retry_interval = 15
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/templates.conf', lines 4:3-4:22
  • source_location

    • first_column = 1

    • first_line = 114751

    • last_column = 27

    • last_line = 114751

    • path = "/var/lib/icinga2/api/zones/nj/_etc/hosts.conf"

  • templates = [ "appbig-rr-931", "generic-host" ]
    % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114751:1-114751:27
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/templates.conf', lines 1:0-1:27
  • type = "Host"
  • vars

    • architecture = "SPARC"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114756:3-114756:29

    • bbcpu_groups = [ "BASS", "BBCPU", "BBRC", "BBXFER", "BPKG", "BRCV", "CHMERA", "DBINI", "DEV", "DEVBIG", "DVMN", "LIBS", "MBUS", "PDCLD", "PRCCHK", "PRQSEX", "RIDGE", "RMDBX", "RTCPUX", "SN2", "SN2A", "SPARC", "TZODB", "SDCLD", "BRDEV", "TSMON", "AUTDLV", "TSPGM", "SPRCSA", "EMA", "SYMNPD", "RSP", "BAGNT", "RPLUSD", "DPRMCM", "API", "DEVQ1", "TSTPR", "RTCPU", "RMQUE", "ORCLC", "DB2C", "CRTEST", "TST2ND", "RSEDPL", "DTICDB", "FUPCSH", "TKRMON", "AUTCRM", "DEVTLS", "FSLINK", "PNWTSK", "ASPLKF", "PROX2", "PARMCM", "DMP", "RMQDT", "RMTEST", "RMTTKH", "RT0RCV", "LVM", "BASQD", "DEVSUN", "ZJOBDV", "PRCLT", "LOGIN", "RD0RCV", "DLOGIN", "BADNS", "FIX", "STPW", "TIER4", "TKHASH", "ZJOB", "NFPVDV", "LABELS", "C2LREP", "SNDMSG", "SETPR", "PHQL", "UCBR", "MTRAGT", "CUSTCM", "CTM20", "CTMA", "BMQ", "FMSVC", "PRQSSALT", "MQCLNT", "UUCHP", "AUTORC", "PROCAUTHSL2", "FEAT", "APPPC" ]

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114766:3-114766:883

    • cluster = "APPBIG"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114760:3-114760:25

    • cluster_parent = "APPPC"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114761:3-114761:31

    • datacenter_region = "RIDGE ROAD"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114762:3-114762:39

    • icingazone = "nj"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114759:3-114759:24

    • ignore_services = [ ]

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114765:3-114765:27

    • platform = "solaris2"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114763:3-114763:28

    • policy_group = "na"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114764:3-114764:26

    • rhst = "APPBIG-RR-931"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114755:3-114755:29

    • stage = "SN2"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114757:3-114757:20

    • substage = "SN2A"

      % = modified in '/var/lib/icinga2/api/zones/nj/_etc/hosts.conf', lines 114758:3-114758:24

  • volatile = false
  • zone = "nj"

We had a new occurrence last night. It was auto-fixed(we put in a script to monitor and auto-fix the condition). So I couldn't tell if configuration was out of sync. But there were some extra errors, which hopefully can help to identify the cuase:

[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] critical/config: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(1): object Downtime "492f19a6-f587-485b-9f4c-faa1b322169c" ignore_on_error {
                                                                                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(2):  author = "icinga2adm"
/var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf(3):  comment = "OS REBOOT"

[2019-06-24 22:57:16 -0400] critical/config: 1 error
[2019-06-24 22:57:16 -0400] critical/ApiListener: Could not create object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c':
[2019-06-24 22:57:16 -0400] critical/ApiListener: Error: An object with type 'Downtime' and name 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' already exists (in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69), new declaration: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
Location: in /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c.conf: 1:0-1:69
[2019-06-24 22:57:16 -0400] warning/PluginCheckTask: Check command for object 'nj-dfm3!check_ntp_proc' (PID: 67154, arguments: '/usr/lib64/nagios/plugins/check_nrpe' '-H' 'nj-dfm3' '-c' 'check_ntp_proc' '-t' '30') terminated with exit code 255, output: connect to address 10.126.187.94 port 5666: Connection refused
connect to host nj-dfm3 port 5666: Connection refused
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:16 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
...

Above logs were from one of the satellites. The same downtime was also being created repeatedly on two masters and the other satellite in the same zone at the same time. Below was the configure master(where the downtime was initially created via api), it started 10 minutes after the downtime was created:

iob1-cas-icinga01 /var/log/icinga2 # grep p203 icinga2.log-20190625 |more
[2019-06-24 22:47:33 -0400] information/ConfigObjectUtility: Created and activated object 'p203!c86ed88e-2dee-4773-8151-e0c68d9c34df' of type 'Downtime'.
[2019-06-24 22:47:35 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!53e5662c-aa4f-470a-a67e-8bb72a88916b' of type 'Downtime'.
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:47:38 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:13 -0400] information/ConfigObjectUtility: Created and activated object 'p203!45aed60e-cd3c-446a-90e4-9aac8e179c37' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
[2019-06-24 22:57:15 -0400] information/ConfigObjectUtility: Created and activated object 'p203!check_nrpe_client!492f19a6-f587-485b-9f4c-faa1b322169c' of type 'Downtime'.
[2019-06-24 22:57:15 -0400] information/Checkable: Checking for configured notifications for object 'p203!check_nrpe_client'
...

Correction, the one created at 22:47:35 was actually different from the repeating one starting at 22:57:15.

Could it be /var/lib/icinga2/api/packages/_api/b334a484-3d9e-41b2-b491-1e899fd6af91/conf.d/downtimes/$problematicDowntime.conf gets deleted and recreated after every load?

I'm looking at the code but don't understand how this happens after the Object Registration.

The issue goes away after a reload("systemctl reload icinga2" command).

@jhou4 can you please collect all the debug logs from both instances when this exact situation happens again? (or, if you have them already).

Here's a safe place to upload, no-one except you and us can see the files. Thanks. https://nextcloud.icinga.com/index.php/s/eLDRTLyKLeA2mEG

It may be a little challenging to collect debug logs due to the fast rate it grows.
But I can try logrotate. Will do some testing. Just want to confirm, you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?

It looks like we can't rotate debug.log without restarting icinga2 service. The disk space I have available can hold a few hours' logs. That means I will have to restart icinga2 every few hours, which is not very practical, especially considering this issue only happens once or twice a month.
Any suggestions on how to collect the debug logs?

you want the debug logs from one endpoint in each zone, not the two endpoints in the same zone, correct?

All of them, if possible.

It looks like we can't rotate debug.log without restarting icinga2 service.

There's a special signal which is sent by logrotated - Icinga then knows to release the file handles while logrotate already moved the files.

/bin/kill -USR1 $(pidof icinga2) 2> /dev/null || true

Btw, anyone who experiences this issue, is invited to upload the debug logs to the nextcloud URL. The more logs are there, the better the error can be analysed.

That works. Thanks.
I set it up on all our Icinga instances. Now just need to wait until it happens again.

We got lucky, the issue happened at 01:28 this AM. I uploaded two debug logs(one from the config master, the other one was from a satellite in the affected zone) for now due to the size. The uploaded files are showing a "!" sign after finishing. Not sure if it is normal. Please check.
Also, let me know if you need logs from other icinga nodes.
Thanks.

Didn't work unfortunately. How big is the size of the compressed logs?

The compressed files were 1.3G/0.5G. I tried again but still got the same thing, it took a while, then gave the "!" sign but didn't say what was wrong though.

Ok, we had a limit there, it is now raised and should work. Sorry for the trouble.

I just uploaded the file again, result was still the same so I am not sure if it worked. Please check. Thanks.

Didn't work, and we re-tested our nextcloud instance with a dummy file of 1.9 GB. Seems your connection causes trouble here. Do you have the chance to put it somewhere safe and send me the URL to michael.[email protected]?

Company network may have firewall blocking the upload indeed. I am looking into this.

Sorry for the delay. I am still trying to get an OK for sharing the data.

Replay log investigation

  • Master M1 (config master) and Master M2 in zone master.
  • M2 is down (killed icinga2)
  • schedule-downtime action for M1
  • Replay log does not contain anything referring to the Downtime
  • Downtime is synced through config sync

With comments (acknowledgement) only the set-acknowledgement and clear-acknowledgement events are logged. They are relevant to the Checkable they belong to but not the comment.

Time traveling

  • M2 is now 5 minutes in the future
  • schedule-downtime on M1
  • M1 sends to M2
  • M2 reads and compiles Downtime
  • M2 sends the same update-config event back to M1
  • M1 ignores it because M2s more future time is not used, the event is sent back as is with M1s timestamp

Next steps

The issue is either with the config sync or the events sent during runtime. I'll investigate under which circumstances we are sending an UpdateObject event, maybe we are sending one when we shouldn't which then easily could be faulty and confusing.

I ran into the bug and was able to extract some information with the help of @dnsmichi

Circumstances

Master Master setup, henceforth named M1 and M2. M1 is the config master. Both instances are 2.10.5 + a few debug log messages patched in. They are not under much load and have ntpd active. M2 crashed with a segfault (my mistake) previously to the bug occurring.

Timeline

  • Created Ackcomment over API on M1
[2019-07-17 11:04:11 +0200] information/ApiListener: New client connection from [127.0.0.1]:35686 (no client certificate)
[2019-07-17 11:04:11 +0200] information/HttpServerConnection: Request: POST /v1/actions/acknowledge-problem?host=horst (from [127.0.0.1]:35686), user: root)
[2019-07-17 11:04:11 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying 'event::SetAcknowledgement' message. Write log is 1
[2019-07-17 11:04:11 +0200] information/HttpServerConnection: HTTP client disconnected (from [127.0.0.1]:35686)
  • Received by M2
[2019-07-17 11:04:11 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740a9600' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740aaee0' message from '0x7f2b74001870' to 'master' Want log: 0 need log: 0
[2019-07-17 11:04:11 +0200] critical/ApiListener: Relaying '0x7f2b740012b0' message from '0x7f2b74001910' to 'master' Want log: 1 need log: 0
  • Restarted M2
  • M2 relays a ClearAcknowledgement event, probably caused by the Host recovering
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
        jsonrpc = "2.0"
        method = "event::ClearAcknowledgement"
        params = {
                host = "horst"
        }
        ts = 1563354586.692563
}' message from '0x7f3da8001280' to 'master' Want log: 1 need log: 0
  • Both instances start bouncing 2 CreateObject events followed by a DeleteObject event and creating the object again
    M1:
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
....

M2:
Starts with a single Create followed by Delete before switching to two creates

[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
        jsonrpc = "2.0"
        method = "config::UpdateObject"
        params = {
                config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time =
 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
                modified_attributes = {
                }
                name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
                original_attributes = [ ]
                type = "Comment"
                version = 1563354251.591259
        }
        ts = 1563354586.626437
}' message from '0' to 'master' Want log: 0 need log: 1
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
        jsonrpc = "2.0"
        method = "config::DeleteObject"
        params = {
                name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
                type = "Comment"
                version = 1563354251.591259
        }
        ts = 1563354586.693694
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
        jsonrpc = "2.0"
        method = "config::UpdateObject"
        params = {
                config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
                modified_attributes = {
                }
                name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
                original_attributes = [ ]
                type = "Comment"
                version = 1563354251.591259
        }
        ts = 1563354586.713520
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
        jsonrpc = "2.0"
        method = "config::UpdateObject"
        params = {
                config = "object Comment \"291e38d4-c273-4013-bfdf-d2233a51360d\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563354251.591243\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563354251.591259\n\tzone = \"master\"\n}\n"
                modified_attributes = {
                }
                name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
                original_attributes = [ ]
                type = "Comment"
                version = 1563354251.591259
        }
        ts = 1563354586.718544
}' message from '0x7f3da8001640' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] critical/ApiListener: Relaying '{
        jsonrpc = "2.0"
        method = "config::DeleteObject"
        params = {
                name = "horst!291e38d4-c273-4013-bfdf-d2233a51360d"
                type = "Comment"
                version = 1563354251.591259
        }
        ts = 1563354586.718717
}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 11:09:46 +0200] information/ConfigObjectUtility: Created and activated object 'horst!291e38d4-c273-4013-bfdf-d2233a51360d' of type 'Comment'.
...

Note that the version stays the same but the timestamp changes.

The issue resolved when we tried to attach gdb on M1 to the process and thereby stopped execution:

[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::UpdateObject' message. Write log is 0
[2019-07-17 11:40:25 +0200] critical/ApiListener: Relaying 'config::DeleteObject' message. Write log is 0
[2019-07-17 11:42:54 +0200] information/Application: We jumped forward in time: 151.2 seconds
[2019-07-17 11:42:54 +0200] warning/JsonRpcConnection: API client disconnected for identity 'icinga2-centos7-dev.vagrant.demo.icinga.com'
[2019-07-17 11:42:54 +0200] warning/ApiListener: Removing API client for endpoint 'icinga2-centos7-dev.vagrant.demo.icinga.com'. 0 API clients left.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.
[2019-07-17 11:42:54 +0200] warning/TlsStream: TLS stream was disconnected.

While this was happening the file in question was deleted and an created all the time (M2):

[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
ls: cannot access lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf: No such file or directory
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf
[root@icinga2-centos7-dev var]# ls lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst\!291e38d4-c273-4013-bfdf-d2233a51360d.conf
lib/icinga2/api/packages/_api/8f06cf47-5301-4111-994e-d754b4ec7fba/conf.d/comments/horst!291e38d4-c273-4013-bfdf-d2233a51360d.conf

Conclusion

While I don't think datetime offset between the servers has anything to do with this, this is certainly a timing issue. The version (which did not change) is the most important change indicator, the check for it may be problematic.
Other than that the existence of two CreateObject event is notable. One has the instance it self as origin, the other that of the remote. It would be interesting to know if they are the same ones as on the other master

Next Steps

I will play around with the versions a bit, maybe I'm able to force the bug this way. If so it will go a long way to finally put this issue to rest.

I am glad to see that we are getting closer to the root cause of this issue. Sorry I couldn't provide the full debug logs. But maybe I can extract some info from the logs if that helps. Just let me know. Thanks.

Another run in with this + more log info:

[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3430480957}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
        (0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'

[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3733160496}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Discarding config update for object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47': Object version 1563370041.845422 is more recent than the received version 1563370041.845422.
[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295},"ts":1563370102.3735001087}' message from '0x7f76b0023380' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'

[2019-07-17 15:28:22 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","type":"Comment","version":1563370041.8454220295},"ts":1563370102.3739030361}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 15:28:22 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"f7c9e42a-63ca-4e8e-92e3-b6e8462dae47\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563369981.845314\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563370041.845422\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47","original_attributes":[],"type":"Comment","version":1563370041.8454220295}
[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is active
Context:
        (0) Activating object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'

[2019-07-17 15:28:22 +0200] information/ConfigObjectUtility: Created and activated object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' of type 'Comment'.

It's flipping between activation and deactiviation, triggering the activeChanged event as Michi suspected. In this case I had the following patches compiled in in hopes of forcing this (there is no reliable way to reproduce this yet)

+++ b/lib/remote/configobjectutility.cpp
@@ -89,6 +89,8 @@ String ConfigObjectUtility::CreateObjectConfig(const Type::Ptr& type, const Stri

        /* update the version for config sync */
        allAttrs->Set("version", Utility::GetTime());
+       allAttrs->Set("version", Utility::GetTime()+60);
+

+++ b/lib/remote/apilistener-configsync.cpp
@@ -191,7 +195,7 @@ Value ApiListener::ConfigUpdateObjectAPIHandler(const MessageOrigin::Ptr& origin
        }

        /* keep the object version in sync with the sender */
-       object->SetVersion(objVersion, false, origin);
+       object->SetVersion(objVersion+2, false, origin);

And hank you @jhou4 for your help so far. I think we have a good idea of where this is coming from but since we still can't reproduce this as needed, keep an eye out for this bug. Anything regarding the circumstances of this is interesting, eg. does it also happen during runtime or shutdown? Have other objects that Downtimes and Comments been affected etc.

The main question is - why would SetVersion() generate a Delete event. That's not yet clear to me.

[2019-07-17 15:28:22 +0200] critical/DEBUG: CUOH: object 'horst!f7c9e42a-63ca-4e8e-92e3-b6e8462dae47' is inactive and Extension is set to:1
C

Trace this event down. Maybe it is related to the reload handling.

Since it's only Downtimes and Comments (Ack comments in the cases I observed) it is also possible this happens as intended. Icinga starts, Host is checked and up again, Acknowledgement deleted.

Logs should prove that ;-)

35vhk1

In this case I had two acknowledgements on the Host, so double the fun. This also makes me think it's not something specific to that one object but tied to the ClearAcks event.
My workflow to semi reliably reproduce this:

  1. Config: object Host "horst" { check_command = "dummy"; check_interval=10s }
  2. Process Check result for "horst" to be down
  3. Acknowledge "horst"
  4. Restart a bunch

M2 (restarted a bunch):

[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::ClearAcknowledgement","params":{"host":"horst"},"ts":1563375558.5422809124}' message from '0x7fe470000fb0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'

[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'

[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5436320305}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","type":"Comment","version":1563375351.7715198994},"ts":1563375558.5441069603}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::SetNextCheck","params":{"host":"horst","next_check":1563375568.0342299938},"ts":1563375558.5444951057}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"event::CheckResult","params":{"cr":{"active":true,"check_source":"dfws","command":null,"execution_end":1563375556.9986350536,"execution_start":1563375556.9986350536,"exit_status":0.0,"output":"Check was successful.","performance_data":[],"schedule_end":1563375556.9986488819,"schedule_start":1563375556.998208046,"state":0.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":2.0,"state_type":0.0}},"host":"horst"},"ts":1563375558.5446588993}' message from '0x7fe4700017a0' to 'master' Want log: 1 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
        (0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5641889572}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"ts":1563375558.5649349689}' message from '0x7fe470001db0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
        (0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.

M1 (left running):

[2019-07-17 16:59:18 +0200] warning/ApiListener: Ignoring config update for zone 'master' because we have an authoritative version of the zone's config.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_tim
e = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375351.771520\n\tzone = \"master\"\n}\n"
,"modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
Context:
        (0) Activating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.5496549606}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"1077e401-e578-4096-9f87-afcd85ac5e2e\" ignore_on_
error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375291.771488\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563
375351.771520\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!1077e401-e578-4096-9f87-afcd85ac5e2e","original_attributes":[],"type":"Comment","version":1563375351.7715198994},"t
s":1563375558.550219059}' message from '0x7f5fa0017fe0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e': Object version 1563375351.771520 is more recent than the receiv
ed version 1563375351.771520.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081}
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
Context:
        (0) Activating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'

[2019-07-17 16:59:18 +0200] information/ConfigObjectUtility: Created and activated object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5557129383}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is active
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::UpdateObject","params":{"config":"object Comment \"ed1a05e3-a9e2-48dc-8edb-91758dbbc75c\" ignore_on_error {\n\tauthor = \"Bob\"\n\tentry_time = 1563375498.211203\n\tentry_type = 4.000000\n\texpire_time = 0.000000\n\thost_name = \"horst\"\n\tpersistent = false\n\ttext = \"foo\"\n\tversion = 1563375558.211277\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","original_attributes":[],"type":"Comment","version":1563375558.2112770081},"ts":1563375558.5573120117}' message from '0x7f5fa0017ae0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c': Object version 1563375558.211277 is more recent than the received version 1563375558.211277.
[2019-07-17 16:59:18 +0200] critical/ApiListener: Received update for object: {"config":"object Downtime \"ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379\" ignore_on_error {\n\tauthor = \"baum\"\n\tcomment = \"tree\"\n\tconfig_owner = \"\"\n\tduration = 0.000000\n\tend_time = 1563370435.000000\n\tentry_time = 1563370075.939388\n\tfixed = true\n\thost_name = \"horst\"\n\tscheduled_by = \"\"\n\tstart_time = 1563370075.000000\n\ttriggered_by = \"\"\n\tversion = 1563370135.939455\n\tzone = \"master\"\n}\n","modified_attributes":{},"name":"horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379","original_attributes":[],"type":"Downtime","version":1563370135.9394550323}
[2019-07-17 16:59:18 +0200] critical/ApiListener: Discarding config update for object 'horst!ce5ba6e5-74fd-4ddb-a53f-ae0e23f3f379': Object version 1563370135.939455 is more recent than the received version 1563370135.939455.
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c' of type 'Comment'
[2019-07-17 16:59:18 +0200] critical/ApiListener: Relaying '{"jsonrpc":"2.0","method":"config::DeleteObject","params":{"name":"horst!ed1a05e3-a9e2-48dc-8edb-91758dbbc75c","type":"Comment","version":1563375558.2112770081},"ts":1563375558.5586109161}' message from '0' to 'master' Want log: 0 need log: 0
[2019-07-17 16:59:18 +0200] critical/DEBUG: CUOH: object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' is inactive and Extension is set to:1
Context:
        (0) Deactivating object 'horst!1077e401-e578-4096-9f87-afcd85ac5e2e' of type 'Comment'

@Crunsher - In the past with version 2.10.2, this issue(when it happened) always starts shortly after a reload. With version 2.10.4, it started on the primary master or satellites without an obvious reload(although I am not able to verify if the second master actually crashed right before that).
I have only seen this behavior with downtimes, not other objects.

@Crunsher I don't exactly see what the logs are telling here. Our main focus should be on why a Comment/Downtime would be deleted thus following the create and delete again. Obviously it has to happen in the same moment, with a delay (stop or gdb attached process) it stops immediately. This is why I do think that it is related to the object version and activation handling in this specific scope.

We should also try whether the version-patch works to trigger the Downtime loop.

@jhou4 how is it going with the debug logs on your side?

The removal of the Acknowledgement is the trigger, something similar probably has to happen for downtimes. I had a downtime while testing this but only the Ack comment would be duplicated and only when the ClearAcknowledgement Event was involved.

Since this issue is about downtimes, we should try to reproduce it with this type as well. This points to specific object activation/deactivation and our handlers with receiving/denying config updates based on their version.

@dnsmichi There is some internal concern with sharing the logs. Very likely we will have to scrub the logs before uploading. I am still trying to get it approved. Sorry about that.

Reproducing this with Downtimes has turned out more complicated. Analogous to Comments I think this is tied to the automatic deletion, in this case the Expire handler.
With Downtimes it runs on a timer and not on demand, this makes getting to that cirtical point harder to predict.
Something I did observe is:

  • M1 expires and deletes downtime but is restarted at the same time
  • M2 reconnects to M1
  • M1 has the Downtimes again for a brief moment before deleting them

I think this is close to the issue but it's missing an ingredient X, possibly high load or a unfinished startup.

Cancelling a downtime via remove-downtime should possibly have the same impact. Otherwise you might just inject some code which automatically expires the downtime after X seconds to have this predictable in your tests.

I can now very reliably reproduce this with Acknowledgements.

The following happens:

  1. Acknowledgement C1 is created and synced
  2. M2 is shut down
  3. Host recovers, C1 deleted on M1, ClearAcknowledgement event written to replay log
  4. M2 is started again, config is read, UpdateConfig event is sent for C1
  5. M1 has C1 deleted, So C1 is created, UpdateConfig is sent for C1
  6. Meanwhile M2 reads the ClearAck event and deletes C1. Sends Delete event
  7. M1 Creates C1, sends UpdateConfig...

You get where I'm going with this.
See also, this handy diagram:
bam

This is further supported by making the replay log on M1 read only suppressing the bug. What makes it weird is that Downtimes do not have an event like that.

Final remarks before I sign off for the weekend (So I remember what's going on on Monday):

  • Can be reproduced with Ack Comments
  • Downtimes probably caused by different trigger but same effect
  • Both Comments and Downtimes use an ExpireTimer to handle deletion

Might be related: #7094

We just uploaded debug logs from two masters and two satellites.
The files have about 10 minutes' logs around the time when the issue started. Please take a look.
Sorry for the delay.

Thanks. I've shared it with the responsible team members, @Crunsher will come to you with more details asap.

The issue has gotten very long, so here is a summary.

The Problem

Downtimes and Comments would be created repeatedly, spamming the log and IDO history tables.

The Environment

Only zones with more than one endpoint are affected. Further the Comments and Downtimes are removed/expired. I have seen this issue with versions of icinga 2.10.x upwards.

How to reproduce

The easiest way to reproduce this by patching the code to force a version mismatch and have the following Host in a zone with two Endpoints

object Host "horst" {
    check_command = "dummy"
    check_interval=10s
    max_check_attempts = 1
}

Then, with both instances running, have the Host go down via the API and acknowledge the problem:

curl -k -s -u root:icinga -H 'Accept: application/json' \
 -X POST 'https://localhost:5665/v1/actions/process-check-result' -d \
'{"type": "Host", "filter": "host.name==\"horst\"", "exit_status": 1, "plugin_output": "Host is not available."}'
curl -k -s -u root:icinga -H 'Accept: application/json' \
 -X POST 'https://localhost:5665/v1/actions/acknowledge-problem' -d \
'{"type": "Host", "filter": "host.name==\"horst\"", "author": "admin", "comment": "Ok"}'

Now start restarting one of the masters until the problem occurs. The aim is to restart just when the other master checks the Host, recovers and clears the acknowledgement.

What's this about the version

The object version is a timestamp used to ignore changes to objects which we have a newer version. The patch is only there to trigger the beginning of the loop. Using a timestamp is a good first filter for objects, but considering things may go wrong under high load a newer timestamp might not mean the object is actually different.
This does not mean the issue is with the version per se, the objects bouncing around are in the case of @jhou4 the same through the log.

So what's wrong?!

So far my money was on what I described here but the log from @jhou4 does not match this. I am missing two log messages especially:
notice/ApiListener: Processing config update for object... for updates from the other master and notice/Downtime: Removed downtime ... by RemoveDowntime triggered by the DowntimeExpireHandler.

Solution/Next Steps

I have multiple logs from different sources, all with the same result but all slightly different. This problem is not new, but it is observed by more users and more frequently now. While my handmade config loop can be reproduced with the current master, it may be worth the shot to try out the snapshop packages - if only to give us more informative logs ;)

As for our Object versioning, I have been talking to the IcingaDB team if it could be possible to move the checksum creation and comparison from the specific feature into the objects themself to be used by the ApiListener ConfigSync.

Working on this with @bobapple this week who could create a scenario where to reliably reproduce this.

Spoiler: It is not the version, but OnActiveChanged() with missing message origin object creation, in a HA master setup where the JSON-RPC queues are massively overloaded to move the Create/Delete (from expiry or cancel) after the first master's events.

icinga_downtime_loop_explained

7374 helped quite a bit with better logging from such events. This unveiled that it is solely a cluster config::updateObject loop with timing involved.

We decided to move this into 2.11 as it affects customer environments testing the RC, and we now have a reliable reproducer.

@bobapple tested my patch quite extensively in the past 24 hours.

Full Analysis

Background

Downtime and Comment creation were specific cluster events and handlers prior to 2.4. With the addition of runtime create config objects, we've unified this behaviour into a general cluster message with the config::UpdateObject method.

This sync is located in https://github.com/Icinga/icinga2/blob/master/lib/remote/apilistener-configsync.cpp

Object Create/Update/Delete

Object events include the following and are triggered from OnActiveChanged and OnVersionChanged being the ConfigObject attributes active and version.

SetActive() is called during config object activation after compiling the config item into a config object. SetVersion() is called whenever an object was modified, e.g. through a POST request against /v1/objects or when synced throughout the cluster.

Depending on the object being active, or !active and marked as deleted, specific functions are called.

They prepare the cluster messages being relayed into the current zone.

UpdateConfigObject

First checks whether the target zone should really receive the message.
Reads the config content from the _api package storage. Adds modified attributes if any.
The message is either relayed into the zone, or directly sent to a connected endpoint via JsonRpcConnection.

DeleteConfigObject

First checks whether the target zone should really receive the message.
Creates delete message including the object's version.
The message is either relayed into the zone, or directly sent to a connected endpoint via JsonRpcConnection.

Cluster Receiver API Handlers

These are registered on config::UpdateObject and config::DeleteObject message types.

https://icinga.com/docs/icinga2/snapshot/doc/19-technical-concepts/#configupdateobject
https://icinga.com/docs/icinga2/snapshot/doc/19-technical-concepts/#configdeleteobject

UpdateObject

  • Checks whether the object already exists

    • If not, creates object by fetching the config content, compiling it, committing the config item, and activating the final object. This runs without any message origin (info for later)

  • If there are modified attributes, modifies the object
  • Sets the version on the object from the received version via cluster message (This includes the message origin (hint for later)

DeleteObject

  • Checks whether the object exists. If not, LogNotice with a hint
  • Calls ConfigObjectUtility::DeleteObject() without message origin (hint for later)

Why the loop?

Cluster Routing and Loop Prevention

One fundamental part of the cluster message routing is the MessageOrigin object. This is created when a new JSON-RPC message is received. It contains

  • FromZone being extracted from the endpoint object which owns the JsonRpcConnection
  • FromClient being the JsonRpcConnection bound to the endpoint object

These attributes are checked in message receive api handlers for security access. E.g. whether a message origin is from a child zone which is not allowed, etc.

Whenever such a message is processed on the client, it may trigger additional cluster events which are sent back to other endpoints. Therefore it is key to always pass the MessageOrigin origin when processing these messages locally.

Example:

  • Client receives a CheckResult from another endpoint in the same zone, call it sender for now
  • Calls ProcessCheckResult() to store the CR and calculcate states, notifications, etc.
  • Calls the OnNewCheckResult() signal to trigger IDO updates

OnNewCheckResult() also calls a registered cluster handler which forwards the CheckResult to other cluster members.

Without any origin details, this CheckResult would be relayed to the sender endpoint again.
Which processes the message, ProcessCheckResult(), OnNewCheckResult(), sends back and so on.

That creates a loop which our cluster protocol needs to prevent at all cost.

RelayMessageOne() takes care of the routing. This involves fetching the targetZone for this message and its endpoints.

  • Don't relay messages to ourselves.
  • Don't relay messages to disconnected endpoints.
  • Don't relay the message to the zone through more than one endpoint unless this is our own zone.
  • Don't relay messages back to the endpoint which we got the message from. THIS
  • Don't relay messages back to the zone which we got the message from.
  • Only relay message to the zone master if we're not currently the zone master.
 e1 is zone master, e2 and e3 are zone members.

 Message is sent from e2 or e3:
   !isMaster == true
   targetEndpoint e1 is zone master -> send the message
   targetEndpoint e3 is not zone master -> skip it, avoid routing loops

 Message is sent from e1:
   !isMaster == false -> send the messages to e2 and e3 being the zone routing master.

With passing the origin the following condition prevents sending a message back to sender:

if (origin && origin->FromClient && targetEndpoint == origin->FromClient->GetEndpoint()) {

This message then simply gets skipped for this specific Endpoint and is never sent.

Why the loop for runtime created config objects?

CreateObject() and DeleteObject() don't forward the origin from received cluster messages.

Both function calls result in OnActiveChanged() being called again. This results in sending the cluster message again.

No origin - back to sender, being the first master which created the downtime/comment.

A working environment

m1 -> create downtime
m2 -> receive downtime, OnActiveChanged -> send back to m1
m1 -> has additional guards with object->GetVersion() comparison and also checks for existance

Cluster under stress

Whenever the JSON-RPC messages exchanged between the two master nodes are many, and the json_rpc queues are full - you can monitor this via the icinga check as well as /v1/status REST API endpoint - something weird happens.

  • m1 CREATE object, added to JSON-RPC queue
  • m2 does not receive this yet
  • m1 DELETE, e.g. from Downtime Expire timer handlers - added to JSON-RPC queue
  • m2 finally catches up after a minute or more
  • m2 CREATE object -> no origin, back to sender
  • m1 doesn't receive it yet, JSON-RPC queue still full
  • m2 DELETE object -> no origin, back to sender
  • m1 receives CREATE, processes it (downtimes in the past work) - no oirigin, back to sender
  • m1 receives DELETE, no origin, back to sender
  • m2 ...

You'll get the idea also shown in in this picture.

icinga_downtime_loop_explained

Solution

Always add the MessageOrigin when processing config::* cluster events.

The culprit is that OnActiveChanged() is called without the origin. This needs some dancing inside the code to get there - which is described below.

Trace

Delete

Value ApiListener::ConfigDeleteObjectAPIHandler(const MessageOrigin::Ptr& origin, const Dictionary::Ptr& params)

if (!ConfigObjectUtility::DeleteObject(object, true, errors, nullptr)) {

Update & Create

Value ApiListener::ConfigUpdateObjectAPIHandler(const MessageOrigin::Ptr& origin, const Dictionary::Ptr& params)

Either via

if (!ConfigObjectUtility::CreateObject(ptype, objName, config, errors, nullptr)) {

or

object->SetVersion(objVersion, false, origin);

object->SetVersion(objVersion, false, origin);

-> OnVersionChanged is not triggered at all, since it contains the origin.

OnActiveChanged

OnActiveChanged -> debug//lib/base/configobject-ti.cpp via mkclass compiler

NotifyActive() -> configobject.cpp

ConfigObject

Activate

lib/config/configitem.cpp

bool ConfigItem::ActivateItems(WorkQueue& upq, const std::vector& newItems, bool runtimeCreated, bool silent, bool withModAttrs)

Deactivate

  • void ConfigObject::StopObjects() -> Onshutdown, not deleted.

  • bool ConfigObjectUtility::DeleteObjectHelper(const ConfigObject::Ptr& object, bool cascade,
    const Array::Ptr& errors, const Array::Ptr& diagnosticInformation)

    try {
            /* mark this object for cluster delete event */
            object->SetExtension("ConfigObjectDeleted", true);
            /* triggers signal for DB IDO and other interfaces */
            object->Deactivate(true);
    

Root Cause

NotifyActive() is called without origin cookie.

This needs to be wrapped via

object->Activate(runtimeCreated, cookie) -> ActivateItems -> CreateObject
object->Deactivate(runtimeCreated, cookie)

This bug exists since v2.4.0, so nearly 4 years.

Related issues and tickets:

ref/NC/618973
ref/NC/628495
ref/NC/509037
ref/NC/506213
ref/NC/573719

@jhou4 Since you're on RHEL, I've triggered snapshot packages in our package pipeline.

https://git.icinga.com/packaging/rpm-icinga2/pipelines/4916

Once built, please test the snapshot packages.

7423 adds more logging and fixes the problem with ignored downtime objects (non-existing hosts) with always logging the creation. This ensures that troubleshooting and debugging things like this becomes easier @widhalmt @mwaldmueller

We tested 2.11.0 rpms on Redhat 7, and ran into some major issues:

  • "systemctl start/stop icinga2" commands hang/fail sometimes.
  • After icinga2 starts it is stuck and keeps show messages like:
    [2019-09-25 10:58:39 -0400] warning/HttpServerConnection: Unauthorized request: GET /v1/status
    And not the normal messages which indicating Icinga engine is working correctly.
  • Errors like:
    [2019-09-25 10:40:29 -0400] critical/ApiListener: Could not create object 'aletc-ob-990!7cabe62c-bba1-4e4c-816b-2418f2b19096':
    [2019-09-25 10:40:29 -0400] critical/ApiListener: Config package broken: Error: Cannot repair package '_api', please check the troubleshooting docs.

I suspect those were from the significant rework done in version 2.11.0. We do have our own workaround for this downtime bug in place. So we will probably wait some time before upgrading.

If you can reproduce these problems, please collect all valuable details about your environment and create separate issues for each problem, please.

Unfortunately I have a Dev icinga cluster but it doesn't have the same workload as prod so I can't re-produce the problem easily. I searched existing issues with 2.11, think the problem I ran into are similar to a few others already opened. So I will just keep checking those. Thanks.

Okidoki, that's just fine. I only want to ensure that you're not running into something we do not know, and later on, you hit exactly that even when expecting it being fixed.

Was this page helpful?
0 / 5 - 0 ratings