We're facing an issue with failed config reloads due to incomplete syncs of our global-templates zone.
This only happens in a zone with a second level of satellites (Master -> Satellite -> Satellite). Satellites without children are not affected.
We can only fix this by purging /var/lib/icinga2/api/zones and /var/lib/icinga2/api/zones-stage on the satellites.
This only happens on object creation or deletion. Changes to already existing objects do not trigger this issue.
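For reference, the purge workaround described above could be scripted roughly as follows. This is only a minimal sketch: the function name purge_synced_zones is made up for illustration, and the default path is the one mentioned in this report. It empties both directories but keeps the directories themselves, so ownership and permissions stay as they are.

```shell
#!/bin/sh
# Hypothetical helper (not part of Icinga 2): empty the synced and staged
# zone directories below the given API directory.
purge_synced_zones() {
    api_dir="${1:-/var/lib/icinga2/api}"   # default taken from this report
    for sub in zones zones-stage; do
        dir="$api_dir/$sub"
        # Nothing to do if the directory does not exist on this node.
        [ -d "$dir" ] || continue
        # Delete everything inside (including dotfiles), keep the directory.
        find "$dir" -mindepth 1 -delete
    done
}
```

On a satellite this would presumably be wrapped between stopping and starting the daemon, e.g. `systemctl stop icinga2`, then `purge_synced_zones`, then `systemctl start icinga2` (assuming a systemd host such as the RHEL 7 system in this report), so that the parent can sync the zones again.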
Error Message Example:
Error: Function call 'opendir' for file '/var/lib/icinga2/api/zones-stage//global-templates/_etc/credentials' failed with error code 2, 'No such file or directory'
Working Config sync and reload on all Icinga nodes
icinga2 --version: icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.2-1)
Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: Red Hat Enterprise Linux Server
Platform version: 7.7 (Maipo)
Kernel: Linux
Kernel version: 3.10.0-1062.1.1.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: runner-LTrJQZ9N-project-322-concurrent-0
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var
Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
icinga2 feature list:
Disabled features: compatlog debuglog elasticsearch gelf graphite influxdb livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api checker command mainlog
icinga2 daemon -C:
[2020-01-08 11:47:29 +0100] information/cli: Icinga application loader (version: 2.11.2-1)
[2020-01-08 11:47:29 +0100] information/cli: Loading configuration file(s).
[2020-01-08 11:47:29 +0100] information/ConfigItem: Committing config item(s).
[2020-01-08 11:47:29 +0100] information/ApiListener: My API identity: sattelite.domain.com
... (only some apply rules without matches on this satellite)
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 743 Dependencies.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 8 NotificationCommands.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 2722 Notifications.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 173 HostGroups.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 213 Hosts.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 32 Downtimes.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 5 Comments.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 4 Zones.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 6 Endpoints.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 6 UserGroups.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 307 CheckCommands.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 8 TimePeriods.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 11 Users.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1017 Services.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 35 ServiceGroups.
[2020-01-08 11:47:32 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2020-01-08 11:47:32 +0100] information/cli: Finished validating the configuration file(s).
zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes: N/A
Please also share the output of ls -lahR /var/lib/icinga2/api/zones-stage/ of the affected satellite host.
The attached file is from one of our "middle" satellites (from the "aws-frankfurt-satellite" zone). I hope it is enough and helps. Let me know if you need the output from all 4 affected satellites. (Anonymizing is always a little bit difficult.)
We were facing the same issue: https://community.icinga.com/t/global-configuration-zone-missing-check-commands/2976/6
Should you need more debugging data, we would be happy to switch our config sync back to Icinga2 and send you logfiles.
Note: The example error message implies an unexpectedly changed FS tree, but we definitively lock /var/lib/icinga2/api/zones-stage exclusively, so only one changes it at a time.
Hello @Clasko and thank you for reporting!
Independent of this issue, you should upgrade to v2.11.3 to avoid a lot of other trouble.
Best,
AK
@Clasko Please could you test v2.11.3 + #7917: https://git.icinga.com/packaging/rpm-icinga2/-/jobs/45459 / "Job artifacts" / "Download"
I'm on vacation the next 2 weeks. I will see if a colleague can do the testing.
If they can't and the artifacts disappear, just let me know when you're about to run the tests and I'll re-create the artifacts.
@Clasko I've made an even better fix:
I've tested with 2.12.0.rc1.2.g38f3108 but the issue is sadly still there.
[2020-03-30 12:55:53 +0200] critical/ApiListener: Config validation failed for staged cluster config sync in '/var/lib/icinga2/api/zones-stage/'. Aborting. Logs: '/var/lib/icinga2/api/zones-stage//startup.log'
I will try to catch the startup.log. Not sure why this log file is often (but not always) missing.
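One way to catch a short-lived startup.log could be a small polling loop that copies the file as soon as it shows up. The following is only a sketch under assumptions: the function name catch_startup_log and its parameters are hypothetical, the default source path is taken from the log line above, and the default fractional polling interval needs a sleep that accepts fractions (GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: poll for a file and copy it away once it exists.
# Arguments: source file, destination directory, number of polls, interval.
catch_startup_log() {
    src="${1:-/var/lib/icinga2/api/zones-stage/startup.log}"
    dest_dir="${2:-/tmp}"
    tries="${3:-3000}"
    interval="${4:-0.2}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if [ -f "$src" ]; then
            # Timestamped copy so that repeated runs do not overwrite each other.
            dest="$dest_dir/startup.log.$(date +%Y%m%d-%H%M%S)"
            if cp -p "$src" "$dest"; then
                echo "$dest"    # print where the copy was stored
                return 0
            fi
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1    # file never appeared within the polling window
}
```

It would be started on the affected satellite shortly before triggering a config deployment on the master; a polling loop can still miss a file that exists for less than one interval.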
Do you now get only that error message or also e.g. Error: Function call 'opendir' for file '/var/lib/icinga2/api/zones-stage//global-templates/_etc/credentials' failed with error code 2, 'No such file or directory'?
Btw. you could try to catch the startup.log by blocking all :5665 traffic to all nodes except one parent node.
Or even better: I've added 2eff3050c to #7936 not to let the startup.log disappear. At the moment we're working on RPMs for you.
I've tested with the latest RPMs. Now I get [2020-04-02 16:38:58 +0200] critical/config: Error: Import references unknown template: 'ldc-host-windows-nrpe'.
But the templates are definitely synced:
[root@satellite]# grep ldc-host-windows-nrpe /var/lib/icinga2/api/zones-stage/global-templates/_etc/templates/customer/hosts_templates.conf
template Host "ldc-host-windows-nrpe" {
But I don't see Error: Function call 'opendir' for file... errors anymore.
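To see at a glance whether a template definition exists in the active tree, the staged tree, or both, the grep above can be generalized a little. This is a rough sketch: find_template is a made-up helper name, the default base path comes from this report, and it only matches definitions written exactly as `template Host "<name>"`.

```shell
#!/bin/sh
# Hypothetical helper: search both synced trees for a Host template definition.
# Prints file:line:match for every hit; returns 0 if found, 1 otherwise.
find_template() {
    name="$1"
    base="${2:-/var/lib/icinga2/api}"
    found=1
    for sub in zones zones-stage; do
        [ -d "$base/$sub" ] || continue
        # -F: treat the pattern as a fixed string, -r: recurse, -n: line numbers.
        if grep -rnF --include='*.conf' "template Host \"$name\"" "$base/$sub"; then
            found=0
        fi
    done
    return "$found"
}
```

For example, `find_template ldc-host-windows-nrpe` on the affected satellite would list every file under zones and zones-stage that defines the template.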
Did you upgrade all of the nodes to the same version? If not, please share the Icinga 2 versions of all nodes in both the zone of the affected node and all parent zones.
Also please share the output of find /var/lib/icinga2/api/zones* -name .authoritative.
Also: Which zones do you have config for and in which dir on the affected node?
ls /etc/icinga2/zones.d /var/lib/icinga2/api/zones*
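The two commands requested above can be bundled into one helper so the same output is collected on every node. This is just a convenience sketch: collect_sync_diagnostics is a hypothetical name, and the default paths are the ones used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print .authoritative markers and the zone directories
# found in the local config dir and in the synced/staged API directories.
collect_sync_diagnostics() {
    base="${1:-/var/lib/icinga2/api}"
    conf="${2:-/etc/icinga2/zones.d}"
    echo "== .authoritative markers =="
    find "$base"/zones* -name .authoritative 2>/dev/null | sort
    echo "== zone directories =="
    for dir in "$conf" "$base"/zones*; do
        # An unmatched glob or a missing directory is skipped here.
        [ -d "$dir" ] || continue
        echo "$dir:"
        ls -A "$dir" | sed 's/^/  /'
    done
}
```

Running `collect_sync_diagnostics > "$(hostname).txt"` on the master and on each affected satellite would give files that are easy to compare (and to anonymize before posting).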
I've upgraded all (from my point of view) affected nodes.
Which means:
Master: 2.11.3-1
Affected Satellites under master: 2.12.0-rc1-3-g2eff305
2.12.0-rc1-3-g2eff305
The issue currently occurs only on these 4 satellites (2 HA zones). We have other satellites as children of our master which are not affected by this issue. These satellites are on version 2.11.2-1.
Output of the find command on our config master. The output on the affected satellites is empty:
/var/lib/icinga2/api/zones/global-templates/.authoritative
/var/lib/icinga2/api/zones/aws-frankfurt-satellite/.authoritative (affected zone)
/var/lib/icinga2/api/zones/customer1-satellite-nes/.authoritative
/var/lib/icinga2/api/zones/customer2-satellite-kar/.authoritative
/var/lib/icinga2/api/zones/config-ha-master/.authoritative
/var/lib/icinga2/api/zones/customer3-satellite/.authoritative (affected zone)
/var/lib/icinga2/api/zones/customer4-satellite/.authoritative
/var/lib/icinga2/api/zones/customer5-satellite/.authoritative
I cannot share non-anonymized output of our zone names on GitHub, as it contains customer names.
We are placing the configs directly in /etc/icinga2/zones.d/customer3-satellite. "customer3-satellite" is a child of "aws-frankfurt-satellite" (which is a child of "config-ha-master").
Sorry if my answers are a bit confusing due to anonymizing my outputs. I can provide raw output if you can provide me a Nextcloud filedrop link, or via NETWAYS ticket #663455 if this helps.
/etc/icinga2/zones.d?
to 1) I know and follow this rule on stable releases, but I'm a bit careful with snapshots or even RPMs built directly from the master branch in a production environment if not absolutely necessary. I will try to reproduce the issue in our test environment, but I had no luck in the past. I will reconsider if my next attempts fail again.
to 3) No, none of our satellites has local configurations in /etc/icinga2/zones.d
3: Fine. One point of failure fewer.
1: Snapshots are RPMs directly from the master branch. But my RPMs are neither of those. I know customers' stability requirements and you can fully trust me: If I say "These packages contain version X + PR Y", the packages won't contain a single line of code more.
I've upgraded our two masters to 2.12.0-rc1-3-g2eff305 and now I'm unable to reproduce the issue! Looks good so far, thank you! :)
ref/NC/663455
There have been three problems: