Icinga2: Remote Icinga instance 'agent1.example.com' is not connected to 'master.example.com' after deploy

Created on 29 Oct 2019  路  22Comments  路  Source: Icinga/icinga2

Describe the bug

When deploying new configs from director, the master is shutdown and the 'api' listener stopped resulting in tons of UNKNOWN status from the agents (Remote Icinga instance 'agent1.example.com' is not connected to 'master.example.com').

To Reproduce

This is a master with agents setup. The majority of checks are run on the agents.

  1. Press "deploy" from director or launch #kill -HUP /run/icinga2/icinga2.pid
  2. #tail -f /var/log/icinga2/icinga2.log on master
  3. #tail -f /var/log/icinga2/icinga2.log on agent1

extract of master log

...
2019-10-29 14:27:19 +0100] information/HttpServerConnection: HTTP client disconnected (from [master IP]:54000)
[2019-10-29 14:27:21 +0100] information/Application: Received request to shut down.
[2019-10-29 14:27:21 +0100] information/Application: Shutting down...
[2019-10-29 14:27:22 +0100] information/ApiListener: New client connection from [master IP]:54016 (no client certificate)
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: Request: GET /v1/ (from [master IP]:54016), user: director, agent: ).
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: Request: GET /v1/config/packages (from [master IP]:54016), user: director, agent: ).
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: Request: GET /v1/config/stages/director/bfdbfadb-1bd2-4c26-98ce-c97584593619 (from [master IP]:54016), user: director, agent: ).
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: Request: GET /v1/config/files/director/bfdbfadb-1bd2-4c26-98ce-c97584593619/status (from [master IP]:54016), user: director, agent: ).
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: Request: GET /v1/config/files/director/bfdbfadb-1bd2-4c26-98ce-c97584593619/startup.log (from [master IP]:54016), user: director, agent: ).
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: Request: GET /v1/config/packages (from [master IP]:54016), user: director, agent: ).
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: Request: GET /v1/config/packages (from [master IP]:54016), user: director, agent: ).
[2019-10-29 14:27:22 +0100] information/HttpServerConnection: HTTP client disconnected (from [master IP]:54016)
[2019-10-29 14:27:23 +0100] information/CheckerComponent: 'checker' stopped.
[2019-10-29 14:27:23 +0100] information/NotificationComponent: 'notification' stopped.
[2019-10-29 14:27:23 +0100] information/InfluxdbWriter: 'influxdb' paused.
[2019-10-29 14:27:23 +0100] information/DbConnection: Pausing IDO connection: ido-pgsql
[2019-10-29 14:27:23 +0100] information/IdoPgsqlConnection: Disconnected from 'ido-pgsql' database 'icinga'.
[2019-10-29 14:27:23 +0100] information/IdoPgsqlConnection: 'ido-pgsql' paused.
[2019-10-29 14:27:23 +0100] information/DbConnection: 'ido-pgsql' stopped.
[2019-10-29 14:27:23 +0100] information/ExternalCommandListener: 'command' stopped.
[2019-10-29 14:27:23 +0100] information/ApiListener: 'api' stopped.
[2019-10-29 14:27:24 +0100] information/FileLogger: 'main-log' started.
[2019-10-29 14:27:24 +0100] information/ApiListener: 'api' started.
[2019-10-29 14:27:24 +0100] information/ApiListener: Copying 1 zone configuration files for zone 'global-templates' to '/var/lib/icinga2/api/zones/global-templates'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//_etc/commands.conf
...

extract of agent1 log

...
[2019-10-29 14:27:21 +0100] information/RemoteCheckQueue: items: 0, rate: 0/s (24/min 120/5min 360/15min);
[2019-10-29 14:27:23 +0100] warning/JsonRpcConnection: API client disconnected for identity 'master.example.com'
[2019-10-29 14:27:23 +0100] warning/ApiListener: Removing API client for endpoint 'master.example.com'. 0 API clients left.
[2019-10-29 14:27:24 +0100] information/ApiListener: Reconnecting to endpoint 'master.example.com' via host 'master IP' and port '5665'
[2019-10-29 14:27:24 +0100] information/ApiListener: New client connection for identity 'master.example.com' to [master IP]:5665
[2019-10-29 14:27:24 +0100] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'master.example.com'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Sending config updates for endpoint 'master.example.com' in zone 'master'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Finished sending config file updates for endpoint 'master.example.com' in zone 'master'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Syncing runtime objects to endpoint 'master.example.com'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'master.example.com'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'master.example.com' in zone 'master'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Sending replay log for endpoint 'master.example.com' in zone 'master'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Finished sending replay log for endpoint 'master.example.com' in zone 'master'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Finished syncing endpoint 'master.example.com' in zone 'master'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Finished reconnecting to endpoint 'master.example.com' via host 'master IP' and port '5665'
[2019-10-29 14:27:24 +0100] information/ApiListener: Applying config update from endpoint 'master.example.com' of zone 'master'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Received configuration for zone 'director-global' from endpoint 'master.example.com'. Comparing the timestamp and checksums.
[2019-10-29 14:27:24 +0100] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/director-global//director/001-director-basics.conf' for zone 'director-global'.
[2019-10-29 14:27:24 +0100] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/director-global//director/commands.conf' for zone 'director-global'.
...

Expected behavior

Reloading the master configuration should not shutdown the master or at least not stop the api listener

Screenshots

Screenshot from 2019-10-29 14-54-58 - copie

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used : 2.11.2-1
  • Operating System and version: RHEL 7.7
  • Enabled features : api checker command ido-pgsql influxdb mainlog notification
  • Icinga Web 2 version and modules: 2.7.3, director 1.7.0, doc 2.7.3, fileshipper 1.1.0, grafana 1.3.5, incubator 0.5.0, ipl v0.3.0, monitoring 2.7.3, reactbundle 0.7.0
  • Config validation (icinga2 daemon -C):

[2019-10-29 15:47:09 +0100] information/cli: Icinga application loader (version: 2.11.2-1)
[2019-10-29 15:47:09 +0100] information/cli: Loading configuration file(s).
[2019-10-29 15:47:09 +0100] information/ConfigItem: Committing config item(s).
[2019-10-29 15:47:09 +0100] information/ApiListener: My API identity: master.example.com
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 23 ScheduledDowntimes.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 8 NotificationCommands.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 3767 Notifications.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 25 HostGroups.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 116 Hosts.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 EventCommand.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 23 Downtimes.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 20 Comments.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 93 Zones.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 91 Endpoints.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 4 ApiUsers.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 UserGroup.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 247 CheckCommands.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 4 TimePeriods.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 18 Users.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 987 Services.
[2019-10-29 15:47:09 +0100] information/ConfigItem: Instantiated 8 ServiceGroups.
[2019-10-29 15:47:09 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-10-29 15:47:09 +0100] information/cli: Finished validating the configuration file(s).

zones.conf

// Master endpoint/zone definitions
object Endpoint "master.example.conf" {
host = "master.example.conf"
}
object Zone "master" {
endpoints = [ "master.example.conf" ]
}

// Global zone for distributed setups
object Zone "global-templates" {
global = true
}
// Global zone for distributed Director setups
object Zone "director-global" {
global = true
}

extract of agent_endpoints.conf

object Endpoint "agent1.example.com" {
log_duration = 0s
}

Add any other context about the problem here.

aredistributed bug reIP reNC

Most helpful comment

Yes, we are currently evaluating internally how we will proceed. Get back about this within a week.

All 22 comments

This is not a bug but the default behavior of icinga.

Thanks for the link to the doc.
The handling of the api listener is not clearly detailed.
Could it be kept alive while reloading the master?

As the parent process is shutdown and the new child process takes some time to start and take over the tasks some seconds without api listener is the result.
BTW: It's going off-topic here, for discussions/questions it's recommended to head over to The Icinga community

According to our observation it takes a lot more time until everything is reconnected since 2.11.
We had to change retry_intervals and max_check_attempts of our cluster checks to tolerate a disconnect for up to 10 minutes, before 2.11 it always took less than 3 minutes.
@dfrise I would guess that maybe you just noticed this behavior now because it takes longer than before 馃

Duplicate of #7533

Should have been fixed with 2.11-2, but it seems it is not unfortunately

Can you extract the content of /var/lib/icinga2/api/log and watch the replay log being actually sent to parent instances? The files should be deleted after successful connect, maybe they are just left around. This may have an influence here.

Also, a possible workaround is to add log_duration=0 to the parent endpoint object on the agent in zones.conf. (just a guess, haven't tried it yet).

Content of /var/lib/icinga2/api/log after a kill -HUP on the master at 17:07:

$ ll
total 0
-rw-r--r--. 1 icinga icinga 0 Oct 31 17:07 current

On the agent :

$ ll
total 0
-rw-r--r--. 1 icinga icinga 0 Oct 27 16:46 1572504514
-rw-r--r--. 1 icinga icinga 0 Oct 31 07:48 1572504906
-rw-r--r--. 1 icinga icinga 0 Oct 31 07:55 current

log_duration=0 is already set

Any news on this?
Another unwanted side effect is that after a deploy, services that were acknowledged will again fire alerts because of the state change to UNKNOWN.

Yes, here also same problem. We are turning notifications off globally while deploying :S

@dnsmichi the issue in #7533 isn't resolved. At least not the issue we are experiencing. Is there something how we can help to get to the bottom of this ?

From a release perspective, this issue _is_ resolved. It targets a milestone which already is released. That's how our project and release management works.

In regard of the issue - which I consider the same here - I don't have any suggestions yet. The best bet would be to contact an Icinga partner for faster support. I'm only cleaning up the issues in this repository at the very moment, doing the (boring) project management tasks.

I think we have the same problem since 2.11 upgrade. While everything else got stable now we have quite many of these issues. (We have no HA setup, just one master and three sattelites in different zones until now.)

In our case the disonnects are not that long that the unknow state gets hard/trigger notifications, but events sticking at warning it triggers notifications then (we get notifications for anything above warning normally).
Besides that, it's bad for our reports based on the check results and difficult for the day-support that needs to keep an eye on the event-list.

So I see no ideas/no plans to mitigate this issue? :-(
I did not look into code but somehow it must be possible to postpone a scheduled check after a relad before connection to target agent is reestablished?

@Elias481 I have even contacted Netways about this. We want to sponsor to get this issue fixed. But they don't respond to my e-mail anymore, so I have no idea what is going on..

(CC: @dnsmichi)

Hi,

We're still investigating how to fix this issue. We'll keep you posted with more details once we have them.

Best,
Eric

Thanks Eric! Looking forward to your findings

Thank you Eric.
We are migrating from Xymon to Icinga and wiling to help you debug this issue if needed.

same problems here, we simply reload the icinga2 service and BOOM all seems to be disconnected and has to connect again. Sometimes also resulting in really high load on the monitoring server. It started all with latest major release afaik :)

ref/NC/659099

I have even contacted Netways about this. We want to sponsor to get this issue fixed. But they don't respond to my e-mail anymore, so I have no idea what is going on..

We're waiting for @sharkyzz

ref/IP/20817

Yes, we are currently evaluating internally how we will proceed. Get back about this within a week.

Let's proceed with this. I will formalize it by mail.

Handing this over to @htriem for the final test updates.

Was this page helpful?
0 / 5 - 0 ratings