Icinga2: Evaluate smart node reconnect

Created on 18 Apr 2018 · 10 Comments · Source: Icinga/icinga2

Ongoing task, issue will be updated.

Supersedes #5588

Expected Behavior


Have a smart way of reconnecting a node without the need for a statically defined interval. Lower it to 1 sec, test with 5 sec, 10 sec, or 30 sec on a dynamic basis. Don't let nodes that fail all the time attempt to connect that often.

Current Behavior


Currently a node reconnects at a fixed interval of 60 seconds, because lowering the interval may harm the performance of an instance with many child node connections. When the master reloads many times due to config deployments, this becomes a problem: it can take 60 seconds until the connection is re-established.

Possible Solution


  • Lower the reconnect interval in a smart way
  • Throttle clients which cannot reconnect, e.g. when TLS handshakes fail more than X times, or the endpoint object is missing
  • Let the user know by logging and optionally provide metrics via API
  • connection rates and limits (?)
  • tests in large environments
TBD · area/distributed · enhancement · queue/wishlist
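
The throttling idea from the list above could look roughly like the following. This is only a sketch in Python, not Icinga 2 code: the class `ReconnectThrottle`, its parameters (`fast`, `slow`, `max_failures`), and the `metrics()` method are all hypothetical names chosen for illustration. It counts consecutive failures per endpoint, switches to a slow interval once an endpoint has failed more than X times, logs that switch, and exposes the counters so they could be reported via an API.

```python
import logging
from collections import defaultdict

log = logging.getLogger("reconnect")


class ReconnectThrottle:
    """Hypothetical helper: choose a reconnect interval per endpoint
    based on how often it has failed in a row."""

    def __init__(self, fast: float = 1.0, slow: float = 60.0, max_failures: int = 5):
        self.fast = fast                  # interval for healthy endpoints (assumed value)
        self.slow = slow                  # interval for throttled endpoints (assumed value)
        self.max_failures = max_failures  # the "X" from the issue text
        self.failures = defaultdict(int)  # endpoint name -> consecutive failures

    def record_failure(self, endpoint: str) -> None:
        self.failures[endpoint] += 1
        # Log once, at the moment the endpoint crosses the threshold.
        if self.failures[endpoint] == self.max_failures + 1:
            log.warning("Throttling reconnects to '%s' after more than %d failures",
                        endpoint, self.max_failures)

    def record_success(self, endpoint: str) -> None:
        # A successful connection resets the endpoint to the fast interval.
        self.failures.pop(endpoint, None)

    def interval(self, endpoint: str) -> float:
        if self.failures.get(endpoint, 0) > self.max_failures:
            return self.slow
        return self.fast

    def metrics(self) -> dict:
        """Snapshot of failure counters, e.g. for exposing via an API."""
        return dict(self.failures)
```

Whether failures such as a missing endpoint object should be throttled immediately rather than after X attempts is left open here, as it is in the issue.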

All 10 comments

Assigning to 2.9 as ongoing task, but no promises here.

We have an Icinga client which is connected via a GSM connection, so the upload (and download) bandwidth is very limited. Additionally, we have outages of several hours where the GSM modem cannot get a connection at all (~1 per month).

I am not sure which is the best way to integrate these "slow" nodes.

Low bandwidth is not part of this issue; you are looking for #3387. This issue is merely to refine the current static 60s reconnect interval and evaluate possible routes.

I suggest going for some type of backoff: by default, try to connect every 1s, and when that doesn't work, double the interval each time until it reaches about 60s: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 64s, 64s

This is actually a big issue in our setup: we have frequent config reloads on the master with big configs (lots of objects), and an icinga2 reload can take more than 1 min. A client may miss this and then has to wait another 1 min, resulting in 2 min of a down client (we use cluster-zone to check that the host is up).
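
The doubling schedule suggested in this comment can be sketched in a few lines of Python. The function name `reconnect_delays` and its parameters are made up for illustration and are not part of Icinga 2; the defaults (start at 1s, cap at 64s) are taken from the sequence listed in the comment.

```python
from itertools import islice
from typing import Iterator


def reconnect_delays(initial: int = 1, cap: int = 64) -> Iterator[int]:
    """Yield the wait (in seconds) before each reconnect attempt,
    doubling after every failed attempt and never exceeding `cap`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


# First nine delays with the defaults.
schedule = list(islice(reconnect_delays(), 9))
print(schedule)  # [1, 2, 4, 8, 16, 32, 64, 64, 64]
```

With a strict 60s ceiling (`cap=60`), the tail of the sequence would be 32, 60, 60 instead of 32, 64, 64. A caller would restart the generator after a successful connection so that the next outage begins at 1s again.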

This is something for after 2.9, there's more work underway for the remaining issues.

This is blocked by socket IO handling problems, and possible leaks. More on that matter soon.

Just a stupid question of mine (came up while re-thinking my own Icinga 2 cluster topology at home):

Why regular reconnects at all? And why one-way at all?

Either an agent has to wait for a connection from above before it can deliver its last check result, or the latest config changes and check-now requests aren't applied ad hoc because the agent doesn't reconnect soon enough.

My suggestion:

  1. Ad-hoc connections, i.e. if I have pending messages for agent XY, I just connect.
  2. Two-way connection between master and agent, i.e. Endpoint#host (and Endpoint#port) are configured on both sides and whoever connects first to the other side... first come – first serve. Even if there are two connections at the same time – so what. Request on connection one, response on connection two. This shouldn't be much trouble.

This is blocked by #6517.

Why do we need this at all?

It is not necessary that both the master and the client node establish two connections to each other. Icinga 2 will only use one connection and close the second connection if established.

– https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#endpoint-connection-direction

I'd like to see how the network stack rewrite works in the wild, prior to investing time and tests with an improved algorithm here. Since there's popular demand for releasing 2.11 soon, I am rescheduling this for 2.12.
