Icinga2: Evaluate smart node reconnect

Created on 18 Apr 2018 · 10 Comments · Source: Icinga/icinga2

Ongoing task, issue will be updated.

Supersedes #5588

Expected Behavior

Reconnect a node in a smart way, without the need for a statically defined interval. Lower the interval to 1 sec and test with 5 sec, 10 sec, or 30 sec on a dynamic basis. Nodes that fail all the time should not attempt to connect that often.

Current Behavior

Currently a node reconnects at a fixed interval of 60 seconds, because lowering the interval may harm the performance of an instance with many child node connections. When the master reloads many times because of config deployments, this causes a wait time of 60 seconds until the connection is re-established.

Possible Solution

Lower the reconnect interval in a smart way
Throttle clients which cannot reconnect, e.g. when TLS handshakes fail more than X times or the endpoint object is missing
Let the user know by logging and optionally provide metrics via API
connection rates and limits (?)
tests in large environments
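
The throttling and logging ideas above could look roughly like the following Python sketch. It is only an illustration, not Icinga 2 code: the class names, the `max_failures` threshold (the "X" above), the two intervals, and the `metrics()` helper are all assumptions made for this example.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("reconnect")


@dataclass
class EndpointState:
    failures: int = 0          # consecutive failed connection attempts
    next_attempt: float = 0.0  # earliest time (in seconds) for the next attempt


class ReconnectThrottle:
    """Hypothetical helper: slows down endpoints that keep failing."""

    def __init__(self, fast_interval=1.0, slow_interval=60.0, max_failures=5):
        self.fast_interval = fast_interval  # interval for healthy endpoints
        self.slow_interval = slow_interval  # interval once an endpoint is throttled
        self.max_failures = max_failures    # the "X" from the list above
        self.endpoints = {}

    def may_connect(self, name, now):
        """Return True if the endpoint is allowed to attempt a connection now."""
        state = self.endpoints.setdefault(name, EndpointState())
        return now >= state.next_attempt

    def record_failure(self, name, now):
        """Count a failed attempt and return the interval until the next one."""
        state = self.endpoints.setdefault(name, EndpointState())
        state.failures += 1
        throttled = state.failures > self.max_failures
        interval = self.slow_interval if throttled else self.fast_interval
        state.next_attempt = now + interval
        if throttled:
            # Let the user know by logging.
            log.warning("Endpoint '%s' failed %d times, retrying in %.0fs",
                        name, state.failures, interval)
        return interval

    def record_success(self, name):
        """Reset the endpoint after a successful connection."""
        self.endpoints[name] = EndpointState()

    def metrics(self):
        """Failure counts per endpoint, e.g. for exposing via an API."""
        return {name: state.failures for name, state in self.endpoints.items()}
```

With `max_failures=3`, the first three failures of an endpoint keep the 1 second interval and every further failure falls back to 60 seconds until a connection succeeds again.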

TBD area/distributed enhancement queue/wishlist

Source

dnsmichi

Most helpful comment

I suggest going for some type of backoff: by default try to connect every 1s and when it's not working double that interval every time until it reaches 60s. 1s, 2s, 4s, 8s, 16s, 32s, 64s, 64s, 64s

This is actually a big issue in our setup, where we have frequent config reloads on the master with big configs (lots of objects). An icinga2 reload can take more than 1 min, so a client may miss it and has to wait another 1 min, resulting in 2 min of a down client (we use cluster-zone to check host up).
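
A minimal Python sketch of the doubling backoff suggested above, for illustration only. The function name `backoff_intervals` and its `base` and `cap` parameters are made up for this example. A cap of 64 reproduces the sequence listed above; a cap of 60 would clamp the last steps to 60 instead.

```python
from itertools import islice


def backoff_intervals(base=1, cap=64):
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... never above cap."""
    delay = base
    while True:
        yield min(delay, cap)
        # Double the delay after every failed attempt, but stop growing at the cap.
        delay = min(delay * 2, cap)


print(list(islice(backoff_intervals(), 9)))
# [1, 2, 4, 8, 16, 32, 64, 64, 64]
```

A real implementation would restart the generator after a successful connection, so that the next disconnect is retried after 1 second again.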

marcofl on 10 May 2018

👍3

All 10 comments

Assigning to 2.9 as ongoing task, but no promises here.

dnsmichi on 25 Apr 2018

We have an Icinga client which is connected via a GSM connection, so the upload (and download) bandwidth is very limited. Additionally, we have outages of several hours where the GSM modem cannot get a connection at all (~1 per month).

I am not sure which is the best way to integrate these "slow" nodes.

markusr on 27 Apr 2018

Low bandwidth is not part of this issue; #3387 is the one you are looking for. This issue is merely to refine the current static 60s reconnect interval and evaluate possible routes.

dnsmichi on 27 Apr 2018

This is something for after 2.9, there's more work underway for the remaining issues.

dnsmichi on 28 May 2018

This is blocked by socket IO handling problems, and possible leaks. More on that matter soon.

dnsmichi on 8 Jun 2018

Just a stupid question of mine (came up while re-thinking my own Icinga 2 cluster topology at home):

Why regular reconnects at all? And why one-way at all?

Either an agent has to wait for a connection from above to deliver the last check result, or the latest config changes and check-now requests aren't applied ad hoc as an agent doesn't connect soon enough.

My suggestion:

Ad-hoc connections, i.e. if I have pending messages for agent XY, I just connect.
Two-way connection between master and agent, i.e. Endpoint#host (and Endpoint#port) are configured on both sides and whoever connects first to the other side wins: first come, first served. Even if there are two connections at the same time, so what? Request on connection one, response on connection two. This shouldn't be much trouble.

Al2Klimov on 17 Aug 2018

This is blocked by #6517.

dnsmichi on 14 Sep 2018

Why do we need this at all?

It is not necessary that both the master and the client node establish two connections to each other. Icinga 2 will only use one connection and close the second connection if established.

– https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#endpoint-connection-direction
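
The behaviour described in the quoted sentence (use one connection, close the second if established) can be pictured with the following Python sketch. It only illustrates that sentence and is not how Icinga 2 implements it: `ConnectionRegistry`, `FakeConnection`, and the endpoint name are hypothetical.

```python
class FakeConnection:
    """Stand-in for a network connection, used only for this illustration."""

    def __init__(self, label):
        self.label = label
        self.closed = False

    def close(self):
        self.closed = True


class ConnectionRegistry:
    """Keeps at most one active connection per endpoint."""

    def __init__(self):
        self.active = {}

    def register(self, endpoint, connection):
        """Keep the first connection for an endpoint, close any later duplicate."""
        if endpoint in self.active:
            connection.close()
            return False
        self.active[endpoint] = connection
        return True

    def unregister(self, endpoint):
        """Forget the connection, e.g. after a disconnect, so a new one is accepted."""
        self.active.pop(endpoint, None)


registry = ConnectionRegistry()
outgoing = FakeConnection("master -> agent")
incoming = FakeConnection("agent -> master")
registry.register("agent1", outgoing)  # first connection is kept
registry.register("agent1", incoming)  # second connection is closed
```

Under this model it does not matter which side connects first: whichever connection is registered first stays open.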

Al2Klimov on 12 Nov 2018

I'd like to see how the network stack rewrite works in the wild, prior to investing time and tests with an improved algorithm here. Since there's popular demand for releasing 2.11 soon, I am rescheduling this for 2.12.

dnsmichi on 13 May 2019
