Ongoing task, issue will be updated.
Supersedes #5588
Have a smart way of reconnecting a node without the need for a defined static interval. Lower it to 1 sec, test with 5 sec, 10 sec, or 30 sec on a dynamic basis. Don't let nodes that fail all the time attempt to connect that often.
Currently a node reconnects at a given interval of 60 seconds. This is due to the fact that lowering the interval may harm the performance of an instance with many child node connections. In situations where the master reloads many times from config deployments, this causes a problem with a wait time of 60 seconds until the connection is re-established.
Assigning to 2.9 as ongoing task, but no promises here.
We have an Icinga client that is connected via GSM, so the upload (and download) bandwidth is very limited. Additionally, we have outages of several hours (~1 per month) where the GSM modem cannot get a connection at all.
I am not sure which is the best way to integrate these "slow" nodes.
Low bandwidths are not part of this issue, that's #3387 you are looking for. This issue is merely to refine the current static 60s reconnect interval and evaluate possible routes.
I suggest going for some type of backoff: by default try to connect every 1s, and when that's not working, double the interval every time until it reaches about 60s: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 64s, 64s
This is actually a big issue in our setup, where we have frequent config reloads on the master with big configs (lots of objects). An icinga2 reload can take more than 1 min, so a client may miss it and has to wait another 1 min, resulting in 2 min of a down client (we use cluster-zone to check host up).
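For illustration, the backoff suggested above could be sketched roughly like this in Python. This is not Icinga 2 code: `backoff_intervals`, `reconnect_with_backoff`, `try_connect` and the parameter names are made up for the example, and the 64s cap simply follows the sequence listed above.

```python
import time
from itertools import islice


def backoff_intervals(initial=1, factor=2, cap=64):
    """Yield wait times in seconds: initial, then multiplied by factor, capped at cap."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def reconnect_with_backoff(try_connect, sleep=time.sleep, max_attempts=10):
    """Call try_connect() until it returns True, waiting longer after each failure.

    try_connect is a hypothetical callable standing in for the real connection
    attempt. Returns the number of attempts used, or None if every attempt failed.
    """
    intervals = islice(backoff_intervals(), max_attempts)
    for attempt, delay in enumerate(intervals, start=1):
        if try_connect():
            return attempt
        sleep(delay)  # wait before the next attempt
    return None


print(list(islice(backoff_intervals(), 9)))
# [1, 2, 4, 8, 16, 32, 64, 64, 64]
```

A new generator is created on every call to `reconnect_with_backoff`, so the wait time starts at 1s again after a successful connection. A master reload would then be picked up within a few seconds, while an endpoint that keeps failing is retried no more often than roughly once a minute.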
This is something for after 2.9, there's more work underway for the remaining issues.
This is blocked by socket IO handling problems, and possible leaks. More on that matter soon.
Just a stupid question of mine (it came up while rethinking my own Icinga 2 cluster topology at home):
Why regular reconnects at all? And why one-way at all?
Either an agent has to wait for a connection from above to deliver its last check result, or the latest config changes and check-now requests aren't applied ad hoc because the agent doesn't connect soon enough.
My suggestion:
Endpoint#host (and Endpoint#port) are configured on both sides and whoever connects first to the other side... first come – first serve. Even if there are two connections at the same time – so what. Request on connection one, response on connection two. This shouldn't be much trouble.
This is blocked by #6517.
Why do we need this at all?
It is not necessary that both the master and the client node establish two connections to each other. Icinga 2 will only use one connection and close the second connection if established.
– https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#endpoint-connection-direction
I'd like to see how the network stack rewrite works in the wild, prior to investing time and tests with an improved algorithm here. Since there's popular demand for releasing 2.11 soon, I am rescheduling this for 2.12.