This issue has been migrated from Redmine: https://dev.icinga.com/issues/10435
Created by penyilas on 2015-10-22 13:12:26 +00:00
Assignee: _(none)_
Status: _New_
Target Version: _(none)_
Last Update: _2016-11-09 14:52:15 +00:00 (in Redmine)_
Icinga Version: 2.3.11
Backport?: Not yet backported
Include in Changelog: 1
Hi,
After I installed 2.3.11, I re-attached the two remaining nodes to the cluster, and it looked fine at first...
But I've realized that the service checks start "lagging", i.e. the time between two service checks is more than check_interval, sometimes 3-4 times more.
Just to be sure, I added an NRPE date check with a 5s check_interval, and the results are the same as for the cluster heartbeat checks based on ITIL (usually 15-20s between two checks).
In a live environment (with the usual 5m check_interval) the elapsed time is sometimes more than 20 minutes.
If I remove the two nodes from the zone (stopping them isn't enough), everything goes back to normal...
My test environment is the same as before:
2 masters (zone master), 3 satellites (zone icinga)
The configs are the same as in https://dev.icinga.org/issues/10131
I've attached stack traces; if you need anything else, just let me know...
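For anyone trying to quantify the lag described above, here is a minimal sketch that compares the gaps between consecutive check executions against check_interval. The timestamps are hypothetical sample values rather than data from this report, and `check_gaps` is an illustrative helper, not part of Icinga.

```python
from datetime import datetime


def check_gaps(timestamps, check_interval):
    """Return (gap_seconds, gap / check_interval) for each consecutive pair."""
    times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return [(gap, gap / check_interval) for gap in gaps]


# Hypothetical execution times of a check configured with check_interval = 5s.
samples = [
    "2015-10-22 13:00:00",
    "2015-10-22 13:00:15",
    "2015-10-22 13:00:35",
]

results = check_gaps(samples, check_interval=5)
for gap, ratio in results:
    print(f"gap={gap:.0f}s ({ratio:.1f}x check_interval)")
```

A ratio that stays well above 1.0 for a given check would match the "3-4 times more" behaviour reported here.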
Updated by sudeshkumar on 2016-01-14 14:58:31 +00:00
I have a related issue too. My setup is a three-node cluster in a single zone. At random, the check results of one of the nodes are not synced. I have enabled the debug log and confirmed that the checks are executed but the check results are not synced.
I don't see relay message entries ("notice/ApiListener: Relaying") in the debug log of the affected node when this happens. Going through the code, it seems the check results are pushed into m_RelayQueue but not processed. I can also see the work queue size keep increasing on the affected node. Please find the attached screenshot.
Updated by sudeshkumar on 2016-01-18 13:46:16 +00:00
For some reason, "m_Spawned" is already set to true before it is assigned inside the "WorkQueue::Enqueue" method. As a result, the worker thread for the API listener relay messages is not created, and that causes the issue.
I confirmed this by adding debug statements to manual builds. It doesn't always happen, only sometimes after stopping and starting Icinga on one of the nodes, and I was unable to pin down the exact scenario because the result is non-deterministic. Because of this, the OOM (out-of-memory) killer sometimes kills Icinga because it consumes too much memory.
Is anybody else having the same issue? I am currently using my lab instance to test cluster performance with 6000+ hosts and 38000+ services, all using the check_dummy plugin.
Please help me resolve this.
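The failure mode described in this comment can be pictured with a toy model. This is only a sketch under stated assumptions: `LazyWorkQueue` is a hypothetical Python class, not Icinga's C++ `WorkQueue`, and its `spawned` argument merely plays the role that `m_Spawned` is described as having above. If the flag is already true on the first enqueue, no worker is ever started and items pile up.

```python
import queue
import threading


class LazyWorkQueue:
    """Toy queue that starts its worker thread on the first enqueue."""

    def __init__(self, spawned=False):
        # `spawned` stands in for m_Spawned; passing True simulates the flag
        # being set before any worker thread exists.
        self._spawned = spawned
        self._items = queue.Queue()
        self.processed = []

    def enqueue(self, item):
        self._items.put(item)
        if not self._spawned:
            self._spawned = True
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            item = self._items.get()
            self.processed.append(item)
            self._items.task_done()

    def wait(self):
        # Block until every enqueued item has been processed.
        self._items.join()

    def pending(self):
        return self._items.qsize()


healthy = LazyWorkQueue()
stuck = LazyWorkQueue(spawned=True)
for i in range(100):
    healthy.enqueue(i)
    stuck.enqueue(i)

healthy.wait()
print("healthy pending:", healthy.pending())
print("stuck pending:", stuck.pending())
```

In this model the stuck queue only ever grows, which is consistent with the increasing work queue size and memory use reported above, though it does not prove that this is the actual cause.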
Updated by mfriedrich on 2016-02-24 22:21:30 +00:00
Please re-test this with 2.4.3.
Updated by penyilas on 2016-03-03 12:16:55 +00:00
dnsmichi wrote:
> Please re-test this with 2.4.3.
Hi,
I've tested it with CentOS 7 / Icinga 2.4.3, and it's still not working.
With a 3-node slave zone, it looks like the "check_interval = 5s" checks are running every 15s.
Updated by mfriedrich on 2016-03-04 15:54:15 +00:00
Updated by mfriedrich on 2016-04-14 10:13:33 +00:00
Updated by mfriedrich on 2016-07-28 16:14:45 +00:00
We've encountered a similar problem with 4 endpoints in a zone. Our current suggestion is to use 2 endpoints for now, until a proper investigation and fix happen.
Updated by mfriedrich on 2016-07-28 16:15:06 +00:00
Updated by mfriedrich on 2016-08-19 07:09:50 +00:00
Updated by mfriedrich on 2016-11-09 14:52:15 +00:00
Updated by mfriedrich on 2017-01-09 15:29:31 +00:00
Any news?
Would also like to see an update on this.
If you simply want to have a group of 'clients' or '_workers_' in a pool, you can't. This seems like the simplest of cluster setups which should easily be supported?
It should. If you can help us fix the issue, i.e. by looking into the current message loop and proposing a fix, or granting us time to look into it, we can speed things up here.
I haven't contributed much before, but I can look through the stack traces above and try to correlate what is going on. Unfortunately, C is not my strong point.
What is the status here?
Hello dear core devs and a happy new year 2018!
While evaluating a monitoring "topology" for my private setup, I tried (and failed) to reproduce this (Icinga 2 2.6 on Debian 9).
Just for inspiration: What about testing "dis-meshing" the respective zones (by zones.conf)? E.g.:
A -> D <- B
C -> D
instead of
A <-> B <-> C <-> D <-> A <-> C
B <-> D
Best,
AK
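To make the difference between the two layouts above concrete, here is a small sketch that counts the connections in each. The node names A to D come from the diagram; `full_mesh` and `star` are illustrative helpers, and the counts say nothing about how Icinga routes messages internally.

```python
from itertools import combinations


def full_mesh(nodes):
    """Every node is connected to every other node."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}


def star(hub, leaves):
    """Every leaf is connected to the hub only."""
    return {frozenset((hub, leaf)) for leaf in leaves}


nodes = ["A", "B", "C", "D"]
mesh_links = full_mesh(nodes)
star_links = star("D", ["A", "B", "C"])

print("full mesh links:", len(mesh_links))  # 6
print("star links:", len(star_links))       # 3
```

A full mesh of n nodes has n(n-1)/2 links, so the number of connections grows quadratically with the number of endpoints, while the "dis-meshed" layout grows linearly.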
So this has been open for 3.5 years now. I'm new to using Icinga, but is the workaround here to make a separate zone and endpoint for each server I want monitored, or else look at NRPE or similar?
I'm still trying to get a sense of how to deploy a simple setup with a master and several clients that can do more than just basic external checks, but I thought the idea was to have a zone for all clients that aren't the master.
Our workaround was simply to shift the bulk of our monitoring (Linux + services) from Icinga2/graphite to Prometheus. We still use Icinga2 for network devices (several thousand), but we just run standalone boxes, or simple 2-node clusters (1 in each geo-location), all of which write to a shared graphite cluster.
My experience is that it's easier to just build one big Icinga2 box, and separate out things like MySQL to somewhere highly performant. Clustering for load balancing is a bit of a pain, and if you want to have more than 2 nodes, the bug in this thread stops it working. It'd be nice to see it fixed some day so you can just add workers in to a pool (or spin up a container) and voila, but we'll see if anyone wants to sponsor it.
> thought the idea was to have a zone for all clients that aren't the master.

No, that won't work by design. Zones exist for high availability amongst zone members, and to separate specific tasks into roles - master, satellites, clients/agents. Agreed that it is cumbersome that a client would need a single zone and endpoint, but unless someone comes up with a better solution for this, it will be the one thing you need in the future.
> So this has been open for 3.5 years now.

What I don't like in issues is when people point out the age of an issue, which implicitly blames the developers. It doesn't really matter whether a ticket is one month or five years old: if there are no solutions, no one willing to work on it, no support requirements, and no sponsors really requiring it, then issues like this won't receive much love. Consider that Icinga is open source software, not something you pay for.
Anyhow, I am aware of the problem, I know that it is somewhere hidden in our routing algorithm. Heck, I haven't found it yet, neither did my colleagues.
If you have more details or a reliable test setup (Vagrant, Docker, etc.) where this always happens, and you can provide all the debug logs, gdb backtraces and insights needed to work on a fix, please do so. You can also dive into the code; I've recently improved the development docs even more. If you want us developers to do it, kindly request a quote for sponsoring.
> It'd be nice to see it fixed some day so you can just add workers in to a pool (or spin up a container)
Having this issue fixed won't enable you to spin up 10 endpoints in a zone. By design, all these endpoints need to communicate with each other, and they will balance the checks amongst them. While it should work, a general pool of "dumb workers" is not what's built into the cluster design with using one binary with different roles defined by configuration and zone trees.
If you want something like that, it needs a more fine-grained approach, e.g. disabling all features except checker/api and then optimizing for speed and better round-robin / balancing algorithms.
The feature request with check groups in #7160 moves into this direction for example. That being said, this issue and the idea of pooling/grouping is known but no-one is actively working on a concept nor a PoC at the moment.
Cheers,
Michael
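As a rough picture of what "balance the checks amongst them" in the comment above can mean, here is a minimal sketch of deterministic, hash-based assignment of checks to the endpoints of one zone. This is an assumption made for illustration and not a description of Icinga's routing code: the endpoint names, the check names and the use of CRC32 are all hypothetical.

```python
import zlib
from collections import Counter


def assign_endpoint(check_name, endpoints):
    # Assumption: a deterministic hash of the object name picks the owner,
    # so every node computes the same result without extra coordination.
    ordered = sorted(endpoints)
    index = zlib.crc32(check_name.encode("utf-8")) % len(ordered)
    return ordered[index]


endpoints = ["sat1", "sat2", "sat3"]  # hypothetical endpoints in one zone
checks = [f"host{i}!ping4" for i in range(3000)]  # hypothetical check names

load = Counter(assign_endpoint(check, endpoints) for check in checks)
for endpoint in sorted(load):
    print(endpoint, load[endpoint])
```

In a scheme like this every endpoint must agree on the list of endpoints in the zone, which is one reason all members need to talk to each other rather than act as independent "dumb workers".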
Besides, for load balancing purposes it's very easy with Icinga2 to set up Icinga2 clients as dedicated "checker satellites". Such a client would indeed only have the checker/api feature active and act as command_endpoint for certain service checks.
E.g. we had a memory problem with the check_wmi_plus plugin, so we couldn't run all our Windows service checks on the Icinga2 server. Instead we dedicated one Icinga2 client on separate hosts to one Windows service check (cpu, paging, disk, etc.). So it's absolutely no problem to horizontally scale out the load to Icinga2 clients/satellites.
Regarding high availability it would be indeed nice to have the possibility to have more than two nodes in an Icinga2 server cluster but a two node cluster (which works perfectly fine) is already a quite good HA solution.
> Anyhow, I am aware of the problem, I know that it is somewhere hidden in our routing algorithm. Heck, I haven't found it yet, neither did my colleagues.
Can you tell me where (in which file/files) the routing algorithm is located?
lib/remote - apilistener, jsonrpc and partially lib/icinga - clusterevents*
I'm running a setup with a Master zone with 1 node managing a slave zone of 10 satellites for running plugins and it works.
Meanwhile, the routing has been documented at https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#cluster-message-routing
Hi, is there any news about this issue please? (2 masters, many zones and more than 2 satellites per zone)?
Thanks!