Icinga2: [dev.icinga.com #10435] checks start "lagging" when more than two endpoints in a zone

Created on 22 Oct 2015 · 26 Comments · Source: Icinga/icinga2

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10435

Created by penyilas on 2015-10-22 13:12:26 +00:00

Assignee: _(none)_
Status: _New_
Target Version: _(none)_
Last Update: _2016-11-09 14:52:15 +00:00 (in Redmine)_

Icinga Version: 2.3.11
Backport?: Not yet backported
Include in Changelog: 1

Hi,

After I installed 2.3.11, I re-attached the two remaining nodes to the cluster, and it looked fine at first...
But then I realized that the service checks start "lagging", i.e. the time between two service checks is longer than check_interval, sometimes 3-4 times longer.
Just to be sure, I added an NRPE date check with a 5s check_interval, and the results are the same as for the cluster heartbeat checks based on the ITL (usually 15-20s between two checks).

In a live environment (with the usual 5m check_interval), the elapsed time is sometimes more than 20 minutes.
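
A minimal sketch of how the lag described above could be quantified, assuming the execution times of a single check have already been collected (for example from the debug log). The function name `check_gaps` and the sample timestamps are hypothetical and only mirror the 15-20s gaps reported for a 5s check_interval.

```python
from datetime import datetime


def check_gaps(timestamps, check_interval):
    """Return (gap_seconds, lag_factor) for each pair of consecutive checks."""
    times = sorted(timestamps)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        gap = (later - earlier).total_seconds()
        # A lag factor of 1.0 means the check ran exactly on schedule.
        gaps.append((gap, gap / check_interval))
    return gaps


# Hypothetical execution times for a check with check_interval = 5s.
samples = [
    datetime(2015, 10, 22, 12, 0, 0),
    datetime(2015, 10, 22, 12, 0, 15),
    datetime(2015, 10, 22, 12, 0, 35),
]

for gap, factor in check_gaps(samples, check_interval=5):
    print(f"gap={gap:.0f}s, {factor:.1f}x the configured interval")
```

Any factor well above 1.0 over a longer period would match the "lagging" behaviour reported here.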

If I remove two nodes from the zone (stopping them isn't enough), everything goes back to normal...

My test environment is the same as before:
2 masters (zone master), 3 satellites (zone icinga)
The configs are the same as in https://dev.icinga.org/issues/10131

I've attached stack traces, if you need anything else, just let me know...

Attachments

master1.gdb_bt.log.gz penyilas - _2015-10-22 12:53:25 +00:00_
master2.gdb_bt.log.gz penyilas - _2015-10-22 12:53:26 +00:00_
master1-logs.tgz penyilas - _2015-10-22 12:53:28 +00:00_
slave1.gdb_bt.log.gz penyilas - _2015-10-22 12:53:28 +00:00_
slave2.gdb_bt.log.gz penyilas - _2015-10-22 12:53:28 +00:00_
slave2-logs.tgz penyilas - _2015-10-22 12:53:31 +00:00_
slave3.gdb_bt.log.gz penyilas - _2015-10-22 12:53:31 +00:00_
master2-logs.tgz penyilas - _2015-10-22 12:53:33 +00:00_
slave1-logs.tgz.part1.rar penyilas - _2015-10-22 12:58:21 +00:00_
slave1-logs.tgz.part2.rar penyilas - _2015-10-22 12:58:58 +00:00_
workqueue.PNG sudeshkumar - _2016-01-14 14:58:09 +00:00_

Relations:

relates #10435
relates #10435

area/distributed bug help wanted needs-sponsoring

Source

icinga-migration

All 26 comments

Updated by sudeshkumar on 2016-01-14 14:58:31 +00:00

File added _workqueue.PNG_

I have a related issue too. My setup is a three-node cluster in a single zone. At random, the check results of one of the nodes are not synced. I enabled the debug log and confirmed that the checks are executed but their results are not synced.

I don't see the relay message entries "notice/ApiListener: Relaying" in the debug log of the affected node when this issue happens. Going through the code, it seems the check results are pushed into m_RelayQueue but never processed. I can also see the work queue size continuously increasing on the affected node. Please find the attached screenshot.

icinga-migration on 14 Jan 2016

Updated by sudeshkumar on 2016-01-18 13:46:16 +00:00

For some reason, "m_Spawned" is already set to true before it gets assigned inside the "WorkQueue::Enqueue" method. As a result, the worker thread for the API listener relay messages is never created, and that causes the issue.

I confirmed this by adding some debug statements and using manual builds. It doesn't always happen, only sometimes after stopping and starting Icinga on one of the nodes, and I was unable to pin down the exact scenario because the result is non-deterministic. Because of this, the OOM (out-of-memory) killer sometimes kills Icinga when it consumes too much memory.

Is anybody else having the same issue? I am currently using my lab instance to test cluster performance with 6000+ hosts and 38000+ services, all using the check_dummy plugin.
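
The failure mode described above can be modelled with a small toy queue. This is only a sketch under assumptions, not Icinga's actual C++ code: the class `LazyWorkQueue` and its methods are hypothetical, and the `spawned` argument merely stands in for a flag like "m_Spawned" that is already true before any worker thread exists.

```python
import queue
import threading


class LazyWorkQueue:
    """Toy queue that starts its worker thread on the first enqueue."""

    def __init__(self, spawned=False):
        # Passing spawned=True simulates the faulty state: the flag claims
        # a worker exists although none was ever started.
        self._spawned = spawned
        self._lock = threading.Lock()
        self._items = queue.Queue()
        self.processed = []

    def enqueue(self, item):
        with self._lock:
            if not self._spawned:
                threading.Thread(target=self._worker, daemon=True).start()
                self._spawned = True
        self._items.put(item)

    def _worker(self):
        while True:
            item = self._items.get()
            self.processed.append(item)
            self._items.task_done()

    def wait_idle(self):
        # Blocks until every enqueued item has been processed.
        self._items.join()

    def backlog(self):
        return self._items.qsize()


healthy = LazyWorkQueue()
for message in range(3):
    healthy.enqueue(message)
healthy.wait_idle()
print("healthy backlog:", healthy.backlog())

stuck = LazyWorkQueue(spawned=True)
for message in range(3):
    stuck.enqueue(message)
print("stuck backlog:", stuck.backlog())
```

In the stuck case nothing ever drains the queue, so the backlog grows with every enqueued message, which would be consistent with the growing work queue size and the memory usage reported here.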

Please help me to resolve this.

icinga-migration on 18 Jan 2016

Updated by mfriedrich on 2016-02-24 22:21:30 +00:00

Status changed from _New_ to _Feedback_

Please re-test this with 2.4.3.

icinga-migration on 24 Feb 2016

Updated by penyilas on 2016-03-03 12:16:55 +00:00

dnsmichi wrote:

> Please re-test this with 2.4.3.

Hi,

I've tested it with CentOS 7 / Icinga 2.4.3, and it's still not working.
With a 3-node slave zone, the "check_interval = 5s" checks appear to run every 15s.

icinga-migration on 3 Mar 2016

Updated by mfriedrich on 2016-03-04 15:54:15 +00:00

Parent Id set to _11313_

icinga-migration on 4 Mar 2016

Updated by mfriedrich on 2016-04-14 10:13:33 +00:00

Status changed from _Feedback_ to _New_

icinga-migration on 14 Apr 2016

Updated by mfriedrich on 2016-07-28 16:14:45 +00:00

We've encountered a similar problem with 4 endpoints in a zone. Our current suggestion is to use 2 endpoints for now, until a proper investigation and fix can happen.

icinga-migration on 28 Jul 2016

Updated by mfriedrich on 2016-07-28 16:15:06 +00:00

Relates set to _11948_

icinga-migration on 28 Jul 2016

Updated by mfriedrich on 2016-08-19 07:09:50 +00:00

Priority changed from _Normal_ to _High_

icinga-migration on 19 Aug 2016

Updated by mfriedrich on 2016-11-09 14:52:15 +00:00

Parent Id deleted ~~11313~~

icinga-migration on 9 Nov 2016

Updated by mfriedrich on 2017-01-09 15:29:31 +00:00

Relates set to _13861_

icinga-migration on 9 Jan 2017

Any news?

jkroepke on 8 Feb 2017

Would also like to see an update on this.

If you simply want to have a group of 'clients' or '_workers_' in a pool, you can't. This seems like the simplest of cluster setups which should easily be supported?

iDemonix on 31 May 2017

👍3

It should. If you can help us fix the issue, i.e. by looking into the current message loop and proposing a fix, or granting us time to look into it, we can speed things up here.

dnsmichi on 31 May 2017

I haven't contributed much before, but I can take a look through the stack traces above and try to correlate what is going on. Unfortunately, C is not my strong point.

iDemonix on 31 May 2017

How is the status here?

SimonHoenscheid on 27 Oct 2017

👎1 👍1

Hello dear core devs and a happy new year 2018!

while evaluating monitoring "topology" for my private setup I tried (and failed) to reproduce this. (I2 2.6 on Debian 9)

Just for inspiration: What about testing "dis-meshing" the respective zones (by zones.conf)? E.g.:

A -> D <- B
C -> D

instead of

A <-> B <-> C <-> D <-> A <-> C
B <-> D
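
To make the difference between the two layouts above concrete, here is a small sketch that counts the endpoint-to-endpoint links in each. The helper names `full_mesh_links` and `star_links` are hypothetical, and the node names A to D are taken from the diagrams; this says nothing about how Icinga itself establishes connections.

```python
from itertools import combinations


def full_mesh_links(endpoints):
    """Every endpoint is linked to every other endpoint."""
    return list(combinations(sorted(endpoints), 2))


def star_links(endpoints, hub):
    """Every endpoint is linked only to the hub."""
    return [(node, hub) for node in sorted(endpoints) if node != hub]


nodes = ["A", "B", "C", "D"]
mesh = full_mesh_links(nodes)
star = star_links(nodes, hub="D")

print(f"full mesh: {len(mesh)} links")
print(f"star around D: {len(star)} links")
```

With n endpoints, the full mesh needs n(n-1)/2 links while the star needs n-1, so the "dis-meshed" variant has far fewer links for messages to travel over.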

Best,
AK

Al2Klimov on 7 Jan 2018

So this has been open for 3.5 years now. I'm new to using Icinga, but is the workaround here to make a separate zone and endpoint for each server I want monitored, or else look at NRPE or similar?

I'm still trying to get a sense of how to deploy a simple setup with a master and several clients that can do more than just basic external checks, but I thought the idea was to have a zone for all clients that aren't the master.

relrod on 17 May 2019

👍1

Our workaround was simply to shift the bulk of our monitoring (Linux + services) from Icinga2/graphite to Prometheus. We still use Icinga2 for network devices (several thousand), but we just run standalone boxes, or simple 2-node clusters (1 in each geo-location), all of which write to a shared graphite cluster.

My experience is that it's easier to just build one big Icinga2 box and separate out things like MySQL to somewhere highly performant. Clustering for load balancing is a bit of a pain, and if you want to have more than 2, the bug in this thread stops it working. It'd be nice to see it fixed some day so you can just add workers to a pool (or spin up a container) and voila, but we'll see if anyone wants to sponsor it.

iDemonix on 17 May 2019

> thought the idea was to have a zone for all clients that aren't the master.

No, that won't work by design. Zones exist for High availability amongst zone members, and to separate specific tasks into roles - master, satellites, clients/agents. Agreed that it is cumbersome that a client would need a single zone and endpoint, but unless someone comes up with a better solution for this, it will be the one thing you need in the future.

> So this has been open for 3.5 years now.

The thing I don't like in issues is when people tell about the age of issues which implicitly blames developers. It doesn't really matter whether a ticket has an age of one month or five years - if there are no solutions, no-one willing to work on this, nor any support requirements, nor any sponsors really requiring it then issues like this won't receive much love. Consider that Icinga is open source software, not something you'll pay for.

Anyhow, I am aware of the problem, I know that it is somewhere hidden in our routing algorithm. Heck, I haven't found it yet, neither did my colleagues.

If you have more details or a reliable test setup (Vagrant, Docker, etc.) where this always happens, and you can provide all the debug logs, gdb backtraces and insights to work on a fix, please do so. You can also dive into the code; I've recently improved the development docs even more. If you want us developers to do it, kindly request a quote for sponsoring.

> It'd be nice to see it fixed some day so you can just add workers in to a pool (or spin up a container)

Having this issue fixed won't enable you to spin up 10 endpoints in a zone. By design, all these endpoints need to communicate with each other, and they will balance the checks amongst them. While it should work, a general pool of "dumb workers" is not what's built into the cluster design with using one binary with different roles defined by configuration and zone trees.

If you want something like that, this needs a more fine-grained approach, e.g. disabling all features except for checker/api and then optimizing this again for speed and better round-robin / balancing algorithms.

The feature request with check groups in #7160 moves into this direction for example. That being said, this issue and the idea of pooling/grouping is known but no-one is actively working on a concept nor a PoC at the moment.

Cheers,
Michael

dnsmichi on 17 May 2019

👍1

Besides, for load-balancing purposes it's very easy with Icinga2 to set up Icinga2 clients as dedicated "checker satellites". Such a client would indeed only have the checker/api feature active and act as command_endpoint for certain service checks.
E.g. we had a memory problem with the check_wmi_plus plugin, so we couldn't run all our Windows service checks on the Icinga2 server. Instead we dedicated one Icinga2 client on separate hosts to one Windows service check (cpu, paging, disk, etc.). So it's absolutely no problem to horizontally scale out the load to Icinga2 clients/satellites.
Regarding high availability it would be indeed nice to have the possibility to have more than two nodes in an Icinga2 server cluster but a two node cluster (which works perfectly fine) is already a quite good HA solution.

Corbyn on 17 May 2019

👍1

> Anyhow, I am aware of the problem, I know that it is somewhere hidden in our routing algorithm. Heck, I haven't found it yet, neither did my colleagues.

Can you tell me where (in which file/files) the routing algorithm is located?

tkoeck on 12 Jun 2019

lib/remote - apilistener, jsonrpc and partially lib/icinga - clusterevents*

dnsmichi on 18 Jun 2019

I'm running a setup with a Master zone with 1 node managing a slave zone of 10 satellites for running plugins and it works.

DisSsha on 9 Sep 2019

Meanwhile, the routing has been documented at https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#cluster-message-routing

dnsmichi on 5 Feb 2020

Hi, is there any news about this issue, please? (2 masters, many zones, and more than 2 satellites per zone.)
Thanks!

AurelienFo on 3 Dec 2020
