Icinga2: Verify and fix flapping detection

Created on 7 Feb 2017 · 11 Comments · Source: Icinga/icinga2

This involves:

  • unit tests (proposed in #4150)
  • proper fix (partially available in #118)
  • documentation including how it works now
  • further tests

All remaining tickets are subject to being closed; this is the only working ticket.

  • #4150
  • #4746
  • #4243
  • #4903
  • #4898
  • #4899
Labels: area/notifications, area/tests, bug, queue/important

All 11 comments

Please consider reordering your tasks. I'd suggest first specifying the expected behavior and then checking whether it really works as expected. "documentation including how it works now" sounds the other way round ;-)

The list was in no particular order. I now have failing unit tests (which require code changes too), patches which partially fix the original problem, and the desire to have it documented properly. I'll start with a review of how it should work, of course ;)

Forgot an issue: #4899

@Thomas-Gelf
"documentation including how it works now" exists in PR #118

The branch fix/flapping-4982 contains the current review state. Unfortunately the fix from #118 does not fully fix the issue, and the added tests are failing.

@dnsmichi Just wondering if there is any progress on making flap detection work in Icinga? Is there anything we can do to help resolve the issue?

Sorry to ask. I would also be interested in how this is progressing.

Alright, after looking into it a bit yesterday I found the algorithm to be working as expected (kinda):

  • The flapping percentage degrades too slowly: needing 15-20 checks with the same result to leave flapping is too much
  • Possibly the error is not in our flapping detection but in how the result is handled
  • Hard/Soft state changes are kinda weird
  • At least one of the tests is borked

We should also think about using the "old" flapping, which looks at the last 20 state changes only.
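To make the "last 20 state changes" idea concrete, here is a minimal Python sketch of a windowed flapping detector. It is not Icinga 2's actual code: the class name `FlappingWindow`, the method names, and the 30%/25% high/low thresholds are assumptions chosen for illustration only.

```python
from collections import deque


class FlappingWindow:
    """Windowed flapping detector (illustrative sketch, not Icinga 2's code)."""

    def __init__(self, window=20, high=30.0, low=25.0):
        # window, high and low are assumed values, not confirmed defaults.
        self.window = window
        self.high = high
        self.low = low
        # One entry per check: True if that check changed the state.
        self.changes = deque(maxlen=window)
        self.last_state = None
        self.flapping = False

    def percentage(self):
        # Share of the window that consists of state changes.
        return 100.0 * sum(self.changes) / self.window

    def process(self, state):
        # A new object has no previous state, so its first check
        # is counted as a state change.
        self.changes.append(state != self.last_state)
        self.last_state = state
        pct = self.percentage()
        # Two thresholds give hysteresis: enter above high, leave below low.
        if not self.flapping and pct > self.high:
            self.flapping = True
        elif self.flapping and pct < self.low:
            self.flapping = False
        return self.flapping
```

In this sketch, a window saturated with state changes needs 16 consecutive identical results to fall below the 25% low threshold, so the window size and the thresholds directly determine how quickly an object leaves flapping.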

Go for it.

The pull request can be merged imo.

Things I tested:

  • Boost tests for: a stable object, a flapping object, and the example from the docs
  • Tested FlappingStart and FlappingEnd notifications

Because of the way we handle new objects, the first check will always result in a state change, but this should not be an issue in production.

See documentation for further information.
