Elastalert: Flatline with no matches

Created on 12 Feb 2018 · 18 Comments · Source: Yelp/elastalert

Hi, I'm clearly missing something simple but:

my_rule.yaml

type: flatline
index: index-*
threshold: 1
timeframe:
  hours: 24
use_count_query: true
doc_type: doc
filter:
- query:
    query_string:
      query: "application:\"nonsense\""

I would expect elastalert-test-rule my_rule.yaml to say something like "An abnormally low number of events ..", because there are obviously no events whose application field has the value "nonsense", and there never will be. However, if I change timeframe to hours: 1, it strangely does match the rule and says "An abnormally low ...". To make sure, I created the original rule and left it running for days, but still got no alerts.
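To pin down the expectation in the question, here is a minimal sketch in plain Python of what a flatline check is expected to do: alert when fewer than threshold events fall inside the timeframe. It only illustrates the expected behaviour and is not ElastAlert's code; the function name is_flatline and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def is_flatline(event_times, now, timeframe, threshold):
    """Return True when fewer than `threshold` events fall in the window ending at `now`."""
    window_start = now - timeframe
    # Count only events inside (now - timeframe, now].
    count = sum(1 for ts in event_times if window_start < ts <= now)
    return count < threshold

# Hypothetical sample data: no events at all vs. one event two hours ago.
now = datetime(2018, 2, 12, 12, 0)
no_events = []
one_recent = [now - timedelta(hours=2)]

print(is_flatline(no_events, now, timedelta(hours=24), threshold=1))
print(is_flatline(one_recent, now, timedelta(hours=24), threshold=1))
```

With no matching documents at all, the check under this reading is true for any timeframe, which is why the 24-hour result looks surprising.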

povils

👍2

Most helpful comment

Wow, what a coincidence. We missed an outage this week because our dirt-simple flatline rule refused to fire the way the documentation says it should. Now the client is asking why we missed it.

While I appreciate the author's work and willingness to open-source elastalert, this is very frustrating.

I have wasted hours re-testing this with elastalert-test-rule, and it only fires if --start is set to more than 36 hours prior, but it won't fire when running in production.

If my lookback is set to 24 hours, my threshold is 3, and nothing has been indexed for the prior 12 hours, then it should have fired, and it should re-fire if I erase all the elastalert metadata and restart, but it doesn't. At least that is how the documentation describes it. It should be that simple.

Maybe this rule should be marked as incomplete or broken, so people don't trust it to alert them to a production system outage until it gets fixed. Because it is definitely broken.

Also, the results of running elastalert-test-rule and of actually running in production should be exactly the same.

I know the documentation mentions that the results could be different, but if that is the case then running elastalert-test-rule is worthless, because it is not a valid test unless it produces the exact same result, just like a unit test.

JungleGenius on 16 Jun 2019

👍2

All 18 comments

Same issue here. I left it running and still get no alerts unless I set the timeframe to 1-23 hours. When it is set to 24 hours, no alerts happen. My logstash indexes are in the following format, and it is probably related to that: logstash-%Y.%m.%d

high-stakes on 29 May 2018

👍1

Does it trigger an alert immediately if you add --start 2018-05-28 where that date is 24+ hours ago?

I don't think there is anything special about 24 hours exactly, but I guess it's possible. Are you sure you waited a full 24 hours and there were no matching documents?
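A small sketch for picking such a --start value: it subtracts the rule's timeframe plus a safety margin from the current time. The helper name start_argument and the one-hour margin are hypothetical, and the YYYY-MM-DDTHH:MM:SS output format is an assumption that extends the date-only style used above; check it against the ElastAlert documentation before relying on it.

```python
from datetime import datetime, timedelta

def start_argument(now, timeframe, margin=timedelta(hours=1)):
    """Build a timestamp string lying more than one full timeframe before `now`."""
    # The margin is arbitrary; it only makes sure the start is strictly older
    # than the timeframe.
    return (now - timeframe - margin).strftime("%Y-%m-%dT%H:%M:%S")

# Example: a 24-hour timeframe evaluated at noon on 29 May 2018.
print(start_argument(datetime(2018, 5, 29, 12, 0), timedelta(hours=24)))
```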

Qmando on 29 May 2018

But I left that alert running for days and got nothing. Or am I missing something?

povils on 29 May 2018

@Qmando, if I set it to 23 hours it triggers immediately; when I change it to 24 hours it no longer triggers. I waited a whole weekend for it to trigger, without success. I also remember this working in the past (same version and config), and then it stopped working. I deleted the index and metadata, without luck.

high-stakes on 30 May 2018

Are you on version 0.1.31? I'll try to reproduce this and get back to y'all.

Qmando on 30 May 2018

Not the latest version. I will upgrade tomorrow and post my rule configuration, thanks.

high-stakes on 30 May 2018

@Qmando I upgraded to 0.1.31, still having the same error

Here is my configuration:

type: flatline

index: logstash-%Y.%m.%d
use_strftime_index: true

threshold: 1
timeframe:
  minutes: 1445 # 24hours + 5 minutes
use_count_query: true

realert:
  days: 3

filter:
- query:
    match:
      type: "audit"

doc_type: "audit"

If I set timeframe to 23 hours, it works right away; with 24 hours, nothing.
I tried removing use_count_query, the realert config, and doc_type, and tried index matching like logstash-*, without any difference.

high-stakes on 30 May 2018

I can't reproduce this using your rule config:

$ python -m elastalert.elastalert --rule test.yaml --debug --start 2018-05-29
INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent.
                To send them but remain verbose, use --verbose instead.
INFO:elastalert:Starting up
INFO:elastalert:Queried rule dffgdfgd from 2018-05-28 17:00 PDT to 2018-05-28 18:00 PDT: 0 hits
.....
INFO:elastalert:Queried rule dffgdfgd from 2018-05-30 11:00 PDT to 2018-05-30 11:08 PDT: 0 hits
INFO:elastalert:Skipping writing to ES: {'rule_name': u'dffgdfgd.all', '@timestamp': '2018-05-30T18:08:43.241762Z', 'exponent': 0, 'until': '2018-06-02T18:08:43.241754Z'}
INFO:elastalert:Alert for dffgdfgd at 2018-05-30T02:00:00Z:
INFO:elastalert:dffgdfgd

An abnormally low number of events occurred around 2018-05-29 19:00 PDT.
Between 2018-05-28 18:55 PDT and 2018-05-29 19:00 PDT, there were less than 1 events.

@timestamp: 2018-05-30T02:00:00Z
count: 0
key: all
num_hits: 0
num_matches: 36

INFO:elastalert:Ignoring match for silenced rule dffgdfgd.all
...

I get the exact same thing if I try 23 hours. Is this not what you are doing? Can you show logs from when your 23-hour timeframe works right away and 24 hours doesn't?

Qmando on 30 May 2018

Hi,

So I was using elastalert-test-rule when the 23-hour timeframe setting succeeded; otherwise I see it does not work, just like with 24 hours.

1. When I use 23 or 24 hours without specifying --start:

INFO:elastalert:Queried rule [int] No xyz Logs arrive from 2018-06-01 14:03 CEST to 2018-06-01 14:06 CEST: 0 / 0 hits
INFO:elastalert:Ran [int] No xyz Logs arrive from 2018-06-01 14:03 CEST to 2018-06-01 14:06 CEST: 0 query hits (0 already seen), 0 matches, 0 alerts sent

2. When I specify --start 2018-05-30, it only SOMETIMES says it is silenced:

....
INFO:elastalert:Ignoring match for silenced rule [int] No xyz Logs arrive.all
INFO:elastalert:Ignoring match for silenced rule [int] No xyz Logs arrive.all
INFO:elastalert:Ran [int] No AdnIDM Logs arrive from 2018-05-30 02:00 CEST to 2018-06-01 14:03 CEST: 0 query hits (0 already seen), 149 matches, 0 alerts sent

3. Sometimes (randomly) it says "no matches" found instead of saying the rule is silenced (which it should not be, because there were no alerts whatsoever).

4. When I run "elastalert-test-rule rules/xyz_logs_flatline.yaml --alert --config config.yaml" with 23 hours

Would have written the following documents to writeback index (default is elastalert_status):

silence - {'rule_name': u'[int] No AdnIDM Logs arrive.all', '@timestamp': datetime.datetime(2018, 6, 1, 12, 20, 35, 719367, tzinfo=tzutc()), 'exponent': 0, 'until': datetime.datetime(2018, 6, 1, 15, 20, 35, 719357, tzinfo=tzutc())}

elastalert - {'alert_info': {'type': 'email', 'recipients': ['xyz']}, 'alert_sent': True, 'match_body': {'count': 0, 'num_hits': 0, '@timestamp': '2018-06-01T11:35:35.385220Z', 'key': 'all', 'num_matches': 4}, 'rule_name': '[int] No xyz Logs arrive', 'match_time': '2018-06-01T11:35:35.385220Z', 'alert_time': datetime.datetime(2018, 6, 1, 12, 20, 35, 719526, tzinfo=tzutc())}

elastalert_status - {'hits': 0, 'matches': 4, '@timestamp': datetime.datetime(2018, 6, 1, 12, 20, 36, 358059, tzinfo=tzutc()), 'rule_name': '[int] No xyz Logs arrive', 'starttime': datetime.datetime(2018, 5, 31, 12, 20, 35, 385220, tzinfo=tzutc()), 'endtime': datetime.datetime(2018, 6, 1, 12, 20, 35, 385220, tzinfo=tzutc()), 'time_taken': 0.9462141990661621}

5. When I run "elastalert-test-rule rules/xyz_logs_flatline.yaml --alert --config config.yaml" with 24 hours

Would have written the following documents to writeback index (default is elastalert_status):

elastalert_status - {'hits': 0, 'matches': 0, '@timestamp': datetime.datetime(2018, 6, 1, 12, 30, 5, 675743, tzinfo=tzutc()), 'rule_name': '[int] No xyz Logs arrive', 'starttime': datetime.datetime(2018, 5, 31, 12, 30, 5, 353628, tzinfo=tzutc()), 'endtime': datetime.datetime(2018, 6, 1, 12, 30, 5, 353628, tzinfo=tzutc()), 'time_taken': 0.2949259281158447}
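One possible reading of cases 4 and 5, sketched in Python: in both runs the starttime and endtime are exactly 24 hours apart. Assuming (this is not confirmed anywhere in this thread) that a flatline match is only reported once the queried span is longer than timeframe, a 23-hour timeframe can match inside that span while a 24-hour one cannot. The helper window_can_complete is hypothetical; the two timestamps are copied from the elastalert_status document in case 5.

```python
from datetime import datetime, timedelta

# starttime / endtime from the elastalert_status document in case 5.
starttime = datetime(2018, 5, 31, 12, 30, 5, 353628)
endtime = datetime(2018, 6, 1, 12, 30, 5, 353628)

def window_can_complete(queried_span, timeframe):
    """Assumption: a match needs a queried span strictly longer than the timeframe."""
    return queried_span > timeframe

span = endtime - starttime
print(span)
print(window_can_complete(span, timedelta(hours=23)))
print(window_can_complete(span, timedelta(hours=24)))
```

If that assumption holds, starting the query earlier than one timeframe ago should make the 24-hour rule match as well.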

high-stakes on 1 Jun 2018

Ty for the info. I'll take a look at this again.

Qmando on 1 Jun 2018

Has anyone found a resolution to this? My rule is set up like so:

name: {{ region_env_name }} Number of calls (5 min)
type: flatline
index: callflows-*
threshold: 1
timeframe:
    minutes: 5
use_count_query: true
doc_type: doc

filter:
    - query:
        query_string:
            query: "_exists_:CVPAppName AND CVPAppName:APP AND (CallType:7526 OR 7525)"

alert:
    - "sns"
sns_topic_arn: "SNS_ARN"

From reading the documentation, I would think flatline would send an alert because there are no hits matching the query above; however, ElastAlert is not firing.

jberto78 on 1 May 2019

Can you post logs? Run elastalert with --verbose for at least 5 minutes.

Qmando on 1 May 2019

Hi Qmando, it actually ended up working, I just didn't give it enough time for elastalert to process the alert I guess. Thanks for following up.

jberto78 on 2 May 2019

Same issue here.
pip show elastalert reports Name: elastalert, Version: 0.0.75

Upgrading to version 0.1.29 gives the same result.

```
# --- End Global Rule Configuration ---
scan_entire_timeframe: true

# --- Begin Type Specific Rule Configuration ---

type: flatline
timeframe:
  days: 7
run_every:
  minutes: 10
threshold: 1
use_count_query: true
```

No matches are found even though the event count is below the threshold.
The index pattern we are using is index: index-*
I have already checked a few things, such as buffer_time.

Running the elastalert daemon with the default configuration doesn't alert either. Other flatline alerts are working fine.

mariobede on 16 Jun 2019

Update:

python -m elastalert.elastalert --rule test.yaml --debug --start 2019-06-08 works fine, but the rule doesn't fire when it is running as a daemon.

How does --start affect the search? Is it the same as the timeframe set in the config file?

timeframe:
  days: 7

mariobede on 16 Jun 2019

name: "Testing 123 Potential System Outage"

http_post_static_payload:
  alert_name: "Testing 123 Potential System Outage"
  alert_form: "PotentialSystemOutageAlarm"

http_post_all_values: True

verify_certs: False

index: agent-publicipaddress-*

type: flatline

query_key: _index

doc_type: publicipaddress

threshold: 1