If a job and its datafeed are force-deleted right after their creation, a situation can arise in which the internal ML indices are in an inconsistent state. The inconsistency seems to result from the following race condition:
1. The force-deletion removes the job together with the index aliases that point to its results (.ml-anomalies-${jobId} and .ml-anomalies-.write-${jobId}).
2. There still seems to be an asynchronous part of the ML plugin, though, that writes job stats documents to the write alias. Since that alias no longer exists at this point, Elasticsearch auto-creates a concrete index named .ml-anomalies-.write-${jobId} to index that job stats document into.
Given that this is a result of the way that the jobs are deleted, we have no means to ensure it doesn't happen. That leaves two possible mitigations:
1. Stop the datafeed before deleting the job and its datafeed.
2. Delete the leftover concrete .ml-anomalies-.write-${jobId} index after such a deletion.
The former mitigation only prevents inconsistencies caused by the Logs UI itself, but carries little risk. The latter also handles cases in which external influences have deleted the job in such a problematic way, but requires us to delete an index, which is an operation that by its nature carries a higher risk of mistakes causing data loss.
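For illustration, here is a minimal sketch of how the inconsistent state could be detected, assuming the @elastic/elasticsearch client; the connection setup and job id handling are placeholders, not actual Logs UI code. It only checks whether the write alias name now resolves to a concrete index instead of an alias:

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder connection

// True if .ml-anomalies-.write-${jobId} exists as a concrete index (the
// inconsistent state described above) rather than as an alias.
async function hasLeftoverWriteIndex(jobId: string): Promise<boolean> {
  const name = `.ml-anomalies-.write-${jobId}`;
  const aliasExists = await client.indices.existsAlias({ name });
  const indexExists = await client.indices.exists({ index: name });
  return Boolean(indexExists.body) && !Boolean(aliasExists.body);
}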
Pinging @elastic/infra-logs-ui
There still seems to be an asynchronous part of the ML plugin, though, that writes job stats documents to the write alias
Can you remember what exactly was in this document? Was it a datafeed timing stats document?
This is a bug in the ML backend. It shouldn't behave this way. I suspect the problem is that we shouldn't be creating a datafeed timing stats document when a started datafeed is stopped as part of the force-delete process. But if the document that causes the problem is not a datafeed timing stats document then this theory is wrong and it must be something else.
@przemekwitek please keep an eye on this issue and see if there is anything that needs fixing related to the new datafeed timing stats.
That leaves two possible mitigations
Please use mitigation 1, i.e. stop the datafeed first. If you use the other option then you're breaching encapsulation by relying on an internal implementation detail of ML that may change in the future.
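In case it helps, a rough sketch of what mitigation 1 could look like with the @elastic/elasticsearch client (the ids and connection details are placeholders): stop the datafeed first, then delete the datafeed and the job without forcing.

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder connection

async function deleteJobCleanly(jobId: string, datafeedId: string): Promise<void> {
  // With the datafeed stopped first, nothing should still be writing stats
  // documents to the job's write alias while the job is being deleted.
  await client.ml.stopDatafeed({ datafeed_id: datafeedId });
  await client.ml.deleteDatafeed({ datafeed_id: datafeedId });
  await client.ml.deleteJob({ job_id: jobId });
}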
(Or maybe if we can diagnose and fix the ML backend issue quickly enough you won't need a mitigation.)
It is indeed a datafeed timing stats document. Here's the search hit I get:
{
  "_index" : ".ml-anomalies-.write-kibana-logs-ui-felix-1-default-log-entry-rate",
  "_type" : "_doc",
  "_id" : "kibana-logs-ui-felix-1-default-log-entry-rate_datafeed_timing_stats",
  "_score" : 1.0,
  "_source" : {
    "job_id" : "kibana-logs-ui-felix-1-default-log-entry-rate",
    "search_count" : 2,
    "bucket_count" : 0,
    "total_search_time_ms" : 13543.0
  }
}
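(In case it's useful for reproducing this: a search roughly like the following, using the @elastic/elasticsearch client, returns the hit above; the connection setup is a placeholder.)

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder connection

// Look up the datafeed timing stats document in the leftover concrete index.
async function findTimingStatsDoc() {
  const { body } = await client.search({
    index: '.ml-anomalies-.write-kibana-logs-ui-felix-1-default-log-entry-rate',
    body: {
      query: {
        ids: {
          values: ['kibana-logs-ui-felix-1-default-log-entry-rate_datafeed_timing_stats'],
        },
      },
    },
  });
  return body.hits.hits;
}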
ping @elastic/ml-core
Sorry, Zube wasn't showing me the earlier comments at first and so my response didn't take those into account. I deleted that response but in case anyone saw it briefly, I wanted to clarify. :)
Hopefully once elastic/elasticsearch#46485 has been backported to the 7.4 branch the underlying problem will be fixed and no mitigation will be necessary in the Logs UI. This should be backported in time for BC4. However, the mitigation of stopping the datafeed will not hurt if you still want to do it just in case elastic/elasticsearch#46485 doesn't fix the problem as expected.
Thanks for taking the time to fix it! :heart: I'll check after your PR has been merged.
Hopefully once elastic/elasticsearch#46485 has been backported to the 7.4 branch the underlying problem will be fixed and no mitigation will be necessary in the Logs UI. This should be backported in time for BC4.
FYI: It is now backported to 7.4
I couldn't reproduce it with the 7.4 BC4. There's still some uncertainty given the random nature of race conditions, but unless it can be reliably reproduced with any newer version I would consider this solved. Thanks again for the rapid fix!