If a job and its datafeed are force-deleted right after their creation, a situation can arise in which the internal ML indices are in an inconsistent state. The inconsistency seems to result from the following race condition:
1. The force-deletion removes the job together with the index aliases that point to its results (.ml-anomalies-${jobId} and .ml-anomalies-.write-${jobId}).
2. There still seems to be an asynchronous part of the ML plugin, though, that writes job stats documents to the write alias. Since that alias no longer exists at this point, Elasticsearch auto-creates a concrete index named .ml-anomalies-.write-${jobId} to index that job stats document into.
Given that this is a result of the way that the jobs are deleted, we have no means to ensure it doesn't happen. That leaves two possible mitigations:
1. Stop the datafeed before deleting the job and its datafeed.
2. Delete the leftover concrete .ml-anomalies-.write-${jobId} index after such a deletion.
The former mitigation only prevents inconsistencies caused by the Logs UI itself, but carries little risk. The latter also handles cases in which external influences have deleted the job in such a problematic way, but requires us to delete an index, which is an operation that by its nature carries a higher risk of mistakes causing data loss.
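For illustration, here is a minimal sketch of how the inconsistent state could be detected, assuming the @elastic/elasticsearch client; the connection setup and job id handling are placeholders, not actual Logs UI code. It only checks whether the write alias name now resolves to a concrete index instead of an alias:

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder connection

// True if .ml-anomalies-.write-${jobId} exists as a concrete index (the
// inconsistent state described above) rather than as an alias.
async function hasLeftoverWriteIndex(jobId: string): Promise<boolean> {
  const name = `.ml-anomalies-.write-${jobId}`;
  const aliasExists = await client.indices.existsAlias({ name });
  const indexExists = await client.indices.exists({ index: name });
  return Boolean(indexExists.body) && !Boolean(aliasExists.body);
}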
Pinging @elastic/infra-logs-ui
There still seems to be an asynchronous part of the ML plugin, though, that writes job stats documents to the write alias
Can you remember what exactly was in this document? Was it a datafeed timing stats document?
This is a bug in the ML backend. It shouldn't behave this way. I suspect the problem is that we shouldn't be creating a datafeed timing stats document when a started datafeed is stopped as part of the force-delete process. But if the document that causes the problem is not a datafeed timing stats document then this theory is wrong and it must be something else.
@przemekwitek please keep an eye on this issue and see if there is anything that needs fixing related to the new datafeed timing stats.
That leaves two possible mitigations
Please use mitigation 1, i.e. stop the datafeed first. If you use the other option then you're breaching encapsulation by relying on an internal implementation detail of ML that may change in the future.
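In case it helps, a rough sketch of what mitigation 1 could look like with the @elastic/elasticsearch client (the ids and connection details are placeholders): stop the datafeed first, then delete the datafeed and the job without forcing.

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder connection

async function deleteJobCleanly(jobId: string, datafeedId: string): Promise<void> {
  // With the datafeed stopped first, nothing should still be writing stats
  // documents to the job's write alias while the job is being deleted.
  await client.ml.stopDatafeed({ datafeed_id: datafeedId });
  await client.ml.deleteDatafeed({ datafeed_id: datafeedId });
  await client.ml.deleteJob({ job_id: jobId });
}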
(Or maybe if we can diagnose and fix the ML backend issue quickly enough you won't need a mitigation.)
It is indeed a datafeed timing stats document. Here's the search hit I get:
{
  "_index" : ".ml-anomalies-.write-kibana-logs-ui-felix-1-default-log-entry-rate",
  "_type" : "_doc",
  "_id" : "kibana-logs-ui-felix-1-default-log-entry-rate_datafeed_timing_stats",
  "_score" : 1.0,
  "_source" : {
    "job_id" : "kibana-logs-ui-felix-1-default-log-entry-rate",
    "search_count" : 2,
    "bucket_count" : 0,
    "total_search_time_ms" : 13543.0
  }
}
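(In case it's useful for reproducing this: a search roughly like the following, using the @elastic/elasticsearch client, returns the hit above; the connection setup is a placeholder.)

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder connection

// Look up the datafeed timing stats document in the leftover concrete index.
async function findTimingStatsDoc() {
  const { body } = await client.search({
    index: '.ml-anomalies-.write-kibana-logs-ui-felix-1-default-log-entry-rate',
    body: {
      query: {
        ids: {
          values: ['kibana-logs-ui-felix-1-default-log-entry-rate_datafeed_timing_stats'],
        },
      },
    },
  });
  return body.hits.hits;
}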
ping @elastic/ml-core
Sorry, Zube wasn't showing me the earlier comments at first and so my response didn't take those into account. I deleted that response but in case anyone saw it briefly, I wanted to clarify. :)
Hopefully once elastic/elasticsearch#46485 has been backported to the 7.4 branch the underlying problem will be fixed and no mitigation will be necessary in the Logs UI. This should be backported in time for BC4. However, the mitigation of stopping the datafeed will not hurt if you still want to do it just in case elastic/elasticsearch#46485 doesn't fix the problem as expected.
Thanks for taking the time to fix it! :heart: I'll check after your PR has been merged.
Hopefully once elastic/elasticsearch#46485 has been backported to the 7.4 branch the underlying problem will be fixed and no mitigation will be necessary in the Logs UI. This should be backported in time for BC4.
FYI: It is now backported to 7.4
I couldn't reproduce it with the 7.4 BC4. There's still some uncertainty given the random nature of race conditions, but unless it can be reliably reproduced with any newer version I would consider this solved. Thanks again for the rapid fix!