Elasticsearch: [CI] Test failure in MachineLearningIT#testDeleteExpiredData

Created on 10 Apr 2019 · 11Comments · Source: elastic/elasticsearch

Despite previous attempts to fix this (https://github.com/elastic/elasticsearch/pull/40973) the test still fails regularly on the oraclelinux-6 builds in CI

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=oraclelinux-6/lastCompletedBuild/testReport/

Excerpts from the logs

bash-3.2$ grep test.delete.expired.data elasticsearch-7.1.0-SNAPSHOT/logs/client_rest-high-level_integTestCluster.log
[2019-04-10T09:21:27,783][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [test-delete-expired-data-data] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-04-10T09:21:28,587][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [test-delete-expired-data]
[2019-04-10T09:21:28,626][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2019-04-10T09:21:28,797][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CResourceMonitor.cc@68] Setting model memory limit to 1024 MB
[2019-04-10T09:21:28,822][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Successfully set job state to [opened] for job [test-delete-expired-data]
[2019-04-10T09:21:28,956][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Datafeed started (from: 1970-01-01T00:00:00.000Z to: 2019-04-09T09:21:27.882Z) with frequency [1000ms]
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Lookback has finished
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] attempt to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] try lock [20s] to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]...
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] stopping datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data], acquired [true]...
[2019-04-10T09:21:29,111][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data] has been stopped
[2019-04-10T09:21:29,160][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Closing job [test-delete-expired-data], because [close job (api)]
[2019-04-10T09:21:29,162][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CCmdSkeleton.cc@45] Handled 240 records
[2019-04-10T09:21:29,162][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CAnomalyJob.cc@1369] Pruning all models
[2019-04-10T09:21:29,183][INFO ][o.e.x.m.p.AbstractNativeProcess] [node-0] [test-delete-expired-data] State output finished
[2019-04-10T09:21:29,224][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [test-delete-expired-data] 23 buckets parsed from autodetect output
[2019-04-10T09:21:29,281][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [test-delete-expired-data] job closed
[2019-04-10T09:21:30,161][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [test-delete-expired-data]
[2019-04-10T09:21:30,166][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [-13] with latest_record_timestamp [2019-04-09T08:21:27.000Z], job latest_record_timestamp [2019-04-09T08:21:27.882Z]
[2019-04-10T09:21:30,169][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] [node-0] Restoring quantiles for job 'test-delete-expired-data'
[2019-04-10T09:21:30,337][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CResourceMonitor.cc@68] Setting model memory limit to 1024 MB
[2019-04-10T09:21:30,363][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Successfully set job state to [opened] for job [test-delete-expired-data]
[2019-04-10T09:21:30,382][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CAnomalyJob.cc@811] Processing is already complete to time 1554796800
[2019-04-10T09:21:30,405][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CForecastRunner.cc@124] Start forecasting from 2019-04-09T08:00:00+0000 to 2019-04-09T11:00:00+0000
[2019-04-10T09:21:30,406][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CForecastRunner.cc@217] Finished forecasting, wrote 3 records
[2019-04-10T09:21:31,533][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Datafeed started (from: 2019-04-09T08:21:27.883Z to: 2019-04-10T09:21:27.882Z) with frequency [1000ms]
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Lookback has finished
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] attempt to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] try lock [20s] to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]...
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] stopping datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data], acquired [true]...
[2019-04-10T09:21:31,732][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data] has been stopped
[2019-04-10T09:21:31,782][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Closing job [test-delete-expired-data], because [close job (api)]
[2019-04-10T09:21:31,783][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CCmdSkeleton.cc@45] Handled 330 records
[2019-04-10T09:21:31,783][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CAnomalyJob.cc@1369] Pruning all models
[2019-04-10T09:21:31,800][INFO ][o.e.x.m.p.AbstractNativeProcess] [node-0] [test-delete-expired-data] State output finished
[2019-04-10T09:21:31,836][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [test-delete-expired-data] 25 buckets parsed from autodetect output
[2019-04-10T09:21:32,174][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [test-delete-expired-data] job closed
[2019-04-10T09:21:32,876][INFO ][o.e.x.m.a.TransportDeleteJobAction] [node-0] Running DBQ on [.ml-anomalies-shared] for job [test-delete-expired-data]
[2019-04-10T09:21:33,008][INFO ][o.e.x.m.a.TransportDeleteJobAction] [node-0] Job [test-delete-expired-data] deleted
[2019-04-10T09:21:33,021][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node-0] [test-delete-expired-data-data/DhVgmgbrQuCKfImm_q3XEg] deleting index

Loading model snapshot [-13] with latest_record_timestamp in particular indicates a problem with the snapshot IDs

:ml >bug >test-failure v6.7.2 v7.0.0 v7.2.0 v8.0.0

Source

edsavage

Most helpful comment

The "oraclelinux" is a good suspicion I think, the latest one before muting was this: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/console

cbuescher on 10 Apr 2019

👍2

All 11 comments

Pinging @elastic/ml-core

elasticmachine on 10 Apr 2019

cbuescher on 10 Apr 2019

👍2

Like the 7.x build in the issue description, the logs in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/gcsObjects/ also have Loading model snapshot [-13] in the server-side log relating to this test.

Figuring out how Oracle Linux manages to consistently produce a model snapshot ID of -13 is the key to solving this issue.

droberts195 on 10 Apr 2019

-13 was chosen for a different test too:

bash-3.2$ find . -name '*.log' -exec grep 'Loading model snapshot' '{}' ';' | grep -v 'N/A'
[2019-04-10T12:33:30,112][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-start-datafeed] Loading model snapshot [-13] with latest_record_timestamp [2019-04-10T12:32:28.000Z], job latest_record_timestamp [2019-04-10T12:32:28.000Z]
[2019-04-10T12:33:55,235][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [-13] with latest_record_timestamp [2019-04-09T11:33:52.000Z], job latest_record_timestamp [2019-04-09T11:33:52.492Z]

We don't notice it in the other test because it only has an effect in tests that generate two model snapshots and expect to be able to differentiate them. But it's further evidence that Oracle Linux is consistently choosing -13.

droberts195 on 10 Apr 2019

The last Oracle Linux master build to succeed was 313, on 25th March. In the logs from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/313/gcsObjects/ we have:

bash-3.2$ find . -name '*.log' -exec grep 'Loading model snapshot' '{}' ';' | grep -v 'N/A'
[2019-03-26T03:28:20,308][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [1553570899] with latest_record_timestamp [2019-03-25T02:28:18.000Z], job latest_record_timestamp [2019-03-25T02:28:18.224Z]
[2019-03-26T03:28:29,317][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-start-datafeed] Loading model snapshot [1553570909] with latest_record_timestamp [2019-03-26T03:27:29.000Z], job latest_record_timestamp [2019-03-26T03:27:29.000Z]

Snapshot IDs like this are what we'd expect.

droberts195 on 10 Apr 2019

By the time of build 314 the snapshot ID is -13. Build 313 was from commit 58e885c. Build 314 was from commit 2a60ba6. Both used the same ml-cpp commit, so if the problem is on the C++ side it must be caused by undefined behaviour rather than a code change.

droberts195 on 10 Apr 2019

Now this test is muted the -13s have just moved to other tests - see #41101.