Elasticsearch: [CI] Test failure in MachineLearningIT#testDeleteExpiredData

Created on 10 Apr 2019  路  11Comments  路  Source: elastic/elasticsearch

Despite previous attempts to fix this (https://github.com/elastic/elasticsearch/pull/40973) the test still fails regularly on the oraclelinux-6 builds in CI

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=oraclelinux-6/lastCompletedBuild/testReport/

Excerpts from the logs

bash-3.2$ grep test.delete.expired.data elasticsearch-7.1.0-SNAPSHOT/logs/client_rest-high-level_integTestCluster.log
[2019-04-10T09:21:27,783][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [test-delete-expired-data-data] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-04-10T09:21:28,587][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [test-delete-expired-data]
[2019-04-10T09:21:28,626][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2019-04-10T09:21:28,797][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CResourceMonitor.cc@68] Setting model memory limit to 1024 MB
[2019-04-10T09:21:28,822][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Successfully set job state to [opened] for job [test-delete-expired-data]
[2019-04-10T09:21:28,956][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Datafeed started (from: 1970-01-01T00:00:00.000Z to: 2019-04-09T09:21:27.882Z) with frequency [1000ms]
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Lookback has finished
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] attempt to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] try lock [20s] to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]...
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] stopping datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data], acquired [true]...
[2019-04-10T09:21:29,111][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data] has been stopped
[2019-04-10T09:21:29,160][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Closing job [test-delete-expired-data], because [close job (api)]
[2019-04-10T09:21:29,162][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CCmdSkeleton.cc@45] Handled 240 records
[2019-04-10T09:21:29,162][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CAnomalyJob.cc@1369] Pruning all models
[2019-04-10T09:21:29,183][INFO ][o.e.x.m.p.AbstractNativeProcess] [node-0] [test-delete-expired-data] State output finished
[2019-04-10T09:21:29,224][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [test-delete-expired-data] 23 buckets parsed from autodetect output
[2019-04-10T09:21:29,281][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [test-delete-expired-data] job closed
[2019-04-10T09:21:30,161][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [test-delete-expired-data]
[2019-04-10T09:21:30,166][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [-13] with latest_record_timestamp [2019-04-09T08:21:27.000Z], job latest_record_timestamp [2019-04-09T08:21:27.882Z]
[2019-04-10T09:21:30,169][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] [node-0] Restoring quantiles for job 'test-delete-expired-data'
[2019-04-10T09:21:30,337][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CResourceMonitor.cc@68] Setting model memory limit to 1024 MB
[2019-04-10T09:21:30,363][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Successfully set job state to [opened] for job [test-delete-expired-data]
[2019-04-10T09:21:30,382][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CAnomalyJob.cc@811] Processing is already complete to time 1554796800
[2019-04-10T09:21:30,405][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CForecastRunner.cc@124] Start forecasting from 2019-04-09T08:00:00+0000 to 2019-04-09T11:00:00+0000
[2019-04-10T09:21:30,406][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CForecastRunner.cc@217] Finished forecasting, wrote 3 records
[2019-04-10T09:21:31,533][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Datafeed started (from: 2019-04-09T08:21:27.883Z to: 2019-04-10T09:21:27.882Z) with frequency [1000ms]
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [test-delete-expired-data] Lookback has finished
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] attempt to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] try lock [20s] to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]...
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] stopping datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data], acquired [true]...
[2019-04-10T09:21:31,732][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data] has been stopped
[2019-04-10T09:21:31,782][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Closing job [test-delete-expired-data], because [close job (api)]
[2019-04-10T09:21:31,783][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CCmdSkeleton.cc@45] Handled 330 records
[2019-04-10T09:21:31,783][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CAnomalyJob.cc@1369] Pruning all models
[2019-04-10T09:21:31,800][INFO ][o.e.x.m.p.AbstractNativeProcess] [node-0] [test-delete-expired-data] State output finished
[2019-04-10T09:21:31,836][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [test-delete-expired-data] 25 buckets parsed from autodetect output
[2019-04-10T09:21:32,174][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [test-delete-expired-data] job closed
[2019-04-10T09:21:32,876][INFO ][o.e.x.m.a.TransportDeleteJobAction] [node-0] Running DBQ on [.ml-anomalies-shared] for job [test-delete-expired-data]
[2019-04-10T09:21:33,008][INFO ][o.e.x.m.a.TransportDeleteJobAction] [node-0] Job [test-delete-expired-data] deleted
[2019-04-10T09:21:33,021][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node-0] [test-delete-expired-data-data/DhVgmgbrQuCKfImm_q3XEg] deleting index

Loading model snapshot [-13] with latest_record_timestamp in particular indicates a problem with the snapshot IDs

:ml >bug >test-failure v6.7.2 v7.0.0 v7.2.0 v8.0.0

Most helpful comment

The "oraclelinux" is a good suspicion I think, the latest one before muting was this: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/console

All 11 comments

Pinging @elastic/ml-core

The "oraclelinux" is a good suspicion I think, the latest one before muting was this: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/console

Like the 7.x build in the issue description, the logs in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/gcsObjects/ also have Loading model snapshot [-13] in the server-side log relating to this test.

Figuring out how Oracle Linux manages to consistently produce a model snapshot ID of -13 is the key to solving this issue.

-13 was chosen for a different test too:

bash-3.2$ find . -name '*.log' -exec grep 'Loading model snapshot' '{}' ';' | grep -v 'N/A'
[2019-04-10T12:33:30,112][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-start-datafeed] Loading model snapshot [-13] with latest_record_timestamp [2019-04-10T12:32:28.000Z], job latest_record_timestamp [2019-04-10T12:32:28.000Z]
[2019-04-10T12:33:55,235][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [-13] with latest_record_timestamp [2019-04-09T11:33:52.000Z], job latest_record_timestamp [2019-04-09T11:33:52.492Z]

We don't notice it in the other test because it only has an effect in tests that generate two model snapshots and expect to be able to differentiate them. But it's further evidence that Oracle Linux is consistently choosing -13.

The last Oracle Linux master build to succeed was 313, on 25th March. In the logs from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/313/gcsObjects/ we have:

bash-3.2$ find . -name '*.log' -exec grep 'Loading model snapshot' '{}' ';' | grep -v 'N/A'
[2019-03-26T03:28:20,308][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [1553570899] with latest_record_timestamp [2019-03-25T02:28:18.000Z], job latest_record_timestamp [2019-03-25T02:28:18.224Z]
[2019-03-26T03:28:29,317][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-start-datafeed] Loading model snapshot [1553570909] with latest_record_timestamp [2019-03-26T03:27:29.000Z], job latest_record_timestamp [2019-03-26T03:27:29.000Z]

Snapshot IDs like this are what we'd expect.

By the time of build 314 the snapshot ID is -13. Build 313 was from commit 58e885c. Build 314 was from commit 2a60ba6. Both used the same ml-cpp commit, so if the problem is on the C++ side it must be caused by undefined behaviour rather than a code change.

Now this test is muted the -13s have just moved to other tests - see #41101.

The REPRO command is:

./gradlew :client:rest-high-level:integTest -Dtests.class=org.elasticsearch.client.MachineLearningIT -Dtests.method=testDeleteExpiredData

Two discoveries:

  1. The change that occurred between builds 313 and 314 is that the kernel on the Oracle Linux 6 CI workers changed from Linux 2.6.32-696.el6.x86_64 (amd64) to Linux 4.1.12-61.1.28.el6uek.x86_64 (amd64). The uek kernel is Oracle's "unbreakable enterprise kernel", which is specifically _not_ supported according to the support matrix.
  2. The -13 is being returned due to our system call filter (seccomp). If I create a build of the C++ without seccomp I get different timestamps. Unfortunately they're crazy timestamps like from the year 4977.

elastic/ml-cpp#459 should fix this, but then the test will need unmuting once the C++ change has propagated through the build system.

The test that failed in #41101 is not muted, so we can use that to confirm that the C++ fix works, and then I'll raise a PR to unmute this test after that.

Unfortunately they're crazy timestamps like from the year 4977

That turned out to be a mistake in my debugging code. The only real problem was the seccomp filter.

The -13 comes from:

#define EACCES      13  /* Permission denied */

The man page says:

On error, ((time_t) -1) is returned, and errno is set appropriately.

But I guess in the seccomp case it's setting the return value to ((time_t) -errno) instead.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

brwe picture brwe  路  3Comments

rpalsaxena picture rpalsaxena  路  3Comments

Praveen82 picture Praveen82  路  3Comments

clintongormley picture clintongormley  路  3Comments

ttaranov picture ttaranov  路  3Comments