Despite previous attempts to fix this (https://github.com/elastic/elasticsearch/pull/40973) the test still fails regularly on the oraclelinux-6 builds in CI
Excerpts from the logs
bash-3.2$ grep test.delete.expired.data elasticsearch-7.1.0-SNAPSHOT/logs/client_rest-high-level_integTestCluster.log
[2019-04-10T09:21:27,783][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [test-delete-expired-data-data] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-04-10T09:21:28,587][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [test-delete-expired-data]
[2019-04-10T09:21:28,626][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2019-04-10T09:21:28,797][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CResourceMonitor.cc@68] Setting model memory limit to 1024 MB
[2019-04-10T09:21:28,822][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Successfully set job state to [opened] for job [test-delete-expired-data]
[2019-04-10T09:21:28,956][INFO ][o.e.x.m.d.DatafeedJob ] [node-0] [test-delete-expired-data] Datafeed started (from: 1970-01-01T00:00:00.000Z to: 2019-04-09T09:21:27.882Z) with frequency [1000ms]
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedJob ] [node-0] [test-delete-expired-data] Lookback has finished
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] attempt to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] try lock [20s] to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]...
[2019-04-10T09:21:29,110][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] stopping datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data], acquired [true]...
[2019-04-10T09:21:29,111][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data] has been stopped
[2019-04-10T09:21:29,160][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Closing job [test-delete-expired-data], because [close job (api)]
[2019-04-10T09:21:29,162][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CCmdSkeleton.cc@45] Handled 240 records
[2019-04-10T09:21:29,162][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151075] [CAnomalyJob.cc@1369] Pruning all models
[2019-04-10T09:21:29,183][INFO ][o.e.x.m.p.AbstractNativeProcess] [node-0] [test-delete-expired-data] State output finished
[2019-04-10T09:21:29,224][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [test-delete-expired-data] 23 buckets parsed from autodetect output
[2019-04-10T09:21:29,281][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [test-delete-expired-data] job closed
[2019-04-10T09:21:30,161][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [test-delete-expired-data]
[2019-04-10T09:21:30,166][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [-13] with latest_record_timestamp [2019-04-09T08:21:27.000Z], job latest_record_timestamp [2019-04-09T08:21:27.882Z]
[2019-04-10T09:21:30,169][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] [node-0] Restoring quantiles for job 'test-delete-expired-data'
[2019-04-10T09:21:30,337][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CResourceMonitor.cc@68] Setting model memory limit to 1024 MB
[2019-04-10T09:21:30,363][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Successfully set job state to [opened] for job [test-delete-expired-data]
[2019-04-10T09:21:30,382][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CAnomalyJob.cc@811] Processing is already complete to time 1554796800
[2019-04-10T09:21:30,405][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CForecastRunner.cc@124] Start forecasting from 2019-04-09T08:00:00+0000 to 2019-04-09T11:00:00+0000
[2019-04-10T09:21:30,406][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CForecastRunner.cc@217] Finished forecasting, wrote 3 records
[2019-04-10T09:21:31,533][INFO ][o.e.x.m.d.DatafeedJob ] [node-0] [test-delete-expired-data] Datafeed started (from: 2019-04-09T08:21:27.883Z to: 2019-04-10T09:21:27.882Z) with frequency [1000ms]
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedJob ] [node-0] [test-delete-expired-data] Lookback has finished
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] attempt to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] try lock [20s] to stop datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data]...
[2019-04-10T09:21:31,731][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] stopping datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data], acquired [true]...
[2019-04-10T09:21:31,732][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] datafeed [test-delete-expired-data-feed] for job [test-delete-expired-data] has been stopped
[2019-04-10T09:21:31,782][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Closing job [test-delete-expired-data], because [close job (api)]
[2019-04-10T09:21:31,783][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CCmdSkeleton.cc@45] Handled 330 records
[2019-04-10T09:21:31,783][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [test-delete-expired-data] [autodetect/151079] [CAnomalyJob.cc@1369] Pruning all models
[2019-04-10T09:21:31,800][INFO ][o.e.x.m.p.AbstractNativeProcess] [node-0] [test-delete-expired-data] State output finished
[2019-04-10T09:21:31,836][INFO ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [test-delete-expired-data] 25 buckets parsed from autodetect output
[2019-04-10T09:21:32,174][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [test-delete-expired-data] job closed
[2019-04-10T09:21:32,876][INFO ][o.e.x.m.a.TransportDeleteJobAction] [node-0] Running DBQ on [.ml-anomalies-shared] for job [test-delete-expired-data]
[2019-04-10T09:21:33,008][INFO ][o.e.x.m.a.TransportDeleteJobAction] [node-0] Job [test-delete-expired-data] deleted
[2019-04-10T09:21:33,021][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node-0] [test-delete-expired-data-data/DhVgmgbrQuCKfImm_q3XEg] deleting index
Loading model snapshot [-13] with latest_record_timestamp in particular indicates a problem with the snapshot IDs
Pinging @elastic/ml-core
The "oraclelinux" is a good suspicion I think, the latest one before muting was this: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/console
Like the 7.x build in the issue description, the logs in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/gcsObjects/ also have Loading model snapshot [-13] in the server-side log relating to this test.
Figuring out how Oracle Linux manages to consistently produce a model snapshot ID of -13 is the key to solving this issue.
-13 was chosen for a different test too:
bash-3.2$ find . -name '*.log' -exec grep 'Loading model snapshot' '{}' ';' | grep -v 'N/A'
[2019-04-10T12:33:30,112][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-start-datafeed] Loading model snapshot [-13] with latest_record_timestamp [2019-04-10T12:32:28.000Z], job latest_record_timestamp [2019-04-10T12:32:28.000Z]
[2019-04-10T12:33:55,235][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [-13] with latest_record_timestamp [2019-04-09T11:33:52.000Z], job latest_record_timestamp [2019-04-09T11:33:52.492Z]
We don't notice it in the other test because it only has an effect in tests that generate two model snapshots and expect to be able to differentiate them. But it's further evidence that Oracle Linux is consistently choosing -13.
The last Oracle Linux master build to succeed was 313, on 25th March. In the logs from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/313/gcsObjects/ we have:
bash-3.2$ find . -name '*.log' -exec grep 'Loading model snapshot' '{}' ';' | grep -v 'N/A'
[2019-03-26T03:28:20,308][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-delete-expired-data] Loading model snapshot [1553570899] with latest_record_timestamp [2019-03-25T02:28:18.000Z], job latest_record_timestamp [2019-03-25T02:28:18.224Z]
[2019-03-26T03:28:29,317][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] [test-start-datafeed] Loading model snapshot [1553570909] with latest_record_timestamp [2019-03-26T03:27:29.000Z], job latest_record_timestamp [2019-03-26T03:27:29.000Z]
Snapshot IDs like this are what we'd expect.
By the time of build 314 the snapshot ID is -13. Build 313 was from commit 58e885c. Build 314 was from commit 2a60ba6. Both used the same ml-cpp commit, so if the problem is on the C++ side it must be caused by undefined behaviour rather than a code change.
Now this test is muted the -13s have just moved to other tests - see #41101.
The REPRO command is:
./gradlew :client:rest-high-level:integTest -Dtests.class=org.elasticsearch.client.MachineLearningIT -Dtests.method=testDeleteExpiredData
Two discoveries:
Linux 2.6.32-696.el6.x86_64 (amd64) to Linux 4.1.12-61.1.28.el6uek.x86_64 (amd64). The uek kernel is Oracle's "unbreakable enterprise kernel", which is specifically _not_ supported according to the support matrix.-13 is being returned due to our system call filter (seccomp). If I create a build of the C++ without seccomp I get different timestamps. Unfortunately they're crazy timestamps like from the year 4977.elastic/ml-cpp#459 should fix this, but then the test will need unmuting once the C++ change has propagated through the build system.
The test that failed in #41101 is not muted, so we can use that to confirm that the C++ fix works, and then I'll raise a PR to unmute this test after that.
Unfortunately they're crazy timestamps like from the year 4977
That turned out to be a mistake in my debugging code. The only real problem was the seccomp filter.
The -13 comes from:
#define EACCES 13 /* Permission denied */
The man page says:
On error,
((time_t) -1)is returned, anderrnois set appropriately.
But I guess in the seccomp case it's setting the return value to ((time_t) -errno) instead.
Most helpful comment
The "oraclelinux" is a good suspicion I think, the latest one before muting was this: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/325/console