In our test cluster, we've noticed a large number of model snapshots for our long-running jobs, dating back to the day of the upgrade from 6.5.1 -> 6.6.2 -> 6.7.0.
An initial investigation of the cloud logs (only the past 7 days are available) showed that no maintenance task has been executed.
Pinging @elastic/ml-core
My observations when we were discussing this this morning were:
- MlDailyMaintenanceService: if stop and start are called close together, with start being called before stop has finished doing its work, then the stop that was called before start but finished running afterwards can cancel the newly scheduled timer. (I'm not sure it's possible for this to happen in production, but we could defend against it by making all the methods that access the cancellable variable synchronized; see the sketch further below.)
- MlInitializationService: onMaster can be called before a call made very soon beforehand to offMaster has finished its work. That could be defended against by making all the methods that access mlDailyMaintenanceService synchronized.

To debug this issue, I did a full cluster restart, which should schedule a new nightly maintenance task. But one day later, the maintenance task still had not run (7am server time).
I've also checked the logs and model snapshots of a second long-running cluster that was created directly on 7.3.0 (not upgraded from an older version like the first cluster), with the same result: no sign of the maintenance task running and lots of old model snapshots.
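For illustration, here is a minimal, self-contained sketch of the synchronized defence described in the observations above. It is not the actual MlDailyMaintenanceService code; the class, the executor, and the triggerMaintenance() hook are hypothetical stand-ins. The same idea, making every method that touches the shared state synchronized, would also cover the onMaster/offMaster path in MlInitializationService.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Simplified stand-in for a daily maintenance scheduler. The point of the sketch
 * is the synchronization: because start(), stop() and the rescheduling callback
 * all lock the same monitor, a stop() that was requested before a start() but
 * runs afterwards cannot cancel the timer that the later start() scheduled.
 */
class DailyMaintenanceScheduler {

    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

    // Both fields are guarded by 'this'.
    private ScheduledFuture<?> cancellable;
    private boolean stopped = true;

    public synchronized void start() {
        stopped = false;
        scheduleNext();
    }

    public synchronized void stop() {
        stopped = true;
        if (cancellable != null) {
            cancellable.cancel(false);
            cancellable = null;
        }
    }

    private synchronized void scheduleNext() {
        if (stopped) {
            return; // a late-running stop() cannot sneak a new timer in, and vice versa
        }
        // In the real service the delay would be "time until the next maintenance
        // window after midnight"; a fixed 24-hour delay keeps the sketch short.
        cancellable = executor.schedule(this::runMaintenanceAndReschedule, 24, TimeUnit.HOURS);
    }

    private void runMaintenanceAndReschedule() {
        try {
            triggerMaintenance();   // hypothetical hook: delete expired ML data
        } finally {
            scheduleNext();         // keep the daily cycle going
        }
    }

    private void triggerMaintenance() {
        // Placeholder for the actual expired-data deletion call.
    }
}
```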
The problem is much simpler than the subtle race conditions I thought of. It was introduced in 6.6: https://github.com/elastic/elasticsearch/pull/36702/files#diff-6d8a49bb3f57c032cf9c63498d1e2f89L48
It means nightly maintenance has not run for any cluster on versions 6.6.0 to 7.4.0 inclusive.
I will fix this for 6.8.4/7.4.1 (and also defend the subtle races too).
I opened #47103 to fix this.
The reason it didn't show up in any of our unit or integration tests is that the unit tests don't exercise the real on/off-master listener functionality, and in the integration tests we test the DeleteExpiredDataAction directly, since we cannot have an integration test that waits until the early hours of the morning.
So in the tests in the elasticsearch repo we're testing most pieces of the jigsaw in isolation, but not the full jigsaw. This means it's critical that the long-running QA tests keep an eye on whether daily maintenance actually runs in the early hours of each morning. If it is running, the unit and integration tests show it's probably doing the right thing; if it's not, we're relying entirely on the long-running QA tests to tell us.
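For context, the integration-test coverage mentioned above looks roughly like the sketch below: it invokes the expired-data deletion action directly through the internal client rather than waiting for the nightly trigger. The test class name, the base class, and the setup/assertion details are assumptions for illustration; only DeleteExpiredDataAction itself is the real action under test.

```java
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.xpack.core.ml.action.DeleteExpiredDataAction;

// Hypothetical sketch of an integration test that exercises expired-data deletion
// on demand instead of waiting for the nightly maintenance window.
public class DeleteExpiredDataSketchIT extends ESIntegTestCase {

    public void testDeleteExpiredDataRunsOnDemand() throws Exception {
        // ... set up jobs, results and model snapshots that are past their retention period ...

        // Invoke the action directly via the internal client rather than the nightly timer.
        DeleteExpiredDataAction.Response response =
            client().execute(DeleteExpiredDataAction.INSTANCE, new DeleteExpiredDataAction.Request()).get();

        // The response reports whether the deletion pass completed.
        assertTrue(response.isDeleted());

        // ... assert that the expired snapshots and results have been removed ...
    }
}
```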