Elasticsearch: [ML] Nightly maintenance is not triggered

Created on 24 Sep 2019  路  5Comments  路  Source: elastic/elasticsearch

In our test cluster, we've noticed a large number of model snapshots for our long running jobs, dating back to the day of the upgrade from 6.5.1 -> 6.6.2 -> 6.7.0.
A first investigation in the cloud logs (only past 7 days available) showed that there's no maintenance task executed.

:ml >bug

Most helpful comment

The problem is much simpler than the subtle race conditions I thought of. It was introduced in 6.6: https://github.com/elastic/elasticsearch/pull/36702/files#diff-6d8a49bb3f57c032cf9c63498d1e2f89L48

It means nightly maintenance has not run for any cluster on versions 6.6.0 to 7.4.0 inclusive.

I will fix this for 6.8.4/7.4.1 (and also defend the subtle races too).

All 5 comments

Pinging @elastic/ml-core

My observations when we were discussing this this morning were:

  1. The way the daily maintenance task reschedules is that each invocation schedules the next one. So if some problem stops the task running once then it won't run again.
  2. There is a race condition in MlDailyMaintenanceService where if stop and start are called close together, with start being called before stop has finished doing its work then the stop that was called before start but finished running afterwards can cancel the newly scheduled timer. (I'm not sure it's possible in production for this to happen, but we could defend against it by making all the methods that access the cancellable variable synchronized.)
  3. Similarly there's a potential race in MlInitializationService if onMaster can be called before a call made very soon beforehand to offMaster has finished its work. That could be defended against by making all the methods that access mlDailyMaintenanceService synchronized.

To debug this issue, I did a full cluster restart which should schedule a new nightly maintenance task. But one day later, the maintenance task still did not run (7am server time).
I've also checked logs and model snapshots of a second long running cluster that has been created on 7.3.0 right away (without upgrade from an older version like the first cluster) with the same result: no sign of running the maintenance task and lots of old model snapshots.

The problem is much simpler than the subtle race conditions I thought of. It was introduced in 6.6: https://github.com/elastic/elasticsearch/pull/36702/files#diff-6d8a49bb3f57c032cf9c63498d1e2f89L48

It means nightly maintenance has not run for any cluster on versions 6.6.0 to 7.4.0 inclusive.

I will fix this for 6.8.4/7.4.1 (and also defend the subtle races too).

I opened #47103 to fix this.

The reason it didn't show up in any of our unit/integration tests is that the unit tests don't exercise the real on/off master listener functionality, and in integration tests we actually test the DeleteExpiredDataAction on the basis that we cannot have an integration test waiting for the early hours of the morning.

So in the tests in the elasticsearch repo we're testing most pieces of the jigsaw in isolation but not the full jigsaw. This means it's critical that the long running QA tests keep an eye out for whether daily maintenance is running in the early hours of each morning. If it is running then the unit/integration tests show it's probably doing the right thing. But if it's not running in the early hours of each morning then we're relying on the long running QA tests to tell us.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Praveen82 picture Praveen82  路  3Comments

clintongormley picture clintongormley  路  3Comments

DhairyashilBhosale picture DhairyashilBhosale  路  3Comments

jasontedor picture jasontedor  路  3Comments

rpalsaxena picture rpalsaxena  路  3Comments