Cylc-flow: Remove named checkpointing

Created on 26 Oct 2020  Â·  16Comments  Â·  Source: cylc/cylc-flow

Supersedes #3864

Long story short:

  • no one uses it
  • Cylc 8 reflow achieves the same result better
  • it complicates the DB and code logic somewhat

So: let's get rid of it.

@MetRonnie - assigning you as you've run into this somewhat in #3863, but someone else can take it on if need be...

Most helpful comment

It's quite nice when you are trawling through the logs of some massive suite, or one where the user has repeatably restarted it in the hope that they might get a different outcome.

Could record the re(start) number as a suite parameter?

All 16 comments

On the Discourse discussion, someone has mentioned named vs automatic checkpoints:

To confirm, this is just for the named checkpoints? Not the checkpointing automatic/recovery capability? We very much use the automatic checkpointing for recovery but not the named checkpoints.

I think I can find 3 types of automatic checkpoints:

  1. restart
    https://github.com/cylc/cylc-flow/blob/50734095e7c9fa208af18ce7b65c10597e207f89/cylc/flow/scheduler.py#L426-L429
  2. reload-init
    https://github.com/cylc/cylc-flow/blob/50734095e7c9fa208af18ce7b65c10597e207f89/cylc/flow/scheduler.py#L986
  3. reload-done
    https://github.com/cylc/cylc-flow/blob/50734095e7c9fa208af18ce7b65c10597e207f89/cylc/flow/scheduler.py#L1443-L1445

What is to happen to those? Technically they are named checkpoints. If their functionality is to be kept intact, does that mean we can't removed named checkpoints?

(Adding this to the 8.0a3 milestone, unless that was not the intent)

Drop the two automatic reload checkpoints (2 and 3) as well as arbitrary checkpoints taken with cylc checkpoint.

Keep the restart checkpoint, since without that we couldn't restart from the latest state. This gets overwritten whenever anything changes, so there's only ever one of them. Reflow can't replace this (e.g. a proper restart knows about tasks that were running the scheduler went down).

As I recall the "tatest" (restart) checkpoint appears in the checkpoint_id table, but after this it will be the only entry so there's no point in keeping the table. Further, if you look in rundb.py "latest" is always treated as a special case. It points to to the task_pool table rather than task_pool_checkpoints; and suite_params rather than suite_params_checkpoints etc.

So the checkpoint_id table, and the three blah_checkpoints tables, cylc checkpoint, (and all associated code), can be removed.

(Arrgh, lost control of keyboard! - reopening!)

@oliver-sanders - I wonder if we should rename the "latest checkpoint" (/"latest_state"/"latest snapshot") tables consistently, to make it obvious what they are and distinguish from the ever-growing run history tables?

Currently:

  • broadcast_states
  • suite_params
  • task_pool

New?

  • broadcast_latest
  • suite_param_latest
  • task_pool_latest

Hmm, one thing I forgot: we also record a named checkpoint at every restart.

sqlite> select * from checkpoint_id;
0|2020-10-28T11:42:38+13:00|latest
1|2020-10-28T11:42:19+13:00|restart
2|2020-10-28T11:42:26+13:00|restart

I've never heard of anyone wanting to restart from a previous (not latest) restart state with all the broadcasts that happened to be extant at the time. So I suppose it's fine to ditch that. (However, we should warn users about broadcasts in the context of reflows: beware of any reflowed tasks that expect input via broadcasts from other pre-reflow tasks - mind you, the same is already true for warm starts. We should advise to use broadcast sparingly and to know where they are in the flow).

However, suite_params_checkpoints serves as a historical record of global parameters that only change during restart or reload:

sqlite> select * from suite_params;
uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
cylc_version|8.0a3.dev
UTC_mode|0
cycle_point_tz|+1300

sqlite> select * from suite_params_checkpoints;
1|uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
1|cylc_version|8.0a3.dev
1|UTC_mode|0
1|cycle_point_tz|+1300
2|uuid_str|fe7c3ac4-584a-498e-a273-2a9277970cd4
2|cylc_version|8.0a3.dev
2|UTC_mode|0
2|cycle_point_tz|+1300

Perhaps we should replace this with an ever-growing suite_params history table?

On the other hand, the DB does not track changes to everything in the flow config, so maybe it doesn't matter.

It would be nice from a provenance perspective to maintain this information, however, I don't think it's a biggie, can tackle later on if required.

However, suite_params_checkpoints serves as a historical record of global parameters that only change during restart or reload

Am I right in thinking only the cylc_version can change on a restart, and that actually none of them could be different between reloads?

Update: Ah wait, the initial cycle point etc could change

Yep, also the user can specify different suite variables (e.g -s FOO=BAR)

I've put up these changes: https://github.com/cylc/cylc-flow/pull/3906/files/8b1f2956de934b1d2996e3e2310c7274818c18da..0eb2b03cff3fc53822ba4293bfe4fbcbc582e9a5

So far I haven't removed the checkpoint_id and suite_params_checkpoints tables - was it decided to remove them or not?

Yes remove them. The suite_params aren’t very useful as-is (particularly without corresponding state checkpoints) and as Oliver noted we can come back to a better provenance solution later if needed.

What about keeping track of the number of restarts? Could we just have a table that consists of 1 cell that keeps the count?

Do we need to keep track of that?

Maybe (but not on this PR) a new “suite events” table that keeps a history of all restarts and interventions ...

The number of the restart gets printed to the log on every restart. Can either comment that out or get rid of it

Ok, I forgot about that. Seems to me we could get rid of it now, in lieu of a better provenance system in future. But maybe do it as a clean commit that can be backed out if others disagree! (Although I can’t think of any real use for that info at the moment)

It's quite nice when you are trawling through the logs of some massive suite, or one where the user has repeatably restarted it in the hope that they might get a different outcome.

Could record the re(start) number as a suite parameter?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sadielbartholomew picture sadielbartholomew  Â·  4Comments

kinow picture kinow  Â·  4Comments

oliver-sanders picture oliver-sanders  Â·  5Comments

kinow picture kinow  Â·  4Comments

dpmatthews picture dpmatthews  Â·  3Comments