Cylc-flow: cylc clean

Created on 22 Oct 2020 · 27 Comments · Source: cylc/cylc-flow

A new command for housekeeping workflows and their files.

  1. ~Remove stopped workflows on the local scheduler filesystem. #3961~

    • target: 8.0b0

    • Until (2) is implemented, this will fail if run on a host which doesn't have access to that filesystem (i.e. no SSH logic required).

  2. ~Remove stopped workflows on all filesystems. #4017~

    • target: 8.0b0

    • ~attempt after #3827~

    • Obtain a list of used platforms from the task_jobs table in the database.

    • Use the global config to reduce this to an install target mapping {install_target: [platform, ...]}.

    • Shuffle the lists of platforms to randomise them.

    • Attempt to remove workflow files using the first platform for each install target.

    • Use host selection as normal for the platform.

    • If it fails try another host, if that fails move on to the next platform.

  3. ~A targeted version of (1) & (2) e.g. delete just the log directory. #4237~

    • target: may be needed for 8.0.0 else 8.x

    • An extension of (1) and (2) to allow more targeted removal of dirs within a workflow.

  4. Cycle aware housekeeping of log, work and share on running or stopped workflows.

    • target: 8.x

    • An extension of (3) which enables the removal of task files from cycles before x.

    • A direct replacement of rose_prune functionality.

    • Should be able to target tasks and cycle point ranges.

    • Intended for use in housekeep tasks, should detect and use the CYLC_TASK_CYCLE_POINT variable.

    • Closes #1159

Note: Part (4) is pending requirements gathering and implementation proposal.
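
The fallback order described in part (2) can be sketched as below. This is a minimal illustration, not the Cylc implementation: `clean_all_targets`, `hosts_for` and `remote_clean` are hypothetical names, the `{install_target: [platform, ...]}` mapping is taken as given, and the sketch assumes "try another host" means trying every host the platform offers before moving on.

```python
import random
from typing import Callable, Dict, List


def clean_all_targets(
    target_map: Dict[str, List[str]],
    hosts_for: Callable[[str], List[str]],
    remote_clean: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Try to clean each install target once, falling back on failure.

    target_map   -- {install_target: [platform, ...]} (hypothetical input)
    hosts_for    -- candidate hosts for a platform, in selection order
    remote_clean -- runs the clean on (platform, host); True on success
    """
    results: Dict[str, bool] = {}
    for target, platforms in target_map.items():
        candidates = list(platforms)
        random.shuffle(candidates)  # randomise which platform is tried first
        # any() stops at the first success, so each install target is
        # cleaned at most once; hosts are exhausted before the next platform.
        results[target] = any(
            remote_clean(platform, host)
            for platform in candidates
            for host in hosts_for(platform)
        )
    return results
```

A target mapped to `False` in the result is one where every host of every used platform failed.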

All 27 comments

Note: @dpmatthews has suggested that we might not want to expose the "cycle aware housekeeping" via the CLI which makes some sense as it would be nicer to configure housekeeping in the workflow rather than having a dedicated housekeep task. So (4) might not be related to the CLI, however, we would expect it to share the same logic as the cylc clean command.

For part 1,

this will fail if run on a host which doesn't have access to that filesystem (i.e. no SSH logic required).

Without checking the DB, cylc clean foo will remove ~/cylc-run/foo on localhost, without knowing about any remote installations. I.e., it will appear to succeed. Is this okay?

It's an acceptable half-way-house until part (2) is implemented. Not good enough for 8.0.0 release.

From part 2:

  • Obtain a list of used platforms from the task_jobs table in the database.
  • Use the global config to reduce this to an install target mapping {install_target: [platform, ...]}.

What if the global config has changed in between cylc run/cylc install and cylc clean, such that the install target for the workflow's particular platform is now different? This would mean the workflow dirs on the original install target won't get removed. Would it instead make sense to log the install target in the task_jobs table of the DB?

Would it instead make sense to log the install target in the task_jobs table of the DB?

If the install target has been changed for one platform then it will have been changed for all platforms (that used the same install target), so knowing what it was before the change won't be any help.

A few quick examples of platforms config and clean locations:

~Use alternative hosts/platforms in the event of SSH errors (functionality to be added in #3827)~

[platforms]
  [[foo]]
    install target = a
  [[bar]]
    install target = a

~Task ran on foo, attempt clean up on either foo or bar. If it fails due to SSH/network issues try other hosts in the foo platform else move on to bar.~

Scratch this, users might not have access to all platforms on an install target. This is a somewhat facetious case but it's simpler this way anyhoo.

Batch operations on the same install target

[platforms]
  [[foo]]
    install target = a
  [[bar]]
    install target = a

Tasks ran on foo and bar, clean up on either foo or bar.

Always use localhost where possible

[platforms]
  [[localhost]]
    install target = localhost
  [[foo]]
    install target = localhost

If the task ran on platform foo, we use localhost to clean up.

Fail for missing platforms

[platforms]
  [[foo]]

If the task ran on bar, fail, we don't know what install target bar would have used.
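
The three rules above (batch by install target, prefer localhost, fail for missing platforms) could be expressed roughly as follows. This is a sketch under stated assumptions: the config is modelled as a plain dict, `group_by_install_target` and `UnknownPlatformError` are hypothetical names, and the install target is assumed to default to the platform name when unset.

```python
from typing import Dict, Iterable, List, Optional


class UnknownPlatformError(Exception):
    """A platform recorded in the database is missing from the config."""


def group_by_install_target(
    used_platforms: Iterable[str],
    platform_config: Dict[str, Dict[str, Optional[str]]],
) -> Dict[str, List[str]]:
    """Return {install_target: [platform, ...]} for the platforms used."""
    target_map: Dict[str, List[str]] = {}
    for name in sorted(set(used_platforms)):
        if name not in platform_config:
            # We cannot know which install target this platform used.
            raise UnknownPlatformError(name)
        # Assumption: an unset install target defaults to the platform name.
        target = platform_config[name].get("install target") or name
        target_map.setdefault(target, []).append(name)
    if "localhost" in target_map:
        # Anything sharing the localhost install target is cleaned locally.
        target_map["localhost"] = ["localhost"]
    return target_map
```

With the first example config, tasks that ran on foo and bar give `{"a": ["bar", "foo"]}`, so only one of the two platforms needs to be contacted.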

Skip platforms if the hosts config is provided but set to null

Separate Issue

Solve this post 8.0.0
https://github.com/cylc/cylc-flow/issues/3991

It should be possible to "retire" platforms in the config like so:

[platforms]
  [[foo]]
  [[bar]]
    hosts =  # empty host list I.E. a platform with no nodes
    install target = foo

If the task ran on bar, use foo to clean up.

Batch operations on the same install target

[platforms]
  [[foo]]
    install target = a
  [[bar]]
    install target = a

Tasks ran on foo and bar, clean up on _either_ foo _or_ bar.

I'm not sure what "clean up on a platform" means; if the install target is the same, what difference does it make to "clean up on foo" or "clean up on bar"? Is it if things like the [platforms][X]host or [platforms][X]ssh command are different between foo and bar? And if so, you're saying it doesn't matter which one to use if the install targets are the same?

[platforms]
  [[foo]]
    install target = a
    hosts = foo
  [[bar]]
    install target = a
    hosts = bar

I'm not sure what "clean up on a platform" means

Ah, ok, I mean "pick a host from that platform then invoke the clean script on that platform over SSH".

what difference does it make to "clean up on foo" or "clean up on bar"

None whatsoever, which is the point. The important thing is that we only clean up on one of them rather than both.

And if so, you're saying it doesn't matter which one to use if the install targets are the same?

Yep, stuff like ssh command is used internally by Cylc to construct SSH commands, etc. This configuration is attached to the platform not the install target.

From the team meeting today, it sounds like we'll need a --force option. However, what exactly should that do? Either

  • Remove the stopped workflow on the local filesystem even if an error occurs removing it on any remote platforms
  • As above, but also stop and remove the workflow even if it appears to be running

(I just ran into a case where the workflow stopped responding, I did Ctrl+C, it said it shut down, but the contact file was left over so it looked like it was still running)

Remove the stopped workflow on the local filesystem even if an error occurs removing it on any remote platforms

^ That one

cylc clean should never attempt to remove running workflows.
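
A rough sketch of the agreed --force semantics, with all names hypothetical (`clean`, `WorkflowRunningError`, and the injected callables stand in for whatever the real command uses): a running workflow is always refused, and force only changes what happens after a remote failure.

```python
from typing import Callable, Iterable, List


class WorkflowRunningError(Exception):
    """Raised when asked to clean a workflow that appears to be running."""


def clean(
    is_running: Callable[[], bool],
    install_targets: Iterable[str],
    remote_clean: Callable[[str], bool],
    local_clean: Callable[[], None],
    force: bool = False,
) -> List[str]:
    """Clean remote install targets, then local; return failed targets."""
    if is_running():
        # force never overrides this check.
        raise WorkflowRunningError("cannot clean a running workflow")
    failed = [t for t in install_targets if not remote_clean(t)]
    if failed and not force:
        # Without force, keep the local files when a remote clean failed.
        raise RuntimeError(f"remote clean failed on: {', '.join(failed)}")
    local_clean()
    return failed
```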

What if the contact file is left over, but the workflow is actually stopped? Should I be using a more sophisticated method than suite_files.detect_old_contact_file()?

No, detect_old_contact_file is about as sophisticated a method as is possible!

It goes to the server the flow started on, queries the process ID and checks to ensure the command matches the one the flow was started with.
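
The idea of that check can be illustrated as below. This is not the code of detect_old_contact_file: it is a simplified local sketch that assumes a POSIX `ps` is available and skips the step of going to the server the flow started on. `ps_command` and `workflow_still_running` are hypothetical names.

```python
import subprocess
from typing import Callable, Optional


def ps_command(pid: int) -> Optional[str]:
    """Return the command line of a local process, or None if it is gone."""
    proc = subprocess.run(
        ["ps", "-o", "args=", "-p", str(pid)],  # "args=" suppresses the header
        capture_output=True,
        text=True,
        check=False,
    )
    output = proc.stdout.strip()
    return output or None


def workflow_still_running(
    pid: int,
    recorded_command: str,
    lookup: Callable[[int], Optional[str]] = ps_command,
) -> bool:
    """True only if the PID exists and is running the recorded command."""
    return lookup(pid) == recorded_command
```

Comparing the command as well as the PID guards against the PID having been reused by an unrelated process after the workflow stopped.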

Ah wait, the bug I faced was #3994

I did Ctrl+C on the unresponsive workflow and it said Suite shutting down. However, when I did cylc clean, detect_old_contact_file() raised an error, saying the workflow was still running. The contact file was still there on remote. But cylc stop on both local and remote said the workflow was already stopped. Doing ps <pid> didn't show anything. So I had to ssh to the remote and delete the contact file.

Anyway, rebasing the topic branch onto master solved this.

If the user has multiple run dirs in a dir under cylc-run, e.g.

`-- badger
    |-- foo
    |   `-- flow.cylc etc
    |-- bar
    |   `-- flow.cylc etc

What should happen if they run cylc clean badger?

I'm guessing it will have to iterate over the subdirs to find run dirs, due to the fact that the run dirs may use symlink dirs, and the database needs to be looked up for remote installs.

We could decide to support only run dirs, at least initially. Consider a follow-up to handle nesting.
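
If nesting were handled, the iteration over subdirectories might look something like this. It is a sketch only: `find_run_dirs` is a hypothetical helper, and it assumes a run directory is identified by the presence of a flow.cylc file.

```python
import os
from typing import List


def find_run_dirs(top: str, marker: str = "flow.cylc") -> List[str]:
    """Return directories under `top` that contain the marker file."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        if marker in filenames:
            found.append(dirpath)
            dirnames.clear()  # do not descend into a run directory
    return sorted(found)
```

For the layout above, `find_run_dirs` on the badger directory would return the foo and bar run dirs, each of which could then be cleaned individually (including its database lookup for remote installs).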

With the universal ID this sort of thing may become more implicit e.g:

$ cylc install --flow-name badger/foo
$ cylc run badger/bar
$ cylc trigger 'badger/*' //20000101T00Z/mytask
$ cylc clean badger/foo
$ cylc clean 'badger/*'
$ cylc stop '*'
$ cylc clean '*'

We could decide to support only run dirs, at least initially. Consider a follow-up to handle nesting.

What if the directory has had flow.cylc deleted, for example? And the user just wants to remove it anyway? I suppose removing it anyway could be part of the behaviour of --force later on.

A bit facetious, don't need to worry about that. If they delete the flow.cylc file then it is no longer a run directory managed by Cylc.

We wouldn't expect users to do much if any manual fiddling with the Cylc managed cylc-run directory and if they do they are responsible for managing this themselves.

As far as I can see, if you run cylc clean --local-only you remove your ability to subsequently remove non-local installs (since their locations are in a database cleaned by the first command).
Is this the case?
Is this desirable?

If this is the case I can see a couple of possible solutions:

  1. cylc clean --platform <name> (simple, but requires the user to know where they want to clean things from. Hopefully they'll know this from the suite definition).
  2. cylc clean --platforms-from-definition <path> - Pick up platforms used from flow.cylc. (Fails if definition has changed, but hopefully not in a problematic way - if an install target isn't being used it probably won't matter from a workflow point of view - users hitting space limits might disagree!)
  3. Move the timestamped database file into ~/cylc-run/.cleaned_flows/<flow-name>-<timestamp.db>. Perhaps include option --local-hard with existing behavior. (I don't actually like this, but it's a possibility).
  4. Document this as a danger.

Is this the case?

Yes.

Is this desirable?

Not quite, but also, if you don't want that to happen don't use --local-only.

cylc clean --platform

Currently toying with this along with other things in a cylc-admin proposal, opinions welcome but do note, it's a WIP and the document is laying out a rough plan for what could be implemented rather than what will be implemented (in order to ensure the interface is forward compatible).

--local-only would be shorthand for --platform '<scheduler>'.

Maybe --local-only should not be an option. What's the use case for local clean only, as opposed to clean everything?

Covered to some extent in this proposal - https://github.com/cylc/cylc-admin/pull/118

Examples:

  • Remove the suite db to allow re-run.
  • Delete retrieved job log files on the scheduler host without bothering remote filesystems.
  • Delete a workflow locally after remote clean failed.
  • File transfer?

Maybe --local-only should not be an option. What's the use case for local clean only, as opposed to clean everything?

Even if we don't offer --local-only publicly, it needs to be there for internal use - for running cylc clean --local-only my_workflow via ssh on the remote host. But I think

  • Delete a workflow locally after remote clean failed.

is a pretty strong reason to keep it available publicly

@dpmatthews suggested a possible:

  1. Log all clean commands (if the run dir or log dir weren't cleaned)

... the thinking being that if a user does a partial clean and then restarts a workflow it's good to have some evidence of why things might not be working

As part of part 3 (targeted clean), I think that perhaps globs should not match the possible symlink dirs? E.g. if a user does cylc clean myflow --rm 'wo*', it should not remove the work directory, you would have to explicitly do --rm work.

Main reason I am asking is that it would make the implementation easier. Otherwise, as it stands, doing --rm 'wo*' removes the work symlink but not its target, whereas --rm work removes both.

Update: probably best thing to do is just rejig the logic so that --rm 'wo*' would remove the work symlink dir and its target (but not remove any targets of user-created symlinks)
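
The rejigged logic from the update could be sketched like so. All names here are hypothetical, and the set of Cylc-managed symlink dirs is an illustrative assumption rather than the definitive list.

```python
import glob
import os
from typing import Iterable, List, Set

# Assumption: top-level dirs that Cylc itself may create as symlinks.
SYMLINK_DIRS = {"log", "share", "work"}


def paths_to_remove(run_dir: str, patterns: Iterable[str]) -> List[str]:
    """Expand --rm patterns; add targets of Cylc-managed symlinks only."""
    removals: Set[str] = set()
    for pattern in patterns:
        # Escape run_dir so only the user's pattern is treated as a glob.
        for path in glob.glob(os.path.join(glob.escape(run_dir), pattern)):
            removals.add(path)
            rel = os.path.relpath(path, run_dir)
            if os.path.islink(path) and rel in SYMLINK_DIRS:
                # Managed symlink dir: remove its target as well.
                removals.add(os.path.realpath(path))
    return sorted(removals)
```

With this approach --rm 'wo*' and --rm work behave the same for the work dir, while a user-created symlink that happens to match the glob is removed without touching its target.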

The important 8.0.0 tasks have been completed pending documented follow-up issues.

Bumping the remainder of this issue back to 8.x.
