A new command for housekeeping workflows and their files.
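- Obtain a list of platforms used from the task_jobs table in the database.
- Use the global config to reduce this to an install target mapping {install_target: [platform, ...]}.
- ... log directory. #4237
- ~log, work and share on running or stopped workflows.
- ... x.
- ... rose_prune functionality.
- ... CYLC_TASK_CYCLE_POINT variable.

Note: Part (4) is pending requirements gathering and implementation proposal.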
Note: @dpmatthews has suggested that we might not want to expose the "cycle aware housekeeping" via the CLI which makes some sense as it would be nicer to configure housekeeping in the workflow rather than having a dedicated housekeep task. So (4) might not be related to the CLI, however, we would expect it to share the same logic as the cylc clean command.
For part 1,
this will fail if run on a host which doesn't have access to that filesystem (i.e. no SSH logic required).
Without checking the DB, cylc clean foo will remove ~/cylc-run/foo on localhost, without knowing about any remote installations. I.e., it will appear to succeed. Is this okay?
It's an acceptable halfway house until part (2) is implemented, but not good enough for the 8.0.0 release.
From part 2:
- Obtain a list of platforms used from the task_jobs table in the database.
- Use the global config to reduce this to an install target mapping {install_target: [platform, ...]}.
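For illustration, the two steps quoted above could be sketched roughly as below. This is not the Cylc implementation: the `platform_name` column in `task_jobs` and the plain dict standing in for the global config (`platform_install_targets`) are assumptions made for the sketch, as are the function names.

```python
import sqlite3
from collections import defaultdict


def platforms_used(conn):
    """Return the distinct platform names recorded in task_jobs.

    Assumes a `platform_name` column (an assumption for this sketch).
    """
    rows = conn.execute(
        "SELECT DISTINCT platform_name FROM task_jobs ORDER BY platform_name"
    )
    return [row[0] for row in rows]


def install_target_map(platform_names, platform_install_targets):
    """Reduce platform names to {install_target: [platform, ...]}.

    `platform_install_targets` is a hypothetical stand-in for the global
    config: a dict of {platform_name: install_target}.
    """
    mapping = defaultdict(list)
    for name in platform_names:
        mapping[platform_install_targets[name]].append(name)
    return dict(mapping)


# An in-memory database stands in for the workflow database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_jobs (name TEXT, platform_name TEXT)")
conn.executemany(
    "INSERT INTO task_jobs VALUES (?, ?)",
    [("t1", "foo"), ("t2", "bar"), ("t3", "foo")],
)
used = platforms_used(conn)
targets = install_target_map(used, {"foo": "a", "bar": "a"})
print(targets)  # {'a': ['bar', 'foo']}
```

Both platforms collapse onto the single install target `a`, so only one clean would be needed for them.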
What if the global config has changed in between cylc run/cylc install and cylc clean, such that the install target for the workflow's particular platform is now different? This would mean the workflow dirs on the original install target won't get removed. Would it instead make sense to log the install target in the task_jobs table of the DB?
Would it instead make sense to log the install target in the task_jobs table of the DB?
If the install target has been changed for one platform then it will have been changed for all platforms (that used the same install target), so knowing what it was before the change won't be any help.
A few quick examples of platforms config and clean locations:
[platforms]
    [[foo]]
        install target = a
    [[bar]]
        install target = a
~Task ran on foo, attempt clean up on either foo or bar. If it fails due to SSH/network issues try other hosts in the foo platform else move on to bar.~
Scratch this, users might not have access to all platforms on an install target. This is a somewhat facetious case but it's simpler this way anyhoo.
[platforms]
    [[foo]]
        install target = a
    [[bar]]
        install target = a
Tasks ran on foo and bar, clean up on either foo or bar.
[platforms]
    [[localhost]]
        install target = localhost
    [[foo]]
        install target = localhost
If the task ran on platform foo, we use localhost to clean up.
[platforms]
    [[foo]]
If the task ran on bar, fail, we don't know what install target bar would have used.
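A rough sketch of the lookup behind these examples, including the failure case. The dict layout mirroring `[platforms]`, the exception name, and the fallback of an unset install target to the platform's own name are assumptions for illustration only.

```python
class UnknownPlatformError(Exception):
    """Raised when a platform recorded for a job is not in the config."""


def resolve_install_target(platform_name, platforms_config):
    """Return the install target for a platform, or raise if unknown.

    `platforms_config` is a hypothetical dict mirroring [platforms]:
    {platform_name: {"install target": ...}}. A platform with no explicit
    install target is assumed to default to its own name.
    """
    try:
        settings = platforms_config[platform_name]
    except KeyError:
        raise UnknownPlatformError(
            f"cannot determine install target for platform {platform_name!r}"
        ) from None
    return settings.get("install target") or platform_name


config = {
    "localhost": {"install target": "localhost"},
    "foo": {"install target": "localhost"},
    "baz": {},  # no explicit install target
}
print(resolve_install_target("foo", config))  # localhost
print(resolve_install_target("baz", config))  # baz
```

Asking for `bar` with this config raises `UnknownPlatformError`, which matches the "fail" case above.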
Separate Issue
Solve this post 8.0.0
https://github.com/cylc/cylc-flow/issues/3991
It should be possible to "retire" platforms in the config like so:
[platforms]
    [[foo]]
    [[bar]]
        hosts =  # empty host list, i.e. a platform with no nodes
        install target = foo
If the task ran on bar, use foo to clean up.
Batch operations on the same install target
[platforms]
    [[foo]]
        install target = a
    [[bar]]
        install target = a
Tasks ran on foo and bar, clean up on either foo or bar.
I'm not sure what "clean up on a platform" means; if the install target is the same, what difference does it make to "clean up on foo" or "clean up on bar"? Is it if things like the [platforms][X]host or [platforms][X]ssh command are different between foo and bar? And if so, you're saying it doesn't matter which one to use if the install targets are the same?
[platforms]
    [[foo]]
        install target = a
        hosts = foo
    [[bar]]
        install target = a
        hosts = bar
I'm not sure what "clean up on a platform" means
Ah, ok, I mean "pick a host from that platform then invoke the clean script on that platform over SSH".
what difference does it make to "clean up on foo" or "clean up on bar"
None whatsoever, which is the point. The important thing is that we only clean up on one of them rather than both.
And if so, you're saying it doesn't matter which one to use if the install targets are the same?
Yep, stuff like ssh command is used internally by Cylc to construct SSH commands, etc. This configuration is attached to the platform not the install target.
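To make the "only once per install target" idea concrete, here is a minimal sketch. It only considers platforms the workflow actually used, tries their hosts in turn, and stops at the first success. The `run_remote_clean` callable is a hypothetical stand-in for whatever invokes the clean over SSH, and `platform_hosts` is an assumed dict view of the platform config.

```python
def clean_install_targets(target_map, platform_hosts, run_remote_clean):
    """Run the clean once per install target, falling back across hosts.

    target_map:       {install_target: [platform, ...]}
    platform_hosts:   {platform: [host, ...]} (hypothetical config stand-in)
    run_remote_clean: callable(host) -> bool, e.g. a wrapper around SSH;
                      injected here so the sketch stays testable.
    Returns ({install_target: host_used}, [failed install targets]).
    """
    cleaned, failed = {}, []
    for target, platforms in target_map.items():
        for platform in platforms:
            # Try each host of this platform until one succeeds.
            host = next(
                (h for h in platform_hosts.get(platform, [])
                 if run_remote_clean(h)),
                None,
            )
            if host is not None:
                cleaned[target] = host
                break  # one successful clean per install target is enough
        else:
            failed.append(target)
    return cleaned, failed


attempts = []


def fake_clean(host):
    """Pretend that host 'foo' is unreachable."""
    attempts.append(host)
    return host != "foo"


cleaned, failed = clean_install_targets(
    {"a": ["foo", "bar"]}, {"foo": ["foo"], "bar": ["bar"]}, fake_clean
)
print(cleaned, failed, attempts)  # {'a': 'bar'} [] ['foo', 'bar']
```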
From the team meeting today, it sounds like we'll need a --force option. However, what exactly should that do? Either
(I just ran into a case where the workflow stopped responding, I did Ctrl+C, it said it shut down, but the contact file was left over so it looked like it was still running)
Remove the stopped workflow on the local filesystem even if an error occurs removing it on any remote platforms
^ That one
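One way to read that choice, as a sketch only: attempt the remote installs first, and let `force` decide whether a remote failure blocks the local removal. `remove_remote` and `remove_local` are hypothetical hooks, not existing Cylc functions.

```python
def clean(remove_remote, remove_local, force=False):
    """Remove remote installs first, then the local run directory.

    remove_remote: callable() -> list of install targets that failed
    remove_local:  callable() -> None
    Both are hypothetical hooks. Without `force`, any remote failure
    aborts before the local copy (and its database) is touched.
    """
    failed = remove_remote()
    if failed and not force:
        raise RuntimeError(f"remote clean failed on: {', '.join(failed)}")
    remove_local()
    return failed


local_removed = []
failed_targets = clean(
    remove_remote=lambda: ["a"],               # pretend target 'a' failed
    remove_local=lambda: local_removed.append(True),
    force=True,
)
print(failed_targets, local_removed)  # ['a'] [True]
```

Doing the remote step first means the local database is still available if the remote clean has to be retried.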
cylc clean should never attempt to remove running workflows.
What if the contact file is left over, but the workflow is actually stopped? Should I be using a more sophisticated method than suite_files.detect_old_contact_file()?
No, detect_old_contact_file is about as sophisticated a method as is possible!
It goes to the server the flow started on, queries the process ID and checks to ensure the command matches the one the flow was started with.
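The check described here boils down to "does this PID exist, and is it running the command we recorded?". A simplified, local-only sketch of that idea follows. It is not the actual `detect_old_contact_file` code, the function names are made up, and the real check runs on the host the workflow was started on.

```python
import subprocess


def ps_command(pid):
    """Return the command line of `pid` via `ps`, or None if it is gone."""
    proc = subprocess.run(
        ["ps", "-p", str(pid), "-o", "args="],
        capture_output=True, text=True, check=False,
    )
    out = proc.stdout.strip()
    return out if proc.returncode == 0 and out else None


def workflow_still_running(pid, expected_command, lookup=ps_command):
    """True if `pid` exists and its command matches the recorded one.

    `lookup` is injectable so the logic can be exercised without `ps`.
    """
    return lookup(pid) == expected_command


# A stale contact file: the PID is gone, or has been reused by another command.
print(workflow_still_running(123, "cylc run foo", lookup=lambda pid: None))
print(workflow_still_running(123, "cylc run foo", lookup=lambda pid: "vim"))
```

Both calls print `False`, so the leftover contact file would be treated as old.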
Ah wait, the bug I faced was #3994
I did Ctrl+C on the unresponsive workflow and it said Suite shutting down. However, when I did cylc clean, detect_old_contact_file() raised an error, saying the workflow was still running. The contact file was still there on remote. But cylc stop on both local and remote said the workflow was already stopped. Doing ps <pid> didn't show anything. So I had to ssh to the remote and delete the contact file.
Anyway, rebasing the topic branch onto master solved this.
If the user has multiple run dirs in a dir under cylc-run, e.g.
`-- badger
    |-- foo
    |   `-- flow.cylc etc
    |-- bar
    |   `-- flow.cylc etc
What should happen if they run cylc clean badger?
I'm guessing it will have to iterate over the subdirs to find run dirs, due to the fact that the run dirs may use symlink dirs, and the database needs to be looked up for remote installs.
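Something like the following could do the iteration. It is a sketch that treats any directory containing a `flow.cylc` file as a run dir and does not look inside one once found. The function name is made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def find_run_dirs(top):
    """Yield directories under `top` that contain a flow.cylc file.

    Does not descend into a run directory once one is found, and does
    not follow symlinks while walking.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        if "flow.cylc" in filenames:
            dirnames[:] = []  # prune: do not search inside a run dir
            yield Path(dirpath)
        else:
            dirnames.sort()  # deterministic order for the sketch


with tempfile.TemporaryDirectory() as tmp:
    for name in ("foo", "bar"):
        run_dir = Path(tmp, "badger", name)
        run_dir.mkdir(parents=True)
        (run_dir / "flow.cylc").touch()
    found = [
        p.relative_to(tmp).as_posix()
        for p in find_run_dirs(Path(tmp, "badger"))
    ]
print(found)  # ['badger/bar', 'badger/foo']
```

Each run dir found this way would then be cleaned individually, including its own database lookup for remote installs.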
We could decide to support only run dirs, at least initially. Consider a follow-up to handle nesting.
With the universal ID this sort of thing may become more implicit e.g:
$ cylc install --flow-name badger/foo
$ cylc run badger/bar
$ cylc trigger 'badger/*' //20000101T00Z/mytask
$ cylc clean badger/foo
$ cylc clean 'badger/*'
$ cylc stop '*'
$ cylc clean '*'
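As a rough idea of how such patterns might be expanded, here is a sketch using Python's `fnmatch`. This is an assumption about the mechanism, not the universal ID implementation, and `match_workflows` is a made-up name.

```python
from fnmatch import fnmatchcase


def match_workflows(pattern, workflow_names):
    """Return the workflow names matching a shell-style glob pattern.

    Note: fnmatch's `*` also matches `/`, so 'badger/*' would match
    deeper nesting such as 'badger/foo/run1' as well.
    """
    return [name for name in workflow_names if fnmatchcase(name, pattern)]


names = ["badger/foo", "badger/bar", "other"]
print(match_workflows("badger/*", names))  # ['badger/foo', 'badger/bar']
print(match_workflows("*", names))  # ['badger/foo', 'badger/bar', 'other']
```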
We could decide to support only run dirs, at least initially. Consider a follow-up to handle nesting.
What if the directory has had flow.cylc deleted, for example? And the user just wants to remove it anyway? I suppose removing it anyway could be part of the behaviour of --force later on.
A bit facetious, don't need to worry about that. If they delete the flow.cylc file then it is no longer a run directory managed by Cylc.
We wouldn't expect users to do much if any manual fiddling with the Cylc managed cylc-run directory and if they do they are responsible for managing this themselves.
As far as I can see, if you run cylc clean --local-only you remove your ability to subsequently remove non-local installs (since their locations are in a database cleaned by the first command).
Is this the case?
Is this desirable?
If this is the case I can see a couple of possible solutions:
- cylc clean --platform <name> (simple, but requires the user to know where they want to clean things from. Hopefully they'll know this from the suite definition).
- cylc clean --platforms-from-definition <path> - Pick up platforms used from flow.cylc. (Fails if definition has changed, but hopefully not in a problematic way - if an install target isn't being used it probably won't matter from a workflow point of view - users hitting space limits might disagree!)
- ... ~/cylc-run/.cleaned_flows/<flow-name>-<timestamp.db>. Perhaps include option --local-hard with existing behavior. (I don't actually like this, but it's a possibility).

Is this the case?
Yes.
Is this desirable?
Not quite, but also, if you don't want that to happen don't use --local-only.
cylc clean --platform
Currently toying with this along with other things in a cylc-admin proposal, opinions welcome but do note, it's a WIP and the document is laying out a rough plan for what could be implemented rather than what will be implemented (in order to ensure the interface is forward compatible).
--local-only would be shorthand for --platform '<scheduler>'.
Maybe --local-only should not be an option. What's the use case for local clean only, as opposed to clean everything?
Covered to some extent in this proposal - https://github.com/cylc/cylc-admin/pull/118
Examples:
Maybe --local-only should not be an option. What's the use case for local clean only, as opposed to clean everything?
Even if we don't offer --local-only publicly, it needs to be there for internal use - for running cylc clean --local-only my_workflow via ssh on the remote host. But I think
- Delete a workflow locally after remote clean failed.
is a pretty strong reason to keep it available publicly.
@dpmatthews suggested a possible:
... the thinking being that if a user does a partial clean and then restarts a workflow it's good to have some evidence of why things might not be working
As part of part 3 (targeted clean), I think that perhaps globs should not match the possible symlink dirs? E.g. if a user does cylc clean myflow --rm 'wo*', it should not remove the work directory, you would have to explicitly do --rm work.
Main reason I am asking is that it would make the implementation easier. Otherwise, as it stands, doing --rm 'wo*' removes the work symlink but not its target, whereas --rm work removes both.
Update: probably best thing to do is just rejig the logic so that --rm 'wo*' would remove the work symlink dir and its target (but not remove any targets of user-created symlinks)
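A sketch of the rejigged behaviour described in the update. The set of Cylc-managed symlink dir names is an assumption, the function name is made up, and only top-level matches are handled here.

```python
import glob
import os
import shutil
import tempfile
from pathlib import Path

# Assumed names of the symlink dirs that Cylc itself may create.
CYLC_SYMLINK_DIRS = {"log", "share", "work"}


def targeted_rm(run_dir, pattern):
    """Remove paths in `run_dir` matching `pattern`.

    For a Cylc-managed symlink dir, remove the target too; for any other
    symlink, remove only the link and leave its target alone.
    """
    for match in glob.glob(os.path.join(glob.escape(str(run_dir)), pattern)):
        path = Path(match)
        if path.is_symlink():
            target = path.resolve()
            path.unlink()
            if path.name in CYLC_SYMLINK_DIRS and target.is_dir():
                shutil.rmtree(target)
        elif path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    run_dir = tmp / "run"
    run_dir.mkdir()
    work_target = tmp / "scratch_work"
    work_target.mkdir()
    user_target = tmp / "user_data"
    user_target.mkdir()
    (run_dir / "work").symlink_to(work_target)    # Cylc-managed symlink dir
    (run_dir / "wombat").symlink_to(user_target)  # user-created symlink
    targeted_rm(run_dir, "wo*")
    remaining = sorted(p.name for p in tmp.iterdir())
    left_in_run_dir = sorted(p.name for p in run_dir.iterdir())
print(remaining)  # ['run', 'user_data']
```

Both `work` and `wombat` match `wo*` and are removed from the run dir, but only the target of `work` is deleted.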
The important 8.0.0 tasks have been completed pending documented follow-up issues.
Bumping the remainder of this issue back to 8.x.