Ref: https://groups.google.com/forum/#!topic/cylc/dFVNTeyPcrs
We could make Cylc aware of the concept of a group of HPC login nodes. If the original job submit node goes down, we could try the alternative node(s) in the event that a poll or kill command fails with host not found.
We need to have a concept similar to rose host-select in Cylc, but geared towards handling login nodes of clusters. (rose host-select was designed to be a poor-person's load balancing system for a group of similar compute servers. We no longer have this requirement at our site, but we still have a requirement to randomly select login nodes of clusters these days. Clearly, this requirement should be met by better DNS routing on sites, but this is not always the case.)
I think we can do something like this:
In global.rc, we'll have a [host-groups] section. Each entry will have something like host-group=host1, host2, .... Each host group can be assumed to share the same file system, batch system, etc.
In suite.rc, we'll allow [runtime][TASK][remote]host=HOST-GROUP. A random host in the specified host group will be used for job submission, poll, kill, log retrieval, etc.
(Promoting the milestone and self assigned, to avoid this being lost in the ether.)
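A minimal sketch of the host-group lookup proposed above, not actual Cylc code. The `HOST_GROUPS` dictionary stands in for a parsed [host-groups] section, and the names `select_host` and `hpc-login` are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical stand-in for a parsed [host-groups] section of global.rc.
HOST_GROUPS = {
    "hpc-login": ["host1", "host2", "host3"],
}


def select_host(name, host_groups=HOST_GROUPS, rng=random):
    """Return a random member if `name` is a host group, else `name` itself."""
    hosts = host_groups.get(name)
    if not hosts:
        # Not a known group: treat the value as a plain host name.
        return name
    return rng.choice(hosts)
```

With this shape, a task setting such as [runtime][TASK][remote]host=hpc-login would resolve to one of the group's members each time a host is needed, while ordinary host names pass through unchanged.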
@matthewrmshin - your proposal sounds good, but you haven't explicitly addressed what to do if the (randomly) chosen host goes down. Presumably (as I suggested above) we'd need a retry-via-other-host mechanism to handle poll and kill (etc.) failures due to the target host going offline? Of course this would only work for jobs submitted to a batch scheduler (background jobs running on a particular login node are just screwed if the node goes down).
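The retry-via-other-host idea could look roughly like the sketch below. It is an illustration under assumptions, not Cylc's implementation: `HostUnreachable`, `run_with_failover`, and the `run_command` callable are hypothetical names, and how a "host not found" failure is actually detected is left to the caller.

```python
import random


class HostUnreachable(Exception):
    """Raised by a command runner when the target host cannot be contacted."""


def run_with_failover(hosts, run_command, rng=random):
    """Try run_command(host) on each host of a group, in random order.

    Returns (host, result) for the first host that responds.
    Raises HostUnreachable if every host in the group is down.
    """
    candidates = list(hosts)
    rng.shuffle(candidates)
    failed = []
    for host in candidates:
        try:
            return host, run_command(host)
        except HostUnreachable:
            # This host is down: remember it and try the next one.
            failed.append(host)
    raise HostUnreachable("no reachable host among: " + ", ".join(failed))
```

As noted above, this only helps for jobs submitted to a batch scheduler, where any login node in the group can poll or kill the job.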
OK. We'll make sure to consider:
Change of title to allow a more general discussion of compute cluster support. (The suite host may be part of the cluster, so it is not limited to handling of login nodes.)
I can now see that global.rc should have a new clusters section that will mostly supersede the current hosts section.
# global.rc
[clusters] # platforms?
    [[spicy]]
        login hosts = peppercorn, clove, cinnamon, fennel, star-anise
        batch system = slurm
        # and pretty much everything under a host subsection in the hosts section
    [[hedge-pea-sea]]
        login hosts = localhost
        batch system = pbs
        # and so on
With clusters, I think the following may also be relevant:
Somewhat related to #2144 and #2528. A recent unexpected outage meant that jobs were drained from the cluster while it remained down for an extended period of time. Suites were unable to poll or kill submitted/running tasks on the cluster. It would be nice if:
This issue has superseded #2144, absorbing the hold by host/cluster feature request.
See also this discussion: https://groups.google.com/forum/?fromgroups=#!topic/cylc/KoFhCGurLTo. We should also consider the ability to configure cluster-specific environment variables or even extra custom logic.
Having started to consider this I think that there are some issues with describing all possible job hosts as "clusters". I have come to the view that I prefer the phrase "job platforms" which doesn't imply anything about whether we are running our jobs on a raspi0, or a desktop, or a cray.
@hjoliver Can we close this issue?
I think the only outstanding related issue is #3827.
Yes, good. Thanks for the reminder @wxtim