Cylc-flow: Compute cluster or site platform awareness

Created on 9 Mar 2017 · 11 Comments · Source: cylc/cylc-flow

Ref: https://groups.google.com/forum/#!topic/cylc/dFVNTeyPcrs

We could make Cylc aware of the concept of a group of HPC login nodes. If the original job submit node goes down, we could try the alternative node(s) when a poll or kill command fails with a "host not found" error.

efficiency

hjoliver

All 11 comments

We need a concept in Cylc similar to rose host-select, but geared towards handling the login nodes of clusters. (rose host-select was designed as a poor person's load-balancing system for a group of similar compute servers. We no longer have that requirement at our site, but we do still need to select cluster login nodes at random. Clearly, this should be handled by better DNS routing at each site, but that is not always the case.)

I think we can do something like this:

In the global.rc, we'll have a [host-groups] section. Each entry will have something like host-group=host1, host2, .... Each host group can be assumed to share the same file system, batch system, etc.
In the suite.rc, we'll allow [runtime][TASK][remote]host=HOST-GROUP. A random host in the specified host group will be used for job submission, poll, kill, log retrieval, etc.
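The host-group lookup described above might be sketched roughly as follows. This is only an illustration, not Cylc code: the `HOST_GROUPS` mapping stands in for a parsed [host-groups] section, and the group name, host names, and the `resolve_host` function are all hypothetical.

```python
import random

# Hypothetical stand-in for a parsed [host-groups] section of global.rc.
HOST_GROUPS = {
    "hpc-login": ["login1", "login2", "login3"],
}


def resolve_host(name, host_groups=HOST_GROUPS, rng=random):
    """Return a concrete host for a [runtime][TASK][remote]host value.

    If name is a known host group, pick one member at random.
    Otherwise treat name as an ordinary host name and return it unchanged.
    """
    members = host_groups.get(name)
    if not members:
        return name
    return rng.choice(members)
```

Calling `resolve_host("hpc-login")` would return one of the three hypothetical login hosts, while a plain host name passes through untouched, so existing suite.rc files that name a single host would keep working.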

matthewrmshin on 10 Mar 2017

(Promoting the milestone and self assigned, to avoid this being lost in the ether.)

matthewrmshin on 10 Mar 2017

@matthewrmshin - your proposal sounds good, but you haven't explicitly addressed what to do if the (randomly) chosen host goes down. Presumably (as I suggested above) we'd need a retry-via-other-host mechanism to handle poll and kill (etc.) failures due to the target host going offline? Of course this would only work for jobs submitted to a batch scheduler (background jobs running on a particular login node are just screwed if the node goes down).

hjoliver on 11 Mar 2017

OK. We'll make sure to consider:

A host group only makes sense for a relevant batch scheduler. This must be configurable.
We will choose a random available host in the host group for job submit, poll and kill. If one host is unavailable, we'll pick the next one in the randomised list, until exhausted.
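A minimal sketch of the "next host in the randomised list, until exhausted" behaviour is given here, assuming that a failed submit, poll or kill surfaces as an exception. The `HostUnavailableError` class and the `run_on_host_group` function are hypothetical names used for illustration; they are not part of Cylc.

```python
import random


class HostUnavailableError(Exception):
    """Hypothetical error raised when a command cannot reach a host."""


def run_on_host_group(hosts, command, rng=random):
    """Try command(host) on each host, in random order, until one succeeds.

    Returns (host, result) for the first host that works. Raises
    HostUnavailableError if every host in the group fails.
    """
    candidates = list(hosts)  # copy, so the caller's list is not reordered
    rng.shuffle(candidates)
    failures = []
    for host in candidates:
        try:
            return host, command(host)
        except HostUnavailableError as exc:
            failures.append((host, str(exc)))
    raise HostUnavailableError(
        "all hosts in group unavailable: %r" % (failures,)
    )
```

Only errors that mean "this host cannot be reached" trigger a retry in this sketch. A genuine batch system failure should propagate at once, because repeating it on another login node would not help.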

matthewrmshin on 26 Apr 2017

Change of title to allow a more general discussion of compute cluster support. (The suite host may be part of the cluster, so it is not limited to handling of login nodes.)

I can now see that global.rc should have a new clusters section that will mostly supersede the current hosts section.

# global.rc
[clusters]  # platforms?
    [[spicy]]
        login hosts = peppercorn, clove, cinnamon, fennel, star-anise
        batch system = slurm
        # and pretty much everything under a host subsection in the hosts section
    [[hedge-pea-sea]]
        login hosts = localhost
        batch system = pbs
        # and so on

With clusters, I think the following may also be relevant:

Custom job management (e.g. submit, poll, kill) commands.
List of file systems that are shared with the suite host? And other clusters?
A URL for checking the status of the cluster?
Custom logic to invoke for collecting job accounting information when a job completes?
Batch scheduler directives that should be added to all jobs?
Number of jobs a user can submit to a cluster at a given time.
Hold all tasks that target a cluster. E.g. cluster is scheduled for an outage. #2144.
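Two of the items above, the per-user job limit and the cluster-wide hold, could be combined into a single pre-submission check. This is a rough sketch under assumed names: `ClusterState`, its fields, and `may_submit` are hypothetical and do not correspond to any existing Cylc setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterState:
    """Hypothetical runtime view of one entry in the clusters section."""

    name: str
    job_limit: Optional[int] = None  # None means no limit is configured
    held: bool = False               # True while the cluster is held
    active_jobs: int = 0             # jobs currently submitted or running


def may_submit(cluster):
    """Return True if a new job may be submitted to this cluster now."""
    if cluster.held:
        return False
    if cluster.job_limit is not None and cluster.active_jobs >= cluster.job_limit:
        return False
    return True
```

With a check like this, holding a cluster ahead of a scheduled outage would simply stop new submissions, and tasks that target it would wait until the hold is released.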

matthewrmshin on 20 Oct 2017

Somewhat related to #2144 and #2528. A recent unexpected outage meant that jobs were drained from the cluster while it remained down for an extended period of time. Suites were unable to poll or kill submitted/running tasks on the cluster. It would be nice if:

Users are able to reset all jobs submitted to a cluster in a single command.
Suites are able to detect this automatically (via a site setting?) and are then able to reset all affected tasks to go into statuses like submit-failed. (See also #2394.)
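The bulk reset requested above might look something like the following. It is a sketch only: the `TaskJob` record, the set of active statuses, and the `reset_jobs_on_cluster` function are assumptions made for illustration, not Cylc's task pool API.

```python
from dataclasses import dataclass


@dataclass
class TaskJob:
    """Hypothetical record of one job known to the suite."""

    name: str
    cluster: str
    status: str


# Statuses assumed to mean the job is still on the cluster.
ACTIVE_STATUSES = {"submitted", "running"}


def reset_jobs_on_cluster(jobs, cluster, new_status="submit-failed"):
    """Reset every active job that targets cluster; return the names reset."""
    reset = []
    for job in jobs:
        if job.cluster == cluster and job.status in ACTIVE_STATUSES:
            job.status = new_status
            reset.append(job.name)
    return reset
```

The same function could be driven by a user command or by an automatic check of a site-provided cluster status, which covers both of the cases listed above.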

matthewrmshin on 25 Jan 2018

This issue has superseded #2144, absorbing the hold by host/cluster feature request.

hjoliver on 25 Jan 2018

See also this discussion: https://groups.google.com/forum/?fromgroups=#!topic/cylc/KoFhCGurLTo. We should also consider the ability to configure cluster-specific environment variables or even extra custom logic.

matthewrmshin on 1 Feb 2019

Having started to consider this, I think there are some issues with describing all possible job hosts as "clusters". I have come to prefer the phrase "job platforms", which doesn't imply anything about whether we are running our jobs on a raspi0, a desktop, or a Cray.