Cylc-flow: Support Slurm Heterogeneous Jobs

Created on 26 Nov 2020 · 12 Comments · Source: cylc/cylc-flow

Context
Slurm 17.11 introduced the concept of heterogeneous jobs (a.k.a. "hetjobs"), where multiple sections of the same job can be run with different directives, such as:

#SBATCH --nodes=28
#SBATCH hetjob
#SBATCH --nodes=7
#SBATCH hetjob
#SBATCH --nodes=1

This cannot currently be done in Cylc due to:

  • Parsec overriding duplicate directive keys.
  • Cylc job tracking logic (aka cylc message).
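The first point can be illustrated with a minimal sketch. This is not Parsec code; the two helper functions are hypothetical and only show why a key-to-value mapping loses repeated directives while an ordered list of pairs keeps them:

```python
def parse_as_mapping(lines):
    """Parse 'key = value' lines into a dict; later duplicates override earlier ones."""
    settings = {}
    for line in lines:
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def parse_as_pairs(lines):
    """Parse 'key = value' lines into an ordered list, keeping duplicates."""
    pairs = []
    for line in lines:
        key, _, value = line.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


directive_lines = ["--nodes = 28", "--nodes = 7", "--nodes = 1"]
as_mapping = parse_as_mapping(directive_lines)  # only the last --nodes survives
as_pairs = parse_as_pairs(directive_lines)      # all three --nodes entries survive
```

A heterogeneous job needs the second behaviour, because each component repeats the same directive keys.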

See proposed implementation outline - https://github.com/cylc/cylc-flow/issues/3964#issuecomment-734361717

Question:

  • [ ] Are we happy with the proposed implementation outline below?

Pull requests welcome!
This is an Open Source project - please consider contributing code yourself
(please read CONTRIBUTING.md before starting any work though).

All 12 comments

[edit] read this comment first https://github.com/cylc/cylc-flow/issues/3964#issuecomment-734361717

I don't quite understand the purpose of Slurm heterogeneous jobs:

Are they:

  1. A convenient way to submit/manage multiple jobs in a single submission.

    • If so then this is really a misuse of Cylc which defeats the task-granularity which Cylc provides.

    • Cylc currently has batching logic which enables it to perform job-submission and other job management for multiple jobs simultaneously in an efficient manner.

    • Cylc currently has semantics for CLI/GUI management of multiple tasks (e.g. hold, kill, trigger can all take multiple arguments and use globbing for task names).

  2. A way to run assemblages of systems on different nodes with different requirements (similar to cloud architectures).

    • If so then there are some legitimate use cases, however, they are not the best fit for Cylc.

    • This would require extensive refactoring of Cylc internals which will not be possible until Cylc9.



      • The rules that translate job states into task states would be different.

      • The job IDs would need to contain the subjob e.g. 1.1, 1.2, ...

      • The resubmission logic may require evaluation.

From the configuration perspective this is perfectly do-able:

We already have logic in parsec to allow duplicate settings/sections to combine rather than override (e.g. graph recurrence sections).

So that supporting this doesn't break existing workflows (which may rely on directives overriding through the inheritance hierarchy) we would need to create a new batch submission system called something like "slurm_hetro".

From the task/job model this is not currently possible:

We currently have a 1:1 model between tasks and active jobs. If a task has more than one active job, Cylc will consider that an error and would not permit the earlier job to affect the status of the task.

Looking forward to Cylc9, we plan to break the 1:1 model to allow more exotic functionality, e.g. batching multiple tasks into a single job submission (many:1), so this sort of functionality could fit in with those changes, i.e. 1:many (and, by extension, the much more confusing many:many).

Ok, mulled it over, here are my thoughts:

  • Heterogeneous jobs are a perfectly valid way of running assemblages of components, potentially across nodes/clusters, with good use cases.
  • Slurm gives each subjob its own unique ID.
  • However, it provides us with a single job ID representing the group as a whole for monitoring/control, so we can treat this as a single job in Cylc.
  • Cylc cannot handle multiple subjobs communicating back to the scheduler and trying to take ownership of the status file.

So to implement support in Cylc I suggest:

  • Create a new batch submission system called "slurm_hetero" or something like that.

    • We will need to change the parsec rules for this section to allow redefinition of settings as we do for graph recurrences.

    • We should not extend the "slurm" batch system as this will likely break existing workflows which rely on inheritance overrides.

  • Use the first subjob in the heterogeneous job as the "lead subjob".

    • All subjobs are executed using the same script, the Cylc jobscript.

    • However, we can tell them apart using environment variables.

    • The lead subjob should be responsible for communicating back to the scheduler, updating the status file, etc.

    • We should disable the built-in cylc message calls for the other subjobs (but consider allowing custom messaging).

    • This will require adding some minimal logic to the Cylc job script. This is a bit ugly as we have historically kept specific system support isolated, however, a single env var won't hurt too much...

  • Assume subjobs are homogeneous in execution time.

    • If the user configures an execution time limit, apply it to all subjobs in the heterogeneous submission.
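The "lead subjob" idea above could be sketched as follows. This is only an illustration of the proposal, not Cylc code: the function is hypothetical, and `HET_COMPONENT_INDEX` is a placeholder name for whatever environment variable would identify the subjob (the real variable name depends on the Slurm version and is not confirmed here).

```python
import os


def is_lead_subjob(environ=None, index_var="HET_COMPONENT_INDEX"):
    """Return True if this process should report back to the scheduler.

    `index_var` is a placeholder (assumption): substitute the variable
    that identifies the subjob on your system. If the variable is unset,
    the job is treated as an ordinary (non-heterogeneous) job and
    therefore as its own lead.
    """
    environ = os.environ if environ is None else environ
    return environ.get(index_var, "0") == "0"
```

Under this sketch, only the subjob with index 0 would run the built-in messaging and status-file updates; the others would skip them.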

So to implement support in Cylc I suggest:

Sounds good to me :+1: . That would allow a user with existing slurm heterogeneous jobs to migrate them over to Cylc without having to break them down into cylc tasks (which s/he could do later too). Furthermore, we may have the same behaviour with other batch systems, or docker-compose/swarm/k8s/cloud providers/etc.

Note that slightly older versions of Slurm use #SBATCH packjob instead of #SBATCH hetjob.

Cylc cannot handle multiple subjobs communicating back to the scheduler and trying to take ownership of the status file.

I don't think that's gonna happen, at least not for the example given on the forum.

srun --het-group=0 --hint=nomultithread <snip> : \
    --het-group=1 --cpu-bind=cores <snip> : \
    --het-group=2 --cpu-bind=cores <snip> xios.x

If I understand correctly that's a single (albeit heterogeneous) job step that looks like a single process to the batch script. It runs xios.x three times concurrently, with the different directives, and does not exit until all instances succeed (exit 0) or at least one instance fails (exit 1). If one fails, the whole thing gets killed immediately (at least that's what I observe).

The component jobs are given different step IDs at run time e.g. 6485661+0 and 6485661+1 etc. (But they all have SLURM_STEP_ID=0 inside the executing processes, in my environment). So maybe they are technically different "job steps" (normally each use of srun is a step, I think) or not. But regardless, it doesn't matter because they're all just concurrent processes inside (conceptually) the cylc job script. We don't need to somehow run the whole cylc job script three times (is that what you're suggesting?). We just need to allow the duplicate directives to get through to the job script in the right way.

(I don't think running MPI jobs would change this, would it? C.f. running a single, non-heterogeneous MPI job.)

Here's a bodge to make it work in current 7.8.x:

$ diff lib/cylc/batch_sys_handlers/slurm.py  lib/cylc/batch_sys_handlers/slurm_hetero.py
34a35
>     REC_HETJOB = re.compile(r"^hetjob_(\d+)_")
59a61
>         seen = set()
60a63,72
>             m = cls.REC_HETJOB.match(key)
>             if m:
>                 n = m.groups()[0]
>                 if n != "0" and n not in seen:
>                     lines.append("#SBATCH packjob")  # should be SBATCH hetjob in newer Slurm?
>                 seen.add(n)
>                 newkey = cls.REC_HETJOB.sub('', key)
>             else:
>                 newkey = key
> 
62c74
<                 lines.append("%s%s=%s" % (cls.DIRECTIVE_PREFIX, key, value))
---
>                 lines.append("%s%s=%s" % (cls.DIRECTIVE_PREFIX, newkey, value))
64c76
<                 lines.append("%s%s" % (cls.DIRECTIVE_PREFIX, key))
---
>                 lines.append("%s%s" % (cls.DIRECTIVE_PREFIX, newkey))

Test suite:

[scheduling]
   [[dependencies]]
      graph = foo
[runtime]
   [[foo]]
      script = """
         srun /home/oliverh/bin/hello 3 : /home/oliverh/bin/hello 10
      """
      [[[job]]]
         batch system = slurm_hetero
      [[[directives]]]
         --job-name = hetero
         hetjob_0_--mem = 1G  # "hetjob_n_" prefixes removed by the slurm_hetero handler
         hetjob_1_--mem = 3G

Resulting job script:

#!/bin/bash -l
#
# ++++ THIS IS A CYLC TASK JOB SCRIPT ++++
# Suite: foo
# Task: foo.1
# Job log directory: 1/foo/01
# Job submit method: slurm_hetero

# DIRECTIVES:
#SBATCH --job-name=hetero
#SBATCH --output=/home/oliverh/cylc-run/foo/log/job/1/foo/01/job.out
#SBATCH --error=/home/oliverh/cylc-run/foo/log/job/1/foo/01/job.err
#SBATCH --mem=1G
#SBATCH packjob
#SBATCH --mem=3G

# <SNIP>

cylc__job__inst__script() {
# SCRIPT:
srun /home/oliverh/bin/hello 3 : /home/oliverh/bin/hello 10
}

# <SNIP>

#EOF: 1/foo/01

The suite runs fine even if one of the heterogeneous "steps" fails.
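The directive rewriting done by the slurm_hetero bodge above can be sketched as a standalone function, outside the handler class. The regular expression and the two formatting branches are taken from the diff; the function name, the `separator` parameter, and the assumption that the directive prefix is "#SBATCH " (inferred from the resulting job script) are illustrative only.

```python
import re

REC_HETJOB = re.compile(r"^hetjob_(\d+)_")
DIRECTIVE_PREFIX = "#SBATCH "  # assumed from the job script shown above


def format_directives(directives, separator="hetjob"):
    """Turn ordered (key, value) pairs into #SBATCH lines.

    Keys prefixed "hetjob_N_" belong to component N. The prefix is
    stripped, and a separator line is emitted the first time a component
    other than 0 is seen. Directives are assumed to be grouped by
    component, in order.
    """
    lines = []
    seen = set()
    for key, value in directives:
        match = REC_HETJOB.match(key)
        if match:
            component = match.group(1)
            if component != "0" and component not in seen:
                lines.append(DIRECTIVE_PREFIX + separator)
            seen.add(component)
            key = REC_HETJOB.sub("", key)
        if value:
            lines.append("%s%s=%s" % (DIRECTIVE_PREFIX, key, value))
        else:
            lines.append("%s%s" % (DIRECTIVE_PREFIX, key))
    return lines


example = format_directives(
    [("--job-name", "hetero"), ("hetjob_0_--mem", "1G"), ("hetjob_1_--mem", "3G")],
    separator="packjob",
)
```

With the directives from the test suite, `example` holds the same four directive lines as the job script above (excluding the --output and --error lines that Cylc adds itself).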

The only(?) problem with my approach, perhaps, is the user has to write the srun command line in the job scripting rather than having Cylc do it for them. But that seems to be what Jeff C wants (in his forum post), and maybe it's reasonable for advanced slurm usage anyway. But I'm sure we could make the slurm_hetero handler do that as well. (Do all the srun CLI options correspond to #SBATCH directives?)

OR ... have I missed something?

We don't need to somehow run the whole cylc job script three times (is that what you're suggesting?). We just need to allow the duplicate directives to get through to the job script in the right way.

That's what I understood from Oliver's comment. We would just send the heterogeneous job to Slurm, monitor the top-level PID, and fail if that PID failed.

The suite runs fine even if one of the heterogeneous "steps" fails.

That means that the Slurm job that the suite is monitoring didn't fail, even if one of the slurm heterogeneous steps failed? If so, that sounds OK to me. There must be ways for a user to change that behaviour in Slurm, I think.

The suite runs fine even if one of the heterogeneous "steps" fails.

That means that the Slurm job that the suite is monitoring didn't fail, even if one of the slurm heterogeneous steps failed?

No, sorry, I just meant the suite responds as normal to a failed task. Given that one hetjob failure results in them all getting killed by Slurm (evidently).

The only(?) problem with my approach, perhaps, is the user has to write the srun command line in the job scripting rather than having Cylc do it for them.

The issue has arisen from running coupled UM jobs where the srun command is created in the driver scripts, so for us this is perfect. Thanks.

Thanks for the quick workaround, I can confirm that this enables me to submit heterogeneous jobs and run a MPMD coupled model.

Slurm 17.11 introduced the concept of heterogeneous jobs (a.k.a. "hetjobs")

The associated directives, options, and variables originally used the name "packjob". Slurm 20-02-0-1 did (or completed?) a conversion to "hetjob".

From Slurm-20-02-0-1 RELEASE_NOTES:

 -- The inconsistent terminology and environment variable naming for
    Heterogeneous Job ("HetJob") support has been tidied up.
 -- The correct term for these jobs are "HetJobs", references to "PackJob"
    have been corrected.
 -- The correct term for the separate constituent jobs are "components",
    references to "packs" have been corrected.
 -- Output from 'scontrol show job' and others has been made consistent
    with this.
 -- Relevant environment variables are all of the form SLURM_HET_JOB_*.
    (Old forms are still supported for this release, but may be removed
    in the future.)
 -- slurm.spec - override "hardening" linker flags to ensure RHEL8 builds
    in a usable manner.
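A handler that has to work across Slurm versions could pick the separator keyword from the version number. The sketch below is hypothetical: the function name is made up, and the 20.02 cut-off is an assumption based only on the release notes quoted above, so check it against the Slurm documentation for your site.

```python
def hetjob_separator(slurm_version):
    """Return the #SBATCH separator keyword for a (major, minor) version tuple.

    Assumption: "hetjob" from Slurm 20.02 onwards, "packjob" before that.
    """
    return "hetjob" if tuple(slurm_version[:2]) >= (20, 2) else "packjob"
```

For example, `hetjob_separator((19, 5))` gives "packjob" and `hetjob_separator((20, 2))` gives "hetjob".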
