Context
Slurm 17.11 introduced the concept of heterogeneous jobs (a.k.a. "hetjobs"), where multiple sections of the same job can be run with different directives, such as:
#SBATCH --nodes=28
#SBATCH hetjob
#SBATCH --nodes=7
#SBATCH hetjob
#SBATCH --nodes=1
This cannot currently be done in Cylc due to:
cylc message).
See proposed implementation outline: https://github.com/cylc/cylc-flow/issues/3964#issuecomment-734361717
Question:
I don't quite understand the purpose of Slurm heterogeneous jobs:
Are they:
1.1, 1.2, ...
We already have logic in parsec to allow duplicate settings/sections to combine rather than override (e.g. graph recurrence sections).
So that supporting this doesn't break existing workflows (which may rely on directives overriding through the inheritance hierarchy) we would need to create a new batch submission system called something like "slurm_hetro".
We currently have a 1:1 model between tasks and active jobs. If a task has more than one active job, Cylc will consider that an error and would not permit the earlier job to affect the status of the task.
Looking forward to Cylc 9, we plan to break the 1:1 model to allow more exotic functionality, e.g. batching multiple tasks into a single job submission (many:1). This sort of functionality could fit in with those changes, i.e. 1:many (and, by extension, the much more confusing many:many).
Ok, mulled it over, here are my thoughts:
So to implement support in Cylc I suggest:
cylc message calls for the other subjobs (but consider allowing custom messaging).
execution time limit apply it to all subjobs in the heterogeneous submission.
So to implement support in Cylc I suggest:
Sounds good to me :+1: . That would allow a user with existing slurm heterogeneous jobs to migrate them over to Cylc without having to break them down into cylc tasks (which s/he could do later too). Furthermore, we may have the same behaviour with other batch systems, or docker-compose/swarm/k8s/cloud providers/etc.
Note: slightly older versions of Slurm use #SBATCH packjob instead of #SBATCH hetjob.
Cylc cannot handle multiple subjobs communicating back to the scheduler and trying to take ownership of the status file.
I don't think that's gonna happen, at least not for the example given on the forum.
srun --het-group=0 --hint=nomultithread <snip> : \
--het-group=1 --cpu-bind=cores <snip> : \
--het-group=2 --cpu-bind=cores <snip> xios.x
If I understand correctly that's a single (albeit heterogeneous) job step that looks like a single process to the batch script. It runs xios.x three times concurrently, with the different directives, and does not exit until all instances succeed (exit 0) or at least one instance fails (exit 1). If one fails, the whole thing gets killed immediately (at least that's what I observe).
The component jobs are given different step IDs at run time e.g. 6485661+0 and 6485661+1 etc. (But they all have SLURM_STEP_ID=0 inside the executing processes, in my environment). So maybe they are technically different "job steps" (normally each use of srun is a step, I think) or not. But regardless, it doesn't matter because they're all just concurrent processes inside (conceptually) the cylc job script. We don't need to somehow run the whole cylc job script three times (is that what you're suggesting?). We just need to allow the duplicate directives to get through to the job script in the right way.
(I don't think running MPI jobs would change this, would it? Cf. running a single, non-heterogeneous MPI job.)
Here's a bodge to make it work in current 7.8.x:
$ diff lib/cylc/batch_sys_handlers/slurm.py lib/cylc/batch_sys_handlers/slurm_hetero.py
34a35
>     REC_HETJOB = re.compile(r"^hetjob_(\d+)_")
59a61
>         seen = set()
60a63,72
>             m = cls.REC_HETJOB.match(key)
>             if m:
>                 n = m.groups()[0]
>                 if n != "0" and n not in seen:
>                     lines.append("#SBATCH packjob")  # should be SBATCH hetjob in newer Slurm?
>                     seen.add(n)
>                 newkey = cls.REC_HETJOB.sub('', key)
>             else:
>                 newkey = key
>
62c74
<                 lines.append("%s%s=%s" % (cls.DIRECTIVE_PREFIX, key, value))
---
>                 lines.append("%s%s=%s" % (cls.DIRECTIVE_PREFIX, newkey, value))
64c76
<                 lines.append("%s%s" % (cls.DIRECTIVE_PREFIX, key))
---
>                 lines.append("%s%s" % (cls.DIRECTIVE_PREFIX, newkey))
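For anyone who wants to try the prefix logic outside Cylc, here is a minimal standalone sketch of what the patch above does. It is not the actual handler: `format_directives` here is a plain function rather than the handler's method, the `separator` argument is my own addition for illustration, and it leaves out the --output/--error/--time handling that the real handler performs.

```python
import re

DIRECTIVE_PREFIX = "#SBATCH "
# Matches the "hetjob_<n>_" prefix used in the [[[directives]]] section.
REC_HETJOB = re.compile(r"^hetjob_(\d+)_")


def format_directives(directives, separator="packjob"):
    """Return #SBATCH lines, inserting a separator before each new component.

    `directives` is an ordered mapping of directive name to value.
    `separator` is a hypothetical parameter: "packjob" or "hetjob".
    """
    lines = []
    seen = set()
    for key, value in directives.items():
        match = REC_HETJOB.match(key)
        if match:
            component = match.group(1)
            # Component 0 needs no separator; later ones get exactly one.
            if component != "0" and component not in seen:
                lines.append(DIRECTIVE_PREFIX + separator)
                seen.add(component)
            key = REC_HETJOB.sub("", key)
        if value:
            lines.append("%s%s=%s" % (DIRECTIVE_PREFIX, key, value))
        else:
            lines.append("%s%s" % (DIRECTIVE_PREFIX, key))
    return lines


example = {
    "--job-name": "hetero",
    "hetjob_0_--mem": "1G",
    "hetjob_1_--mem": "3G",
}
result = format_directives(example)
print("\n".join(result))
```

Note that this relies on the directives for each component being listed together and in order, since the separator is emitted the first time a new component number is seen.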
Test suite:
[scheduling]
    [[dependencies]]
        graph = foo
[runtime]
    [[foo]]
        script = """
            srun /home/oliverh/bin/hello 3 : /home/oliverh/bin/hello 10
        """
        [[[job]]]
            batch system = slurm_hetero
        [[[directives]]]
            --job-name = hetero
            hetjob_0_--mem = 1G  # "hetjob_n_" prefixes removed by the slurm_hetero handler
            hetjob_1_--mem = 3G
Resulting job script:
#!/bin/bash -l
#
# ++++ THIS IS A CYLC TASK JOB SCRIPT ++++
# Suite: foo
# Task: foo.1
# Job log directory: 1/foo/01
# Job submit method: slurm_hetero
# DIRECTIVES:
#SBATCH --job-name=hetero
#SBATCH --output=/home/oliverh/cylc-run/foo/log/job/1/foo/01/job.out
#SBATCH --error=/home/oliverh/cylc-run/foo/log/job/1/foo/01/job.err
#SBATCH --mem=1G
#SBATCH packjob
#SBATCH --mem=3G
# <SNIP>
cylc__job__inst__script() {
# SCRIPT:
srun /home/oliverh/bin/hello 3 : /home/oliverh/bin/hello 10
}
# <SNIP>
#EOF: 1/foo/01
The suite runs fine even if one of the heterogeneous "steps" fails.
The only(?) problem with my approach, perhaps, is the user has to write the srun command line in the job scripting rather than having Cylc do it for them. But that seems to be what Jeff C wants (in his forum post), and maybe it's reasonable for advanced slurm usage anyway. But I'm sure we could make the slurm_hetero handler do that as well. (Do all the srun CLI options correspond to #SBATCH directives?)
OR ... have I missed something?
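If we did want the handler to write the srun line, a rough sketch of assembling the colon-separated command could look like the following. Everything here is hypothetical: `build_srun_command` and its arguments are names made up for illustration, and it simply passes per-component options through without checking whether they are valid for srun. It assumes the --het-group option shown in the srun example earlier in this thread; older Slurm releases are understood to have used a different option name, which is why it is a parameter.

```python
import shlex


def build_srun_command(components, group_option="--het-group"):
    """Build a single heterogeneous srun command line.

    `components` is a list of (options, command) pairs, each a list of
    strings, one pair per het group in order. Hypothetical helper.
    """
    parts = []
    for index, (options, command) in enumerate(components):
        words = ["%s=%d" % (group_option, index)]
        # Quote each word so arguments with spaces survive the shell.
        words += [shlex.quote(word) for word in options]
        words += [shlex.quote(word) for word in command]
        parts.append(" ".join(words))
    return "srun " + " : ".join(parts)


command_line = build_srun_command([
    ([], ["/home/oliverh/bin/hello", "3"]),
    ([], ["/home/oliverh/bin/hello", "10"]),
])
print(command_line)
```

For the coupled-model use case the driver scripts already build this line themselves, so this would only matter for users who want Cylc to do it for them.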
We don't need to somehow run the whole cylc job script three times (is that what you're suggesting?). We just need to allow the duplicate directives to get through to the job script in the right way.
That's what I understood from Oliver's comment. We would just send the heterogeneous job to Slurm, monitor the top-level PID, and fail if that PID failed.
The suite runs fine even if one of the heterogeneous "steps" fails.
That means that the Slurm job that the suite is monitoring didn't fail, even if one of the slurm heterogeneous steps failed? If so, that sounds OK to me. There must be ways for a user to change that behaviour in Slurm, I think.
The suite runs fine even if one of the heterogeneous "steps" fails.
That means that the Slurm job that the suite is monitoring didn't fail, even if one of the slurm heterogeneous steps failed?
No, sorry, I just meant the suite responds as normal to a failed task, given that one hetjob failure results in them all getting killed by Slurm (evidently).
The only(?) problem with my approach, perhaps, is the user has to write the srun command line in the job scripting rather than having Cylc do it for them.
The issue has arisen from running coupled UM jobs where the srun command is created in the driver scripts, so for us this is perfect. Thanks.
Thanks for the quick workaround, I can confirm that this enables me to submit heterogeneous jobs and run a MPMD coupled model.
Slurm 17.11 introduced the concept of heterogeneous jobs (a.k.a. "hetjobs")
The associated directives, options, and variables originally used the name "packjob". Slurm 20-02-0-1 did (or completed?) a conversion to "hetjob".
From Slurm-20-02-0-1 RELEASE_NOTES:
-- The inconsistent terminology and environment variable naming for
Heterogeneous Job ("HetJob") support has been tidied up.
-- The correct term for these jobs are "HetJobs", references to "PackJob"
have been corrected.
-- The correct term for the separate constituent jobs are "components",
references to "packs" have been corrected.
-- Output from 'scontrol show job' and others has been made consistent
with this.
-- Relevant environment variables are all of the form SLURM_HET_JOB_*.
(Old forms are still supported for this release, but may be removed
in the future.)
-- slurm.spec - override "hardening" linker flags to ensure RHEL8 builds
in a usable manner.
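Going by the release notes quoted above, a handler could pick the separator keyword from the Slurm version. This is only a sketch under assumptions: `hetjob_separator` is a made-up name, the 20.02 cutover is inferred from those notes rather than tested, and how the version string is obtained from the cluster is left out. Since the notes say the old forms are still supported in that release, "packjob" may well keep working for a while on newer versions too.

```python
def hetjob_separator(slurm_version):
    """Return the directive keyword that separates hetjob components.

    `slurm_version` is a dotted version string such as "20.02.0".
    Assumes the rename to "hetjob" applies from 20.02 onwards.
    """
    major, minor = (int(part) for part in slurm_version.split(".")[:2])
    return "hetjob" if (major, minor) >= (20, 2) else "packjob"


for version in ("17.11.2", "19.05.5", "20.02.0"):
    print(version, "#SBATCH " + hetjob_separator(version))
```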